    1. Reviewer #3 (Public review):

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly instructed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).
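
      The normative benchmark implied by this task description can be sketched as a simple forward filter. The sketch below is illustrative only and rests on assumptions not stated in the review: every block starts in the red regime, a shift to the blue regime can occur before each draw with probability q, the blue regime is absorbing, and each urn yields its majority colour with probability d. The function and argument names are hypothetical.

```python
def regime_shift_posterior(signals, q, d):
    """Posterior probability that the regime has shifted, after each draw.

    signals: iterable of "R"/"B" draws (assumed encoding)
    q: per-draw prior probability of a shift (assumed absorbing)
    d: probability that an urn yields its majority colour (d > 0.5)
    """
    p_shift = 0.0  # assumption: the block starts in the red regime
    posteriors = []
    for s in signals:
        # Prior for this draw: already shifted, or shifts just before it.
        prior = p_shift + (1.0 - p_shift) * q
        like_blue = d if s == "B" else 1.0 - d  # likelihood under the blue regime
        like_red = 1.0 - d if s == "B" else d   # likelihood under the red regime
        p_shift = prior * like_blue / (prior * like_blue + (1.0 - prior) * like_red)
        posteriors.append(p_shift)
    return posteriors
```

      For instance, `regime_shift_posterior("RRBBB", q=0.05, d=0.9)` returns one posterior per draw; under these assumptions a blue ball raises the posterior relative to the pre-draw prior and a red ball lowers it.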

      The key behavioural finding is that participants overestimate the prior probability of regime change when it is low and underestimate it when it is high; likewise, participants overestimate the strength of evidence when it is low and underestimate it when it is high. In other words, participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual-difference effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      Strengths

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Weaknesses

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      In the response to this comment the authors have pointed out their own previous work showing that system neglect can occur even when numerical probabilities are not used. This is reassuring but there remains a large body of classic work showing that observers do struggle with conditional probabilities of the type presented in the task.

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).
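
      The degree of compression in the values quoted above can be summarised with a least-squares line through the three (ground-truth, subjective) pairs. This is only a descriptive sketch using the numbers given in this comment; the linear form and the variable names are assumptions, and the fit by itself does not discriminate between a hyper-prior account and the system-neglect model.

```python
true_q = [0.01, 0.05, 0.10]     # instructed prior probabilities of a shift
subj_q = [0.037, 0.052, 0.069]  # subjective values quoted in the comment

n = len(true_q)
mean_x = sum(true_q) / n
mean_y = sum(subj_q) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(true_q, subj_q))
sxx = sum((x - mean_x) ** 2 for x in true_q)

slope = sxy / sxx                        # 1 = no compression, 0 = full compression
intercept = mean_y - slope * mean_x
fixed_point = intercept / (1.0 - slope)  # value that the fitted line maps onto itself

print(f"slope={slope:.3f}, fixed point={fixed_point:.3f}")
```

      With these three points the slope comes out at roughly 0.35 and the fixed point close to 0.05, which is the numerical sense in which the subjective values look pulled towards the experiment-wise mean.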

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers, resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020).

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, Pt always increases with sample number (as by the time of later samples, there have been more opportunities for a regime shift). To control for this the authors include, in a supplementary analysis, an 'intertemporal prior.' I would have preferred to see the results of this better-controlled analysis presented in the main figure. From the tables in the SI it is very difficult to tell how the results change with the inclusion of the control regressors.

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiments are useful controls. For example, in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

    2. Author response:

      The following is the authors’ response to the current reviews

      eLife Assessment

      This study offers valuable insights into how humans detect and adapt to regime shifts, highlighting dissociable contributions of the frontoparietal network and ventromedial prefrontal cortex to sensitivity to signal diagnosticity and transition probabilities. The combination of an innovative instructed-probability task, Bayesian behavioural modeling, and model-based fMRI analyses provides a solid foundation for the main claims; however, major interpretational limitations remain, particularly a potential confound between posterior switch probability and time in the neuroimaging results. At the behavioural level, reliance on explicitly instructed conditional probabilities leaves open alternative explanations that complicate attribution to a single computational mechanism, such that clearer disambiguation between competing accounts and stronger control of temporal and representational confounds would further strengthen the evidence.

      Thank you. In this revision, we will focus on addressing Reviewer 3’s concern about the potential confound between posterior probability and time in the neuroimaging results. First, we will present whole-brain results of subjects’ probability estimates (their subjective posterior probability of switch) after controlling for the effect of time on probability of switch (the intertemporal prior). Second, we will compare the effect of probability estimates (Pt) on vmPFC and ventral striatum activity (which we found to correlate with Pt) with and without including the intertemporal prior in the GLM. Third, to address Reviewer 3’s comment that vmPFC and ventral striatum cannot be located in the Tables of Activation in the supplement, we will add slice-by-slice images of the whole-brain results on Pt to the Supplemental Information, in addition to the Tables of Activation.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experiences. The authors discuss these differences comprehensively.

      The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well. The model is comprehensively validated.

      The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      Weaknesses:

      The authors have adequately addressed my prior concerns.

      Thank you for reviewing our paper and providing constructive comments that helped us improve our paper.

      Reviewer #3 (Public review):

      Thank you again for reviewing the manuscript. In this revision, we will focus on addressing your concern about the potential confound between posterior probability and time in the neuroimaging results. First, we will present whole-brain results of subjects’ probability estimates (Pt, their subjective posterior probability of switch) after controlling for the effect of time on probability of switch (the intertemporal prior). Second, we will compare the effect of probability estimates (Pt) on vmPFC and ventral striatum activity (which we found to correlate with Pt) with and without including the intertemporal prior in the GLM. These results will be summarized in a new figure (Figure 4).

      Finally, because you were not able to locate vmPFC and ventral striatum in the Tables of Activation, we will add slice-by-slice images of the whole-brain results on Pt to the supplement, in addition to the Tables of Activation.

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly instructed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).

      The key behavioural finding is that participants overestimate the prior probability of regime change when it is low and underestimate it when it is high; likewise, participants overestimate the strength of evidence when it is low and underestimate it when it is high. In other words, participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual-difference effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      Strengths

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Weaknesses

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      In the response to this comment the authors have pointed out their own previous work showing that system neglect can occur even when numerical probabilities are not used. This is reassuring but there remains a large body of classic work showing that observers do struggle with conditional probabilities of the type presented in the task.

      Thank you. Yes, people do struggle with conditional probabilities in many studies. However, as our previous work suggested (Massey and Wu, 2005), system neglect was likely not due to response mode (entering probability estimates, making binary predictions, etc.).

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).

      We thank the reviewer for this comment. We do not disagree that there are alternative models that can describe the over- and underreactions seen in the dataset. However, we do wish to point out that since we began with the normative Bayesian model, the natural progression when the normative model fails to capture the data is to modify the starting model. It is in this context that we developed the system-neglect model. It was a simple extension (a parameterized version) of the normative Bayesian model.

      Regarding the hyperprior idea, even if the participants have a hyperprior, there has to be some function that describes/implements attraction to the mean. Having a hyperprior itself does not imply attraction to this hyperprior. We therefore were not sure why the hyperprior itself can produce attraction to the mean.

      We did look further into the possibility of attraction to the mean. First, as suggested by the reviewer, we looked into another dataset with a different mean ground-truth value. In Massey and Wu (2005), the transition probabilities were [0.02 0.05 0.1 0.2], which differ from those in the current study [0.01 0.05 0.1], and over- and underreactions were found there as well. Second, we reason that for the attraction-to-the-mean idea to work, subjects need to know the mean of the system parameters. This would take time to develop because we did not tell subjects about the mean. If the behavior were caused by attraction to the mean, it would differ between the early stage of the experiment, when subjects had little idea about the mean, and the late stage, when they knew about the mean. We will further analyze and compare participants’ data at the beginning of the experiment with data at the end of the experiment.

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers, resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020).

      We thank the reviewer for pointing out these potential explanations. Again, we do not disagree that any model in which participants don’t fully use the numerical information they were given would produce system neglect. It is hard to separate ‘not fully using numerical information’ from ‘lack of sensitivity to the numerical information’. We will respond in more detail to the four example reasons later.

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      Again, we do not disagree with the reviewer on the modeling statement. However, we also wish to point out that the system-neglect model we had is a simple extension of the normative Bayesian model. Had we gone to a non-Bayesian framework, we would have faced the criticism of why we simply do not consider a simple extension of the starting model. In response, we will add a section in Discussion summarizing our exchange on this matter.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, Pt always increases with sample number (as by the time of later samples, there have been more opportunities for a regime shift). To control for this the authors include, in a supplementary analysis, an 'intertemporal prior.' I would have preferred to see the results of this better-controlled analysis presented in the main figure. From the tables in the SI it is very difficult to tell how the results change with the inclusion of the control regressors.

      Thank you. In response, we will add a new figure, now Figure 4, showing the results of Pt and delta Pt from GLM-2 where we added the intertemporal prior as a regressor to control for temporal confounds. We compared Pt and delta Pt results in vmPFC and ventral striatum between GLM-1 and GLM-2. We also will show the results of intertemporal prior on vmPFC and ventral striatum under GLM-2.
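
      The temporal confound discussed here can be illustrated with a short sketch of an intertemporal prior, that is, the probability that a shift has occurred by a given period when the signals are ignored. The closed form below assumes a constant per-period shift probability q and an absorbing shifted regime; these assumptions and the function name are illustrative and are not taken from the manuscript's GLM specification.

```python
def intertemporal_prior(q, n_periods=10):
    """Probability that a shift has occurred by each period, ignoring the signals.

    Assumes a constant per-period shift probability q and no shift back.
    """
    return [1.0 - (1.0 - q) ** t for t in range(1, n_periods + 1)]


for q in (0.01, 0.05, 0.10):
    prior = intertemporal_prior(q)
    print(q, [round(p, 3) for p in prior])
```

      Under these assumptions the prior rises monotonically with period number for every q, which is why a regressor of this kind is a natural control for effects attributed to Pt.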

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiments are useful controls. For example, in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      We thank the reviewer for this comment. On the one hand, the effect of Pt we see in brain activity could simply be due to motor confounds, and the purpose of Experiment 3 was to control for them. Our question was: if subjects saw a similar visual layout and were just instructed to press buttons to indicate two-digit numbers, would we observe the vmPFC, ventral striatum, and frontoparietal network effects that we saw in the main experiment (Experiment 1)?

      On the other hand, the effect of Pt could simply reflect estimates of the probability that the current regime is the blue regime, and therefore not be specific to change detection. In other words, it is possible that the regions we identified were generally associated with probability estimation rather than with probability estimates of change. In Experiment 2, we tested that idea, namely whether what we found about Pt was unique to change detection: subjects estimated the probability that the current regime is the blue regime (just as they did in Experiment 1), except that there were no regime shifts involved.

      To make the purpose of the two control experiments clearer, we updated the paragraph describing the control experiments on page 9:

      “To establish the neural representations for regime-shift estimation, we performed three fMRI experiments (30 subjects for each experiment, 90 subjects in total). Experiment 1 was the main experiment, while Experiments 2 and 3 were control experiments that ruled out two important confounds (Fig. 1E). The control experiments were designed to clarify whether any effect of subjects’ probability estimates of a regime shift, Pt, in brain activity can be uniquely attributed to change detection. Here we considered two major confounds that can contribute to the effect of Pt. First, since subjects in Experiment 1 made judgments about the probability that the current regime is the blue regime (which corresponded to probability of regime change), the effect of Pt did not particularly have to do with change detection. To address this issue, in Experiment 2 subjects made exactly the same judgments as in Experiment 1 except that the environments were stationary (no transition from one regime to another was possible), as in Edwards’ (1968) classic “bookbag-and-poker chip” studies. Subjects in both experiments had to estimate the probability that the current regime is the blue regime, but this estimation corresponded to the estimates of regime change only in Experiment 1. Therefore, activity that correlated with probability estimates in Experiment 1 but not in Experiment 2 can be uniquely attributed to representing regime-shift judgments. Second, the effect of Pt can be due to motor preparation and/or execution, as subjects in Experiment 1 entered two-digit numbers with button presses to indicate their probability estimates. To address this issue, in Experiment 3 subjects performed a task where they were presented with two-digit numbers and were instructed to enter the numbers with button presses. By comparing the fMRI results of these experiments, we were therefore able to establish the neural representations that can be uniquely attributed to the probability estimates of regime-shift.”

      To further make sure that the probability-estimate signals in Experiment 1 were not due to motor confounds, we implemented an action-handedness regressor in the GLM, as we described below on page 19:

      “Finally, we note that in GLM-1, we implemented an “action-handedness” regressor to directly address the motor-confound issue, that higher probability estimates preferentially involved right-handed responses for entering higher digits. The action-handedness regressor was parametric, coding -1 if both finger presses involved the left hand (e.g., a subject pressed “23” as her probability estimate when seeing a signal), 0 if using one left finger and one right finger (e.g., “75”), and 1 if both finger presses involved the right hand (e.g., “90”). Taken together, these results ruled out motor confounds and suggested that vmPFC and ventral striatum represent subjects’ probability estimates of change (regime shifts) and belief revision.”
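
      The coding scheme for the action-handedness regressor can be sketched as follows. The response does not state which digits were assigned to which hand; the mapping below (digits 1 to 5 on the left hand, 6 to 9 and 0 on the right) is an assumption chosen only because it reproduces the three examples quoted above, and the function name is hypothetical.

```python
# Assumed key layout: not specified in the response, inferred from the examples.
LEFT_HAND_DIGITS = set("12345")


def action_handedness(estimate):
    """Return -1 (both presses left), 0 (one left, one right), or 1 (both right)."""
    if len(estimate) != 2 or not estimate.isdigit():
        raise ValueError("expected a two-digit string such as '23'")
    right_presses = sum(digit not in LEFT_HAND_DIGITS for digit in estimate)
    return right_presses - 1


for example in ("23", "75", "90"):
    print(example, action_handedness(example))
```

      Entered as a parametric regressor, a code of this kind lets variance explained by which hand was used be separated from variance explained by the probability estimate.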

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

      We thank the reviewer for pushing us to highlight the key contributions. In response, we added a paragraph at the beginning of the Discussion to better highlight our contributions:

      “In this study, we investigated how humans detect changes in the environments and the neural mechanisms that contribute to how we might under- and overreact in our judgments. Combining a novel behavioral paradigm with computational modeling and fMRI, we discovered that sensitivity to environmental parameters that directly impact change detection is a key mechanism for under- and overreactions. This mechanism is implemented by distinct brain networks in the frontal and parietal cortices and in accordance with the computational roles they played in change detection. By introducing the framework in system neglect and providing evidence for its neural implementations, this study offered both theoretical and empirical insights into how systematic judgment biases arise in dynamic environments.”

      Recommendations for the authors:

      Reviewer #3 (Recommendations for the authors):

      Thank you for pointing out the inclusion of the intertemporal prior in glm2, this seems like an important control that would address my criticism. Why not present this better-controlled analysis in the main figure, rather than the results for glm1 which has no effective control of the increasing posterior probability of a reversal with time?

      Thank you for this suggestion. We added a new figure (Figure 4) that shows results from GLM-2. In this new figure, we show whole-brain results on Pt and delta Pt, and ROI results of vmPFC and ventral striatum on Pt, delta Pt, and the intertemporal prior.

      The reason we kept results from GLM-1 (Figure 3) was primarily that we wanted to compare the effect of Pt between experiments under an identical GLM. In other words, the regressors in GLM-1 were identical across all 3 experiments. In Experiments 1 and 2, Pt and delta Pt were respectively probability estimates and belief updates that the current regime was the Blue regime. In Experiment 3, Pt and delta Pt were simply the number subjects were instructed to press (Pt) and the change in number between successive periods (delta Pt).

      As a further point I could not navigate the tables of fMRI activations in SI and recommend replacing or supplementing these with images. For example I cannot actually find a vmPFC or ventral striatum cluster listed for the effect of Pt in GLM1 (version in table S1), which I thought were the main results? Beyond that, comparing how much weaker (or not) those results are when additional confound regressors are included in GLM2 seems impossible.

      The vmPFC and ventral striatum were part of the cluster labeled as Central Opercular cortex. In response, we will provide the coordinates of the local maxima within the cluster. We will also add slice-by-slice images showing the effect of Pt.


      The following is the authors’ response to the original reviews

      eLife Assessment

      This study offers valuable insights into how humans detect and adapt to regime shifts, highlighting distinct contributions of the frontoparietal network and ventromedial prefrontal cortex to sensitivity to signal diagnosticity and transition probabilities. The combination of an innovative task design, behavioral modeling, and model-based fMRI analyses provides a solid foundation for the conclusions; however, the neuroimaging results have several limitations, particularly a potential confound between the posterior probability of a switch and the passage of time that may not be fully controlled by including trial number as a regressor. The control experiments intended to address this issue also appear conceptually inconsistent and, at the behavioral level, while informing participants of conditional probabilities rather than requiring learning is theoretically elegant, such information is difficult to apply accurately, as shown by well-documented challenges with conditional reasoning and base-rate neglect. Expressing these probabilities as natural frequencies rather than percentages may have improved comprehension. Overall, the study advances understanding of belief updating under uncertainty but would benefit from more intuitive probabilistic framing and stronger control of temporal confounds in future work.

      We thank the editors for the assessment and we appreciate your efforts in reviewing the paper. The editors added several limitations in the assessment based on the new reviewer 3 in this round, which we would like to clarify below.

      With regard to temporal confounds, we clarified in the main text and response to Reviewer 3 that we had already addressed the potential confound between posterior probability of a switch and passage of time in GLM-2 with the inclusion of intertemporal prior. After adding intertemporal prior in the GLM, we still observed the same fMRI results on probability estimates. In addition, we did two other robustness checks, which we mentioned in the manuscript.

      With regard to response mode (probability estimation rather than choice or indicating natural frequencies), we wish to point out that previous research by Massey and Wu (2005), on which the current study was based, addressed the concern that participants might show system-neglect tendencies due to the mode of information delivery, namely indicating beliefs by reporting probability estimates rather than through choice or another response mode. Massey and Wu (2005, Study 3) found the same biases when participants performed a choice task that did not require them to indicate probability estimates.

      With regard to the control experiments, they were in fact not intended to address the confound between posterior probability and passage of time. Rather, they aimed to address whether the neural findings were unique to change detection (Experiment 2) and to address visual and motor confounds (Experiment 3). These aims and the results of the control experiments were described on pages 18-19.

      We also wish to highlight that we had performed detailed model comparisons after reviewer 2’s suggestions. Although reviewer 2 was unable to re-review the manuscript, we believe this provides insight into the literature on change detection. See “Incorporating signal dependency into system-neglect model led to better models for regime-shift detection” (p.27-30). The model comparison showed that system-neglect models that incorporate signal dependency describe participants’ probability estimates better than the original system-neglect model. This suggests that people respond to change-consistent and change-inconsistent signals differently when judging whether the regime had changed. This was not reported in previous behavioral studies and was largely inspired by the neural finding on signal dependency in the frontoparietal cortex. It indicates that neural findings can provide novel insights into computational modeling of behavior.

      To better highlight and summarize our key contributions, we added a paragraph at the beginning of Discussion:

      “In this study, we investigated how humans detect changes in the environments and the neural mechanisms that contribute to how we might under- and overreact in our judgments. Combining a novel behavioral paradigm with computational modeling and fMRI, we discovered that sensitivity to environmental parameters that directly impact change detection is a key mechanism for under- and overreactions. This mechanism is implemented by distinct brain networks in the frontal and parietal cortices and in accordance with the computational roles they played in change detection. By introducing the framework in system neglect and providing evidence for its neural implementations, this study offered both theoretical and empirical insights into how systematic judgment biases arise in dynamic environments.”    

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      - The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experiences. The authors discuss these differences comprehensively.

      - The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well. The model is comprehensively validated.

      - The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      We thank the reviewer for the comments.

      Weaknesses:

      The authors have adequately addressed most of my prior concerns.

      We thank the reviewer for recognizing our effort in addressing your concerns.

      My only remaining comment concerns the z-test of the correlations. I agree with the non-parametric test based on bootstrapping at the subject level, providing evidence for significant differences in correlations within the left IFG and IPS.

      However, the parametric test seems inadequate to me. The equation presented is described as the Fisher z-test, but the numerator uses the raw correlation coefficients (r) rather than the Fisher-transformed values (z). To my understanding, the subtraction should involve the Fisher z-scores, not the raw correlations.

      More importantly, the Fisher z-test in its standard form assumes that the correlations come from independent samples, as reflected in the denominator (which uses the n of each independent sample). However, in my opinion, the two correlations are not independent but computed within-subject. In such cases, parametric tests should take into account the dependency. I believe one appropriate method for the current case (correlated correlation coefficients sharing a variable [behavioral slope]) is explained here:

      Meng, X.-l., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111(1), 172-175. https://doi.org/10.1037/0033-2909.111.1.172

      It should be implemented here:

      Diedenhofen B, Musch J (2015) cocor: A Comprehensive Solution for the Statistical Comparison of Correlations. PLoS ONE 10(4): e0121945. https://doi.org/10.1371/journal.pone.0121945

      My recommendation is to verify whether my assumptions hold, and if so, perform a test that takes correlated correlations into account. Or, to focus exclusively on the non-parametric test.

      In any case, I recommend a short discussion of these findings and how the authors interpret that some of the differences in correlations are not significant.

      Thank you for the careful check. Yes, this was indeed a mistake on our part. We also agree that the two correlations are not independent. Therefore, we modified the test to account for dependent correlations, following Meng et al. (1992) as suggested by the reviewer. We updated the Methods section on p.56-57:

      “In the parametric test, we adopted the approach of Meng et al. (1992) to statistically compare the two correlation coefficients. This approach specifically tests differences between dependent correlation coefficients according to the following equation

      $$Z = (z_{r_1} - z_{r_2}) \sqrt{\frac{N - 3}{2 (1 - r_x) h}}$$

      where N is the number of subjects, z<sub>ri</sub> is the Fisher z-transformed value of r<sub>i</sub> (r<sub>1</sub> = r<sub>blue</sub> and r<sub>2</sub> = r<sub>red</sub>), and r<sub>x</sub> is the correlation between the neural sensitivity at change-consistent signals and change-inconsistent signals. The computation of h is based on the following equations

      $$h = \frac{1 - f \overline{r^2}}{1 - \overline{r^2}}, \qquad f = \frac{1 - r_x}{2 (1 - \overline{r^2})}, \qquad \overline{r^2} = \frac{r_1^2 + r_2^2}{2}$$

      where $\overline{r^2}$ is the mean of the $r_i^2$, and f should be set to 1 if f > 1.”
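      As an illustration only, the Meng et al. (1992) test for two dependent correlations that share one variable can be sketched in a few lines of Python. The function name `meng_z_test` and its argument names are hypothetical and are not taken from the manuscript or from the cocor package; the one-tailed p-value assumes the directional hypothesis r1 > r2.

```python
import math


def meng_z_test(r1, r2, rx, n):
    """Meng, Rosenthal & Rubin (1992) z test for two dependent correlations.

    r1, r2: correlations of two predictors with the same shared variable
    rx:     correlation between the two predictors (must be < 1)
    n:      number of subjects
    Returns (z, one-tailed p for the hypothesis r1 > r2).
    """
    # Fisher z-transform of each correlation
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # Mean of the squared correlations
    r2_mean = (r1 ** 2 + r2 ** 2) / 2
    # f is capped at 1, as prescribed by Meng et al.
    f = min((1 - rx) / (2 * (1 - r2_mean)), 1.0)
    h = (1 - f * r2_mean) / (1 - r2_mean)
    z = (z1 - z2) * math.sqrt((n - 3) / (2 * (1 - rx) * h))
    # Upper-tail probability of the standard normal distribution
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_one_tailed


# Made-up inputs, purely to show the call; these are not values from the study.
z, p = meng_z_test(r1=0.6, r2=0.2, rx=0.4, n=30)
print(f"z = {z:.4f}, one-tailed p = {p:.4f}")
```

      Because the denominator contains (1 - r<sub>x</sub>), the test becomes more sensitive as the two predictors become more strongly correlated, which is the dependency that the independent-samples Fisher test ignores.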

      We updated the Results section on p.29:

      “Since these correlation coefficients were not independent, we compared them using the test developed in Meng et al. (1992) (see Methods). We found that for two of the five ROIs in the frontoparietal network, namely the left IFG and left IPS, the difference in correlation was significant (one-tailed z test; left IFG: z = 1.8908, p = 0.0293; left IPS: z = 2.2584, p = 0.0049). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: z = 0.9522, p = 0.1705; right IFG: z = 0.9860, p = 0.1621; right IPS: z = 1.4833, p = 0.0690).”

      We added a Discussion on these results on p.41:

      “Interestingly, such sensitivity to signal diagnosticity was only present in the frontoparietal network when participants encountered change-consistent signals. However, while most brain areas within this network responded in this fashion, only the left IPS and left IFG showed a significant difference in coding individual participants’ sensitivity to signal diagnosticity between change-consistent and change-inconsistent signals. Unlike the left IPS and left IFG, we observed in dmPFC a marginally significant correlation with behavioral sensitivity at change-inconsistent signals as well. Together, these results indicate that while different brain areas in the frontoparietal network responded similarly to change-consistent signals, there was a greater degree of heterogeneity in responding to change-inconsistent signals.”

      Reviewer #3 (Public review):

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly instructed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).

      The key behavioural finding is that participants over-estimate the prior probability of regime change when it is low, and under-estimate it when it is high; and participants over-estimate the strength of evidence when it is low and under-estimate it when it is high. In other words, participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual differences effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      We thank the reviewer for the overall descriptions of the manuscript.

      Strengths

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Thank you for these assessments.

      Weaknesses

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      We appreciate the reviewer’s concern on this issue. The concern was addressed in Massey and Wu (2005), where participants performed a choice task in which they were not asked to provide probability estimates (Study 3 in Massey and Wu, 2005). Instead, participants in Study 3 were asked to predict the color of the ball before seeing a signal. This was a more intuitive way for participants to indicate their beliefs about regime shift. The results from the choice task were identical to those found in the probability estimation task (Study 1 in Massey and Wu). We take this as evidence that the system-neglect behavior the participants showed was less likely to be due to the mode of information delivery.

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).

      We thank the reviewer for this comment. It is true that the system-neglect model is not entirely inconsistent with regression to the mean, regardless of whether the implementation has a hyper prior or not. In fact, our behavioral measure of sensitivity to transition probability and signal diagnosticity, which we termed the behavioral slope, is based on linear regression analysis. In general, the modeling approach in this paper is to start from a generative model that defines ideal performance and to consider modifying the generative model when systematic deviations of actual performance from the ideal are observed. In this approach, a generative Bayesian model with hyper priors would be more complex to begin with, and a regression to the mean idea by itself does not generate a priori predictions.
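      To make the compression at issue concrete, the ground-truth and subjective transition probabilities quoted by the reviewer can be related by an ordinary least-squares slope. This is only an illustrative sketch: the helper `ols_slope` is hypothetical, and the calculation is not the authors' behavioral-slope analysis, which is defined in the manuscript.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Values quoted in the reviewer's comment above
true_q = [0.01, 0.05, 0.10]
subjective_q = [0.037, 0.052, 0.069]

slope = ols_slope(true_q, subjective_q)
# A slope of 1 would mean the stated probabilities were used at face value;
# a slope below 1 means the three environments were distinguished less
# sharply than their stated values warrant.
print(f"slope = {slope:.3f}")
```

      A slope well below 1 is what both the system-neglect account and an attraction-to-the-mean account predict for these three points, which is why this summary statistic alone cannot separate them.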

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020)

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      Thank you for raising this point. The modeling principle we adopt is the following. We start from the normative model—the Bayesian model—that defined what normative behavior should look like. We compared participants’ behavior with the Bayesian model and found systematic deviations from it. To explain those systematic deviations, we considered modeling options within the confines of the same modeling framework. In other words, we considered a parameterized version of the Bayesian model, which is the system-neglect model and examined through model comparison the best modeling choice. This modeling approach is not uncommon in economics and psychology. For example, Kahneman and Tversky adopted this approach when proposing prospect theory, a modification of expected utility theory where expected utility theory can be seen as one specific model for how utility of an option should be computed.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, doesn't Pt always increase with sample number (as by the time of later samples, there have been more opportunities for a regime shift)? Unless this is completely linear, the effect won't be controlled by including trial number as a co-regressor (which was done).

      Thank you for raising this concern. Yes, Pt always increases with sample number regardless of evidence (seeing change-consistent or change-inconsistent signals). This is captured by the ‘intertemporal prior’ in the Bayesian model, which we included as a regressor in our GLM analysis (GLM-2), in addition to Pt. In short, GLM-1 had Pt and sample number. GLM-2 had Pt, intertemporal prior, and sample number, among other regressors. And we found that, in both GLM-1 and GLM-2, both vmPFC and ventral striatum correlated with Pt.

      To make this clearer, we updated the main text to further clarify this on p.18:

      “We examined the robustness of P<sub>t</sub> representations in these two regions in several follow-up analyses. First, we implemented a GLM (GLM-2 in Methods) that, in addition to P<sub>t</sub>, included various task-related variables contributing to P<sub>t</sub> as regressors (Fig. S7 in SI). Specifically, to account for the fact that the probability of regime change increased over time, we included the intertemporal prior as a regressor in GLM-2. The intertemporal prior is the natural logarithm of the odds in favor of regime shift in the t-th period, where q is transition probability and t = 1,…,10 is the period (see Eq. 1 in Methods). It describes normatively how the prior probability of change increased over time regardless of the signals (blue and red balls) the subjects saw during a trial. Including it along with P<sub>t</sub> would clarify whether any effect of P<sub>t</sub> can otherwise be attributed to the intertemporal prior. Second, we implemented a GLM that replaced P<sub>t</sub> with the log odds of P<sub>t</sub>, ln (P<sub>t</sub>/(1-P<sub>t</sub>)) (Fig. S8 in SI). Third, we implemented a GLM that examined P<sub>t</sub> separately on periods when change-consistent (blue balls) and change-inconsistent (red balls) signals appeared (Fig. S9 in SI). Each of these analyses showed the same pattern of correlations between P<sub>t</sub> and activation in vmPFC and ventral striatum, further establishing the robustness of the P<sub>t</sub> findings.”
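      The sketch below illustrates why the prior odds of a regime shift rise with period number irrespective of the signals. It assumes, as a simplification that may differ in detail from Eq. 1 of the manuscript, that a shift can occur with probability q in each period and that at most one shift occurs per trial, so that the probability of no shift by period t is (1 - q)^t. The function names are hypothetical; the three values of q are the transition probabilities mentioned in the review.

```python
import math


def intertemporal_prior(q, t):
    """Log prior odds that a regime shift has occurred by period t.

    Assumes a per-period shift probability q and at most one shift per
    trial, so P(no shift by t) = (1 - q) ** t.
    """
    p_no_shift = (1 - q) ** t
    return math.log((1 - p_no_shift) / p_no_shift)


def log_odds(p):
    """Log-odds transform of a probability, ln(p / (1 - p))."""
    return math.log(p / (1 - p))


for q in (0.01, 0.05, 0.10):
    priors = [round(intertemporal_prior(q, t), 2) for t in range(1, 11)]
    print(f"q = {q:.2f}: {priors}")
```

      Under these assumptions the prior rises monotonically but not linearly with t, which is the reviewer's reason for doubting that a linear trial-number regressor alone would absorb it.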

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiment are useful controls. For example in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      We thank the reviewer for this comment. On the one hand, the effect of Pt we see in brain activity could simply be due to motor confounds, and the purpose of Experiment 3 was to control for them. Our question was: if subjects saw a similar visual layout and were simply instructed to press buttons to indicate two-digit numbers, would we observe activity in the vmPFC, ventral striatum, and the frontoparietal network as we did in the main experiment (Experiment 1)?

      On the other hand, the effect of Pt could simply reflect probability estimates that the current regime is the blue regime, and therefore not be specific to change detection. In Experiment 2, we tested that idea, namely whether what we found about Pt was unique to change detection. In Experiment 2, subjects estimated the probability that the current regime is the blue regime (just as they did in Experiment 1) except that there were no regime shifts involved. In other words, it is possible that the regions we identified were generally associated with probability estimation and not specifically with probability estimates of change. We used Experiment 2 to examine whether this was true.

      To make the purpose of the two control experiments clearer, we updated the paragraph describing the control experiments on page 9:

      “To establish the neural representations for regime-shift estimation, we performed three fMRI experiments (n = 30 subjects for each experiment, 90 subjects in total). Experiment 1 was the main experiment, while Experiments 2 and 3 were control experiments that ruled out two important confounds (Fig. 1E). The control experiments were designed to clarify whether any effect of subjects’ probability estimates of a regime shift, P<sub>t</sub>, in brain activity can be uniquely attributed to change detection. Here we considered two major confounds that can contribute to the effect of P<sub>t</sub>. First, since subjects in Experiment 1 made judgments about the probability that the current regime is the blue regime (which corresponded to probability of regime change), the effect of P<sub>t</sub> did not particularly have to do with change detection. To address this issue, in Experiment 2 subjects made exactly the same judgments as in Experiment 1 except that the environments were stationary (no transition from one regime to another was possible), as in Edwards’ (1968) classic “bookbag-and-poker chip” studies. Subjects in both experiments had to estimate the probability that the current regime is the blue regime, but this estimation corresponded to the estimates of regime change only in Experiment 1. Therefore, activity that correlated with probability estimates in Experiment 1 but not in Experiment 2 can be uniquely attributed to representing regime-shift judgments. Second, the effect of P<sub>t</sub> can be due to motor preparation and/or execution, as subjects in Experiment 1 entered two-digit numbers with button presses to indicate their probability estimates. To address this issue, in Experiment 3 subjects performed a task where they were presented with two-digit numbers and were instructed to enter the numbers with button presses. By comparing the fMRI results of these experiments, we were therefore able to establish the neural representations that can be uniquely attributed to the probability estimates of regime-shift.”

      To further make sure that the probability-estimate signals in Experiment 1 were not due to motor confounds, we implemented an action-handedness regressor in the GLM, as we described below on page 19:

      “Finally, we note that in GLM-1, we implemented an “action-handedness” regressor to directly address the motor-confound issue, that higher probability estimates preferentially involved right-handed responses for entering higher digits. The action-handedness regressor was parametric, coding -1 if both finger presses involved the left hand (e.g., a subject pressed “23” as her probability estimate when seeing a signal), 0 if using one left finger and one right finger (e.g., “75”), and 1 if both finger presses involved the right hand (e.g., “90”). Taken together, these results ruled out motor confounds and suggested that vmPFC and ventral striatum represent subjects’ probability estimates of change (regime shifts) and belief revision.”
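      The parametric coding described in the quoted passage can be sketched as follows. The digit-to-hand mapping is an assumption made only for illustration (digits 1 to 5 on the left hand, 6 to 9 and 0 on the right hand); it is consistent with the three examples quoted above, but the actual button layout is not stated in this response. The names `LEFT_HAND_DIGITS` and `action_handedness` are hypothetical.

```python
# Hypothetical mapping: digits 1-5 pressed with the left hand,
# digits 6-9 and 0 with the right hand (an assumption, not from the source).
LEFT_HAND_DIGITS = set("12345")


def action_handedness(estimate: str) -> int:
    """Code a two-digit response as -1 (both left), 0 (mixed), or 1 (both right)."""
    if len(estimate) != 2 or not estimate.isdigit():
        raise ValueError("expected a two-digit string such as '23'")
    right_presses = sum(digit not in LEFT_HAND_DIGITS for digit in estimate)
    # 0 right-hand presses -> -1, 1 -> 0, 2 -> 1
    return right_presses - 1


# The three examples given in the quoted passage
for response in ("23", "75", "90"):
    print(response, action_handedness(response))
```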

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

      We thank the reviewer for pushing us to highlight the key contributions. In response, we added a paragraph at the beginning of Discussion to better highlight our contributions:

      “In this study, we investigated how humans detect changes in the environments and the neural mechanisms that contribute to how we might under- and overreact in our judgments. Combining a novel behavioral paradigm with computational modeling and fMRI, we discovered that sensitivity to environmental parameters that directly impact change detection is a key mechanism for under- and overreactions. This mechanism is implemented by distinct brain networks in the frontal and parietal cortices and in accordance with the computational roles they played in change detection. By introducing the framework in system neglect and providing evidence for its neural implementations, this study offered both theoretical and empirical insights into how systematic judgment biases arise in dynamic environments.”

      Recommendations for the authors:

      Reviewer #3 (Recommendations for the authors):

      Many of the figures are too tiny - the writing is very small, as are the pictures of brains. I'd suggest adjusting these so they will be readable without enlarging.

      Thank you. We apologize for the poor readability of the figures. We have enlarged the figures (Fig. 5 in particular) and their font size to make them more readable.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      In our manuscript, we describe a role for the nuclear mRNA export factor UAP56 (a helicase) during metamorphic dendrite and presynapse pruning in flies. We characterize a UAP56 ATPase mutant and find that it rescues the pruning defects of a uap56 mutant. We identify the actin severing enzyme Mical as a potentially crucial UAP56 mRNA target during dendrite pruning and show alterations at both the mRNA and protein level. Finally, loss of UAP56 also causes presynapse pruning defects with actin abnormalities. Indeed, the actin disassembly factor cofilin is required for pruning specifically at the presynapse.

      We thank the reviewers for their constructive comments, which we tried to address experimentally as much as possible. To summarize briefly, while all reviewers saw the results as interesting (e.g., Reviewer 3's significance assessment: "Understanding how post-transcriptional events are linked to key functions in neurons is important and would be of interest to a broad audience") and generally methodologically strong, they thought that our conclusions regarding the potential specificity of UAP56 for Mical mRNA were not fully supported by the data. To address this criticism, we added more RNAi analyses of other mRNA export factors and rephrased our conclusions towards a more careful interpretation, i.e., we now state that the pruning process is particularly sensitive to loss of UAP56. In addition, reviewer 1 had technical comments regarding some of our protein and mRNA analyses. We added more explanations and an additional control for the MS2/MCP system. Reviewers 2 and 3 wanted to see a deeper characterization of the ATPase mutant provided. We generated an additional UAP56 mutant transgene, improved our analyses of UAP56 localization, and added a biochemical control experiment. We hope that our revisions make our manuscript suitable for publication.

      1. Point-by-point description of the revisions

      Comments by reviewer 1.

      Major comments

      1.

      For Figure 4, the MS2/MCP system is not quantitative. Using this technique, it is impossible to determine how many RNAs are located in each "dot". Each of these dots looks quite large and likely corresponds to some phase-separated RNP complex where multiple RNAs are stored and/or transported. Thus, these data do not support the conclusion that Mical mRNA levels are reduced upon UAP56 knockdown. A good quantitative microscopic assay would be something like smFISH. Additionally, the localization of Mical mRNA dots to dendrites is not convincing, as it looks like the background is generally brighter in regions where there are dendritic swellings.

      Our response

      We indeed found evidence in the literature that mRNPs labeled with the MS2/MCP or similar systems form condensates (Smith et al., JCB 2015). Unfortunately, smFISH is not established for this developmental stage and would likely be difficult due to the presence of the pupal case. To address whether the Mical mRNPs in control and UAP56 KD neurons are comparable, we characterized the MCP dots in the respective neurons in more detail and found that their sizes did not differ significantly between control and UAP56 KD neurons. To facilitate interpretability, we also increased the individual panel sizes and included larger panels that only show the red (MCP::RFP) channel. We think these changes improved the figure. Thanks for the insight.

      Changes introduced: Figure 5 (former Fig. 4): Increased panel size for MCP::RFP images, left out GFP marker for better visibility. Added new analysis of MCP::RFP dot size (new Fig. 5 I).

      2.

      Alternatively, levels of Mical mRNA could be verified by qPCR in the larval brain following pan-neuronal UAP56 knockdown or in FACS-sorted fluorescently labeled da sensory neurons. Protein levels could be analyzed using a similar approach.

      Our response

      We thank the reviewer for this comment. Unfortunately, these experiments are not doable as neuron-wide UAP56 KD is lethal (see Flybase entry for UAP56). From our own experience, FACS-sorting of c4da neurons would be extremely difficult as the GFP marker fluorescence intensity of UAP56 KD neurons is weak - this would likely result in preferential sorting of subsets of neurons with weaker RNAi effects. In addition, FACS-sorting whole neurons would not discriminate between nuclear and cytoplasmic mRNA.

      The established way of measuring protein content in the Drosophila PNS system is immunofluorescence with strong internal controls. In our case, we also measured Mical fluorescence intensity of neighboring c1da neurons that do not express the RNAi and show expression levels as relative intensities compared to these internal controls. This procedure rules out the influence of staining variation between samples and is used by other labs as well.

      1.

      In Figure 5, the authors state that Mical expression could not be detected at 0 h APF. The data presented in Fig. 5C, D suggest the opposite as there clearly is some expression. Moreover, the data shown in Fig. 5D looks significantly brighter than the Orco dsRNA control and appears to localize to some type of cytoplasmic granule. So the expression of Mical does not look normal.

      Our response

      We thank the reviewer for this comment. In the original image in Fig. 5 C, the c4da neuron overlaps with the dendrite from a neighboring PNS neuron (likely c2da or c3da). The latter neuron shows strong Mical staining. We agree that this image is confusing and exchanged this image for another one from the same genotype.

      Changes introduced: Figure 5 L (former Fig. 5 C): Exchanged panel for image without overlap from other neuron.

      1.

      Sufficient data are not presented to conclude any specificity in mRNA export pathways. Data is presented for one export protein (UAP56) and one putative target (Mical). To adequately assess this, the authors would need to do RNA-seq in UAP56 mutants.

      Our response

      We thank the reviewer for this comment. To address this, we tested whether knockdown of three other mRNA export factors (NXF1, THO2, THOC5) causes dendrite pruning defects, which was not the case (new Fig. S1). While these data are consistent with specific mRNA export pathways, we agree that they are not proof. We therefore toned down our interpretation and removed the conclusion about specificity. Instead, we now use the more neutral term "increased sensibility (to loss of UAP56)".

      Changes introduced: Added new Figure S1: RNAi analyses of NXF1, THO2 and THOC5 in dendrite pruning. Introduced concluding sentence at the end of first Results paragraph: We conclude that c4da neuron dendrite pruning is particularly sensitive to loss of UAP56. (p. 6)

      1.

      In summary, better quantitative assays should be used in Figures 4 and 5 in order to draw conclusions about the expression levels of either mRNA or protein. In its current form, this study demonstrates the novel finding that UAP56 regulates dendrite and presynaptic pruning, potentially via regulation of the actin cytoskeleton. However, these data do not convincingly demonstrate that UAP56 controls these processes by regulating Mical expression, and definitely not by controlling export from the nucleus.

      Our response

      We hope that the changes we introduced above help clarify this.

      1.

      While there are clearly dendrites shown in Fig. 1C', the cell body is not readily identifiable. This makes it difficult to assess attachment and suggests that the neuron may be dying. This should be replaced with an image that shows the soma.

      Our response

      We thank the reviewer for this comment. Changes introduced: we replaced the picture in the panel with one where the cell body is more clearly visible.

      1.

      The level of knockdown in the UAP56 RNAi and P element insertion lines should be determined. It would be useful to mention the nature of the RNAi lines (long/short hairpin). Some must be long since Dcr has been co-expressed. Another issue raised by this is the potential for off-target effects. shRNAi lines would be preferable because these effects are minimized.

      Our response

      We thank the reviewer for this comment. Assessment of knockdown efficiency is a control to make sure the manipulations work the way they are intended to. As mRNA isolation from Drosophila PNS neurons is extremely difficult, RNAi or mutant phenotypes in this system are controlled by performing several independent manipulations of the same gene. In our case, we used two independent RNAi lines (both long hairpins from VDRC/Bloomington and an additional insertion of the VDRC line, see Table S1) as well as a mutant P element in a MARCM experiment, i. e., a total of three independent manipulations that all cause pruning defects. The VDRC RNAi lines do not have any predicted off-targets (this is not known for the Bloomington line). Had any of these manipulations not matched, we would have generated sgRNA lines for CRISPR to confirm.

      Minor comments:

      1.

      The authors should explain what EB1:GFP is marking when introduced in the text.


      Our response

      We thank the reviewer for this comment. Changes introduced: we explain the EB1::GFP assay in the corresponding Results section.

      1.

      The da neuron images throughout the figures could be a bit larger.

      Our response

      We thank the reviewer for this comment. Changes introduced: we changed the figure organization to be able to use larger panels:

      • the pruning analysis of the ATPase mutations (formerly Fig. 2) is now its own figure (Figure 3).

      • we increased the panel sizes of the MCP::RFP images (Figure 5 A - I, formerly Fig. 4).

      Reviewer #1 (Significance (Required)):

      Strengths:

      The methodology used to assess dendrite and presynaptic pruning is strong and the phenotypic analysis is conclusive.

      Our response

      We thank the reviewer for this comment.

      Weakness:

      The evidence demonstrating that UAP56 regulates the expression of Mical is unconvincing. Similarly, no data is presented to show that there is any specificity in mRNA export pathways. Thus, these major conclusions are not adequately supported by the data.

      Our response

      We hope the introduced changes address this comment.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      In this paper, the authors describe dendrite pruning defects in c4da neurons in the DEXD box ATPase UAP56 mutant or in neuronal RNAi knockdown. Overexpression of UAP56::GFP or UAP56::GFPE194Q without ATPase activity can rescue dendrite pruning defects in the UAP56 mutant. They further characterized the mis-localization of UAP56::GFPE194Q and its binding to nuclear export complexes. Both microtubules and the ubiquitin-proteasome system are intact in UAP56RNAi neurons. However, they suggest a specific effect on MICAL mRNA nuclear export, shown by using the MS2-MCP system, resulting in a delay of MICAL protein expression in pruned neurons. Furthermore, the authors show that UAP56 is also involved in presynaptic pruning of c4da neurons in the VNC and that Mical and actin are also required for actin disassembly in presynapses. They propose that UAP56 is required for dendrite and synapse pruning through actin regulation in Drosophila. Following are my comments.

      Major comments

      1.

      The result that UAP56::GFPE194Q rescues the mutant phenotype while the protein is largely mis-localized suggests a novel mechanism or, as the authors suggested, rescue from a combination of residual activities. The latter possibility requires further support, which is important to support the role of mRNA export in dendrite and pre-synapse pruning. One approach would be to examine whether other export components like REF1 and NXF1 show similar mutant phenotypes. Alternatively, depleting residual activity, for example by using null mutant alleles or combining more copies of RNAi transgenes, could help.

      Our response

      We thank the reviewer for this comment. We agree that the mislocalization phenotype is interesting and could inform further studies on the mechanism of UAP56. To further investigate this and to exclude that this could represent a gain-of-function due to the introduced mutation, we made and characterized a new additional transgene, UAP56::GFP E194A. This mutant shows largely the same phenotypes as E194Q, with enhanced interactions with Ref1 and partial mislocalization to the cytoplasm. In addition, we tested whether knockdown of THO2, THOC5 or NXF1 causes pruning defects (no).

      Changes introduced:

      • added new Figure S1: RNAi analyses of NXF1, THO2 and THOC5 in dendrite pruning.

      • made and characterized a new transgene UAP56 E194A (new Fig. 2 B, E, E', 3 C, C', E, F).

      1.

      The localization of UAP56::GFP (and E194Q) should be analyzed in more detail. It is not clear whether the images in Fig. 2A and 2B are from confocal single sections or merged multiple sections. The localization to the nuclear periphery of UAP56::GFP is not clear, and the existence of the E194Q derivative in both nucleus and cytosol (or whether there is still some peripheral enrichment) is not clear if the images are stacked.

      Our response

      We thank the reviewer for this comment. It is correct that the profiles in the old Figure 2 were from single confocal sections from the displayed images. As it was difficult to create good average profiles with data from multiple neurons, we now introduce an alternative quantification based on categories (nuclear versus dispersed) which includes data from several neurons for each genotype, including the new E194A transgene (new Fig 3 G). Upon further inspection, the increase at the nuclear periphery was not always visible and may have been a misinterpretation. We therefore removed this statement.

      Changes introduced:

      • added new quantitative analysis of UAP56 wt and E/A, E/Q mutant localization (new Fig 3 G).

      1.

      The Ub-VV-GFP is a new reagent, and its use to detect active proteasomal degradation is by the lack of GFP signals, which could be also due to the lack of expression. The use of Ub-QQ-GFP cannot confirm the expression of Ub-VV-GFP. The proteasomal subunit RPN7 has been shown to be a prominent component in the dendrite pruning pathway (Development 149, dev200536). Immunostaining using RPN7 antibodies to measure the RPN expression level could be a direct way to address the issue whether the proteasomal pathway is affected or not.

      Our response

      We thank the reviewer for this comment. We agree that it is wise to not only introduce a positive control for the Ub-VV-GFP sensor (the VCP dominant-negative VCP QQ), but also an independent control. As mutants with defects in proteasomal degradation accumulate ubiquitinated proteins (see, e. g., Rumpf et al., Development 2011), we stained controls and UAP56 KD neurons with antibodies against ubiquitin and found that they had similar levels (new Fig. S3).

      Changes introduced:

      • added new ubiquitin immunofluorescence analysis (new Fig. S3).

      1.

      Using the MS2/MCP system to detect the export of MICAL mRNA is a nice approach to confirm the UAP56 activity; lack of UAP56 by RNAi knockdown delays the nuclear export of MS2-MICAL mRNA. The rescue experiment by UAS transgenes could not be performed due to the UAS gene dosage, as suggested by the authors. However, this MS2-MICAL system is also a good assay for the requirement of UAP56 ATPase activity (absence in the E194Q mutant) in this process. Could authors use the MARCM (thus reduce the use of UAS-RNAi transgene) for the rescue experiment? Also, the c4da neuronal marker UAS-CD8-GFP used in Fig4 could be replaced by marker gene directly fused to ppk promoter, which can save a copy of UAS transgene. The results from the rescue experiment would test the dependence of ATPase activity in nuclear export of MICAL mRNA.

      Our response

      We thank the reviewer for this comment. This is a great idea but unfortunately, this experiment was not feasible due to the (rare) constraints of Drosophila genetics. The MARCM system with rescue already occupies all available chromosomes (X: FLPase, 2nd: FRT, GAL80 + mutant, 3rd: GAL4 + rescue construct), and we would have needed to introduce three additional ones (MCP::RFP and two copies of unmarked genomic MICAL-MS2, all on the third chromosome) that would have needed to be introduced by recombination. Any Drosophilist will see that this is an extreme, likely undoable project :-(

      1.

      The UAP56 is also involved in presynaptic pruning through regulating actin assembly, and the authors suggest that Mical and cofilin are involved in the process. However, direct observation of lifeact::GFP in Mical or cofilin RNAi knockdown is important to support this conclusion.

      Our response

      We thank the reviewer for this comment. In response, we analyzed the lifeact::GFP patterns of control and cofilin knockdown neurons and found that loss of cofilin also leads to actin accumulation (new Fig. 7 I, J).

      Changes introduced:

      • new lifeact analysis (new Fig. 7 I, J).

      Minor comments:

      1.

      RNA localization is important for dendrite development in larval stages (Brechbiel JL, Gavis ER. Curr Biol. 20;18(10):745-750). Yet, the role of UAP56 is relatively specific and shown only in later-stage pruning. It would need thorough discussion.


      Our response

      We thank reviewer 2 for this comment. We added the following paragraph to the discussion: "UAP56 has also been shown to affect cytoplasmic mRNA localization in Drosophila oocytes (Meignin and Davis, 2008), opening up the possibility that nuclear mRNA export and cytoplasmic transport are linked. It remains to be seen whether this also applies to dendritic mRNA transport (Brechbiel and Gavis, 2008)." (p.13)

      1.

      Could authors elaborate on the possible upstream regulators that might be involved, as described in "alternatively, several cofilin upstream regulators have been described (Rust, 2015) which might also be involved in presynapse pruning and subject to UAP56 regulation" in Discussion?

      Our response

      We thank reviewer 2 for this comment. In the corresponding paragraph, we cite as example now that cofilin is regulated by Slingshot phosphatases and LIM kinase (p.14).

      1.

      In Discussion, the role of cofilin in pre- and post-synaptic processes was described. The role of Tsr/Cofilin in regulating actin behaviors in dendrite branching has been described in c3da and c4da neurons (Nithianandam and Chien, 2018 and other references) and should be included in Discussion.

      Our response

      We thank reviewer 2 for this comment. In response we tested whether cofilin is required for dendrite pruning and found that this, in contrast to Mical, is not the case (new Fig. S6). We cite the above paper in the corresponding results section (p.12).

      Changes introduced:

      • new cofilin dendrite pruning analysis (new Fig. S6).

      • added cofilin reference in Results.

      1.

      The authors speculate in Discussion that distinct actin structures have to be disassembled in dendrite and presynapse pruning. What the possible actin structures at both sites are could be elaborated.

      Our response

      We thank reviewer 2 for this comment. In response, we specify in the Discussion: "As Mical is more effective in disassembling bundled F-actin than cofilin (Rajan et al., 2023), it is interesting to speculate that such bundles are more prevalent in dendrites than at presynapses." (p14)

      Reviewer #2 (Significance (Required)):

      The study initiated a genetic screen for factors involved in a dendrite pruning system and reveals that nuclear mRNA export is an important event in this process. They further identified the mRNA of the actin disassembly factor MICAL as a candidate substrate in the exporting process. This is consistent with the previous finding that MICAL has to be transcribed and translated when pruning is initiated. As the presynapses of the model c4da neuron in this study are also pruned, the dependence on nuclear export and local actin remodeling was also shown. Thus, this study has added another layer of regulation (the nuclear mRNA export) in c4da neuronal pruning, which would be important for the audience interested in neuronal pruning. The study is limited by the confusing result regarding whether ATPase activity of the exporting factor is required.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary: In the manuscript by Frommeyer, Gigengack et al. entitled "The UAP56 mRNA Export Factor is Required for Dendrite and Synapse Pruning via Actin Regulation in Drosophila" the authors surveyed a number of RNA export/processing factors to identify any required for efficient dendrite and/or synapse pruning. They describe a requirement for a general poly(A) RNA export factor, UAP56, which functions as an RNA helicase. They also study links to aspects of actin regulation.

      Overall, while the results are interesting and the impact of loss of UAP56 on the pruning is intriguing, some of the data are overinterpreted as presented. The argument that UAP56 may be specific for the MICAL RNA is not sufficiently supported by the data presented. The two stories about poly(A) RNA export/processing and the actin regulation seem to not quite be connected by the data presented. The events are rather distal within the cell, making connecting the nuclear events with RNA to events at the dendrites/synapse challenging.

      Our response

      We thank reviewer 3 for this comment. To address this, we tested whether knockdown of three other mRNA export factors (NXF1, THO2, THOC5) causes dendrite pruning defects, which was not the case (new Fig. S1). While these data are consistent with specific mRNA export pathways, we agree that they are not proof. We therefore toned down our interpretation and removed the conclusion about specificity. Instead, we now use the more neutral term "increased sensibility (to loss of UAP56)".

      We agree that it is a little hard to tie cofilin to UAP56, as we currently have no evidence that cofilin levels are affected by loss of UAP56, even though both seem to affect lifeact::GFP in a similar way (new Fig. 7 I, J). However, a dysregulation of cofilin can also occur through dysregulation of upstream cofilin regulators such as Slingshot and LIM kinase, making such a relationship possible.

      Changes introduced:

      • added new Figure S1: RNAi analyses of NXF1, THO2 and THOC5 in dendrite pruning.

      • introduced concluding sentence at the end of first Results paragraph: "We conclude that c4da neuron dendrite pruning is particularly sensitive to loss of UAP56." (p. 6)

      • added new lifeact::GFP analysis of cofilin KD (new Fig. 7 I, J).

      • identify potential other targets from the literature in the Discussion (Slingshot phosphatases and LIM kinase, p.14).

      There are a number of specific statements that are not supported by references. See, for example, these sentences within the Introduction- "Dysregulation of pruning pathways has been linked to various neurological disorders such as autism spectrum disorders and schizophrenia. The cell biological mechanisms underlying pruning can be studied in Drosophila." The Drosophila sentence is followed by some specific examples that do include references. The authors also provide no reference to support the variant that they create in UAP56 (E194Q) and whether this is a previously characterized fly variant or based on an orthologous protein in a different system. If so, has the surprising mis-localization been reported in another system?

      Our response

      We thank reviewer 3 for this comment. We added the following references on pruning and disease:

      1) Howes, O.D., Onwordi, E.C., 2023. The synaptic hypothesis of schizophrenia version III: a master mechanism. Mol. Psychiatry 28, 1843-1856.

      2) Tang, G., et al., 2014. Loss of mTOR-dependent macroautophagy causes autistic-like synaptic pruning deficits. Neuron 83, 1131-43.

      To better introduce the E194 mutations, we explain the position of the DECD motif in the Walker B domain, give the corresponding residues in the human and yeast homologues and cite papers demonstrating the importance of this residue for ATPase activity:

      3) Saguez, C., et al., 2013. Mutational analysis of the yeast RNA helicase Sub2p reveals conserved domains required for growth, mRNA export, and genomic stability. RNA 19:1363-71.

      4) Shen, J., et al., 2007. Biochemical Characterization of the ATPase and Helicase Activity of UAP56, an Essential Pre-mRNA Splicing and mRNA Export Factor. J. Biol. Chem. 282, P22544-22550.

      We are not aware of other studies looking at the relationship between the UAP56 ATPase and its localization. Thank you for pointing this out!

      Specific Comments:

      Specific Comment 1: Figure 1 shows the impact of loss of UAP56 on neuron dendrite pruning. The experiment employs both two distinct dsRNAs and a MARCM clone, providing confidence that there is a defect in pruning upon loss of UAP56. As the authors mention screening against 92 genes that caused splicing defects in S2 cells, inclusion of some examples of these genes that do not show such a defect would enhance the argument for specificity with regard to the role of UAP56. This control would be in addition to the more technical control that is shown, the mCherry dsRNA.

      Our response

      We thank reviewer 3 for this comment. To address this, we included the full list of screened genes with their phenotypic categorization regarding pruning (103 RNAi lines targeting 64 genes) as Table S1. In addition, we also tested four RNAi lines targeting the nuclear mRNA export factors Nxf1, THO2 and THOC5 which do not cause dendrite pruning defects (Fig. S1).

      Changes introduced:

      • added RNAi screen results as a list in Table S1.

      • added new Figure S1: RNAi analyses of NXF1, THO2 and THOC5 in dendrite pruning.

      Specific Comment 2: Later the authors demonstrate a delay in the accumulation of the Mical protein, so if they assayed these pruning events at later times, would the loss of UAP56 cause a delay in these events as well? Such a correlation would enhance the causality argument the authors make for Mical levels and these pruning events.

      Our response

      We thank reviewer 3 for this comment. Unfortunately, this is somewhat difficult to assess, as shortly after the 18 h APF timepoint, the epidermal cells that form the attachment substrate for c4da neuron dendrites undergo apoptosis. Where assessed (e. g., Wang et al., 2017, Development 144: 1851–1862), this process, together with the reduced GAL4 activity of our ppk-GAL4 during the pupal stage (our own observations), eventually leads to pruning, but the causality cannot be easily attributed anymore. We therefore use the 18 h APF timepoint essentially as an endpoint assay.

      Specific Comment 3: Figure 2 provides data designed to test the requirement for the ATPase/helicase activity of UAP56 for these trimming events. The first observation, which is surprising, is the mislocalization of the variant (E194Q) that the authors generate. The data shown do not seem to indicate how many cells the results represent, as a single image and trace is shown for the UAP56::GFP wildtype control and the E194Q variant.

      Our response

      We thank reviewer 3 for this comment. It is correct that the traces shown are from single confocal sections. To better display the phenotypic penetrance, we now added a categorical analysis that shows that the UAP56 E194Q mutant is completely mislocalized in the majority of cells assessed (and the newly added E194A mutant in a subset of cells).

      Changes introduced:

      • added categorical quantification of UAP56 variant localization (new Fig. 2 G).

      Specific Comment 4: Given the rather surprising finding that the ATPase activity is not required for the function of UAP56 characterized here, the authors do not provide sufficient references or rationale to support the ATPase mutant that they generate. The E194Q likely lies in the Walker B motif and is equivalent to human E218Q, which can prevent proper ATP hydrolysis in the yeast Sub2 protein. There is no reference to support the nature of the variant created here.

      Our response

      We thank reviewer 3 for this comment. To better introduce the E194 mutations, we explain the position of the DECD motif in the Walker B domain, give the corresponding residues in the human and yeast homologues (Sub2) and cite papers demonstrating the importance of this residue for ATPase activity:

      1) Saguez, C., et al., 2013. Mutational analysis of the yeast RNA helicase Sub2p reveals conserved domains required for growth, mRNA export, and genomic stability. RNA 19:1363-71.

      2) Shen, J., et al., 2007. Biochemical Characterization of the ATPase and Helicase Activity of UAP56, an Essential Pre-mRNA Splicing and mRNA Export Factor. J. Biol. Chem. 282, P22544-22550.

      Specific Comment 5: Given the surprising results, the authors could have included additional variants to ensure the change has the biochemical effect that the authors claim. Previous studies have defined missense mutations in the ATP-binding site- K129A (Lysine to Alanine): This mutation, in both yeast Sub2 and human UAP56, targets a conserved lysine residue that is critical for ATP binding. This prevents proper ATP binding and consequently impairs helicase function. There are also missense mutations in the DEAD-box motif (Asp-Glu-Ala-Asp), which is involved in ATP binding and hydrolysis. Mutations in this motif, such as D287A in yeast Sub2 (corresponding to D290A in human UAP56), can severely disrupt ATP hydrolysis, impairing helicase activity. In addition, mutations in the Walker A (GXXXXGKT) and Walker B motifs can impair ATP binding and hydrolysis in DEAD-box helicases. Missense mutations in these motifs, like G137A (in the Walker A motif), can block ATP binding, while E218Q (in the Walker B motif)- which seems to be the basis for the variant employed here- can prevent proper ATP hydrolysis.

      Our response

      We thank reviewer 3 for this comment. Our cursory survey of the literature suggested that mutations in the Walker B motif are the most specific, as they still preserve ATP binding, and that their effects have not been well characterized overall. In addition, these mutations can create strong dominant-negatives in related helicases (e. g., Rode et al., 2018 Cell Reports, our lab). To better characterize the role of the Walker B motif in UAP56, we generated and characterized an alternative mutant, UAP56 E194A. While the E194A variant does not show the same penetrance of localization phenotypes as E194Q, it also is partially mislocalized, shows stronger binding to Ref1 and also rescues the uap56 mutant phenotypes without an obvious dominant-negative effect, thus confirming our conclusions regarding E194Q.

      Changes introduced:

      • added biochemical, localization and phenotypic analysis of newly generated UAP56 E194A variant (new Figs. 2 B, 2 E, E', 3 C, C'). categorical quantification of UAP56 variant localization (new Fig. 2 G).

      Specific Comment 6: The co-IP results shown in Figure 2C would also seem to have multiple potential interpretations beyond what the authors suggest, an inability to disassemble a complex. The change in protein localization with the E194Q variant could impact the interacting proteins. There is no negative control to show that the UAP56-E194Q variant is not just associated with many, many proteins. Another myc-tagged protein that does not interact would be an ideal control.

      Our response

      We thank reviewer 3 for this comment. To address this comment, we tried to co-IP UAP56 wt or UAP56 E194Q with a THO complex subunit THOC7 (new Fig. S2). The results show that neither UAP56 variant can co-IP THOC7 under our conditions (likely because the UAP56/THO complex intermediate during mRNA export is disassembled in an ATPase-independent manner (Hohmann et al., Nature 2025)).

      Changes introduced:

      • added co-IP experiment between UAP56 variants and THOC7 (new Fig. S2).

      Specific Comment 7: With regard to Figure 3, the authors never define EB1::GFP in the text of the Results, so a reader unfamiliar with this system has no idea what they are seeing. Reading the Materials and Methods does not mitigate this concern as there is only a brief reference to a fly line and how the EB1::GFP is visualized by microscopy. This makes interpretation of the data presented in Figure 3A-C very challenging.

      Our response

      We thank reviewer 3 for pointing this out. We added a description of the EB1::GFP analysis in the corresponding Results section (p.8).

      Specific Comment 8: The data shown for MICAL MS2 reporter localization in Figure 4 is nice, but is also fully expected based on many former studies analyzing loss of UAP56 or UAP56 hypomorphs in different systems. While creating the reporter is admirable, to make the argument that MICAL localization is in some way preferentially impacted by loss of UAP56, the authors would need to examine several other transcripts. As presented, the authors can merely state that UAP56 seems to be required for the efficient export of an mRNA transcript, which is predicted based on dozens of previous studies dating back to the early 2000s.

      Our response

      Firstly, thank you for commenting on the validity of the experimental approach! The primary purpose of this experiment was to test whether the mechanism of UAP56 during dendrite pruning conforms with what is known about UAP56's cellular role - which it apparently does. We also noted that our statements regarding the specificity of UAP56 for Mical over other transcripts are difficult to support. While our experiments would be consistent with such a model, they do not prove it. We therefore toned down the corresponding statements (e. g., the concluding sentence at the end of the first Results paragraph is now: "We conclude that c4da neuron dendrite pruning is particularly sensitive to loss of UAP56." (p. 6)).

      Minor (and really minor) points:

      In the second sentence of the Discussion, the word 'developing' seems to be mis-typed "While a general inhibition of mRNA export might be expected to cause broad defects in cellular processes, our data in develoing c4da neurons indicate that loss of UAP56 mainly affects pruning mechanisms related to actin remodeling."

      Sentence in the Results (lack of page numbers makes indicating where exactly a bit tricky)- "We therefore reasoned that Mical expression could be more challenging to c4da neurons." This is a complete sentence as presented, yet, if something is 'more something'- the thing must be 'more than' something else. Presumably, the authors mean that the length of the MICAL transcript could make the processing and export of this transcript more challenging than typical fly transcripts (raising the question of the average length of a mature transcript in flies?).

      Our response

      Thanks for pointing these out. The typo is fixed, page numbers are added. We changed the sentence to: "Because of the large size of its mRNA, we reasoned that MICAL gene expression could be particularly sensitive to loss of export factors such as UAP56." (p.9) We hope this is more precise language-wise.

      Reviewer #3 (Significance (Required)):

      Understanding how post-transcriptional events are linked to key functions in neurons is important and would be of interest to a broad audience.

    1. 3.4. Bots and Responsibility# As we think about the responsibility in ethical scenarios on social media, the existence of bots causes some complications. 3.4.1. A Protesting Donkey?# To get an idea of the type of complications we run into, let’s look at the use of donkeys in protests in Oman: “public expressions of discontent in the form of occasional student demonstrations, anonymous leaflets, and other rather creative forms of public communication. Only in Oman has the occasional donkey…been used as a mobile billboard to express anti-regime sentiments. There is no way in which police can maintain dignity in seizing and destroying a donkey on whose flank a political message has been inscribed.” From Kings and People: Information and Authority in Oman, Qatar, and the Persian Gulf by Dale F. Eickelman1 In this example, some clever protesters have made a donkey perform the act of protest: walking through the streets displaying a political message. But, since the donkey does not understand the act of protest it is performing, it can’t be rightly punished for protesting. The protesters have managed to separate the intention of protest (the political message inscribed on the donkey) and the act of protest (the donkey wandering through the streets). This allows the protesters to remain anonymous and the donkey unaware of it’s political mission. 3.4.2. Bots and responsibility# Bots present a similar disconnect between intentions and actions. Bot programs are written by one or more people, potentially all with different intentions, and they are run by others people, or sometimes scheduled by people to be run by computers. This means we can analyze the ethics of the action of the bot, as well as the intentions of the various people involved, though those all might be disconnected. 3.4.3. Reflection questions# How are people’s expectations different for a bot and a “normal” user? 
Choose an example social media bot (find on your own or look at Examples of Bots (or apps).) What does this bot do that a normal person wouldn’t be able to, or wouldn’t be able to as easily? Who is in charge of creating and running this bot? Does the fact that it is a bot change how you feel about its actions? Why do you think social media platforms allow bots to operate? Why would users want to be able to make bots? How does allowing bots influence social media sites’ profitability? 1 We haven’t been able to get the original chapter to load to see if it indeed says that, but I found it quoted here and here. We also don’t know if this is common or representative of protests in Oman, nor that we fully understand the cultural importance of what is happening in this story. Still, we are using it at least as a thought experiment.

      I found the donkey protest example helpful for understanding how responsibility can be separated from action. Just like the donkey does not understand the protest it carries, bots can perform actions without intention or awareness. This makes it harder to assign responsibility, since the people who design, deploy, or benefit from a bot may all have different roles and intentions.

    1. Unclear Privacy Rules: Sometimes privacy rules aren’t made clear to the people using a system. For example: If you send “private” messages on a work system, your boss might be able to read them. When Elon Musk purchased Twitter, he also was purchasing access to all Twitter Direct Messages. Others Posting Without Permission: Someone may post something about another person without their permission. See in particular: The perils of ‘sharenting’: The parents who share too much. Metadata: Sometimes the metadata that comes with content might violate someone’s privacy. For example, in 2012, former tech CEO John McAfee was a suspect in a murder in Belize and hid out in secret. But when Vice magazine wrote an article about him, the photos in the story contained metadata with the exact location in Guatemala. Deanonymizing Data: Sometimes companies or researchers release datasets that have been “anonymized,” meaning that things like names have been removed, so you can’t directly see who the data is about. But sometimes people can still deduce who the anonymized data is about. This happened when Netflix released anonymized movie ratings data sets, but at least some users’ data could be traced back to them. Inferred Data: Sometimes information that doesn’t directly exist can be inferred through data mining (as we saw last chapter), and the creation of that new information could be a privacy violation. This includes the creation of Shadow Profiles, which are information about the user that the user didn’t provide or consent to. Non-User Information: Social Media sites might collect information about people who don’t have accounts, like how Facebook does.

      This list shows how privacy risks often come less from a single bad action and more from how data travels and persists across systems. Even when users think they are acting safely or anonymously, metadata, inference, and platform ownership can quietly undermine consent and control, making privacy feel fragile and conditional rather than guaranteed.

    1. Author response:

      The following is the authors’ response to the original reviews

      We appreciate the reviewers’ insightful comments. In response, we conducted three new experiments, summarized in Author response table 1. After the table, we provide detailed responses to each comment.

      Author response table 1.

      Summary of new experiments and results.

      Reviewer #1 (Public review):

      The authors show that corticotropin-releasing factor (CRF) neurons in the central amygdala (CeA) and bed nucleus of the stria terminalis (BNST) monosynaptically target cholinergic interneurons (CINs) in the dorsal striatum of rodents. Functionally, activation of CRFR1 receptors increases CIN firing rate, and this modulation was reduced by pre-exposure to ethanol. This is an interesting finding, with potential significance for alcohol use disorders, but some conclusions could use additional support.

      Strengths:

      Well-conceived circuit mapping experiments identify a novel pathway by which the CeA and BNST can modulate dorsal striatal function by controlling cholinergic tone. Important insight into how CRF, a neuropeptide that is important in mediating aspects of stress, affective/motivational processes, and drug-seeking, modulates dorsal striatal function.

      Weaknesses:

      (1) Tracing and expression experiments were performed both in mice and rats (in a mostly nonoverlapping way). While these species are similar in many ways, some conclusions are based on assumptions of similarities that the presented data do not directly show. In most cases, this should be addressed in the text (but see point number 2).

      In the revised manuscript, we have clarified this limitation in the first paragraph of the Methods and the third paragraph of the Discussion and avoid cross-species claims, limiting our conclusions to the species in which each assay was performed. Specifically, we now state that while mice and rats share many conserved amygdalostriatal components, our tracing and expression studies were performed in a species-specific manner, and direct cross-species comparisons of CRF–CIN connectivity and CRFR1 expression were not assessed. We further note that future studies will be needed to determine the extent to which these observations are conserved across species as more tools become available.

      (2) Experiments in rats show that CRFR1 expression is largely confined to a subpopulation of striatal CINs. Is this true in mice, too? Since most electrophysiological experiments are done in various synaptic antagonists and/or TTX, it does not affect the interpretation of those data, but non-CIN expression of CRFR1 could potentially have a large impact on bath CRF-induced acetylcholine release.

      To address whether CRFR1 expression in striatal CINs is conserved across species, we performed new histological experiments using CRFR1-GFP mice. Striatal sections were immunostained with anti-ChAT, and we found that approximately 10% of CINs express CRFR1 (new Fig. 4D, 4E). This result indicates that, similar to rats, a subset of CINs in mice express CRFR1. However, the proportion of CRFR1<sup>+</sup> CINs is lower than the proportion of CRF-responsive CINs observed during electrophysiology experiments, suggesting that CRF may also modulate CIN activity indirectly through network or synaptic mechanisms. We have also noted in the revised Discussion that while CRFR1 expression is confirmed in a subset of CINs, the broader distribution of CRFR1 among other striatal cell types remains to be determined (third paragraph of Discussion).

      In our study, bath application of CRF increased striatal ACh release. Because striatal ACh is released primarily from CINs, and CRFR1 is an excitatory receptor, this effect is most likely mediated by CRF activation of CRFR1 on CINs, leading to enhanced CIN activity and ACh release. Although CRFR1 may also be expressed on other striatal neurons, these cell types—medium spiny neurons and GABAergic interneurons—are inhibitory. If CRF were to activate CRFR1 on these GABAergic neurons, the resulting increase in GABA release would suppress CIN activity and consequently reduce, rather than enhance, ACh release. Given that most CINs responded functionally while only a small subset expressed CRFR1, these findings imply that indirect mechanisms, such as CRF modulation of local circuits influencing CIN excitability, may also contribute to the observed increase in ACh release. Together, these data support a model in which CRF primarily enhances ACh release via activation of CRFR1-expressing CINs, while indirect network effects may further amplify this response.

      (3) Experiments in rats show that about 30% of CINs express CRFR1. Did only a similar percentage of CINs in mice respond to bath application of CRF? The effect sizes and error bars in Figure 5 imply that the majority of recorded CINs likely responded. Were exclusion criteria used in these experiments?

      We thank the reviewer for this insightful question. In our mouse cell-attached recordings, ~80% of CINs increased firing during CRF bath application, and all recorded cells were included in the analysis (no exclusions based on response direction/magnitude; cells were only required to meet standard recording-quality criteria such as stable baseline firing and seal).

      Using a CRFR1-GFP reporter mouse, we found that ~10% of striatal CINs are GFP+, suggesting that the high proportion of CRF-responsive CINs cannot be explained solely by somatic reporter-labeled CRFR1 expression. Importantly, the CRF-induced increase in CIN firing is blocked by the selective CRFR1 antagonist NBI 35695 (Fig. 5B–C), supporting a CRFR1-dependent mechanism at the circuit level. We now discuss several non-mutually exclusive explanations for this apparent discrepancy: (i) reporter lines (e.g., CRFR1-GFP) may underestimate functional CRFR1 expression, particularly for low-level or compartmentalized receptor pools; (ii) bath-applied CRF may act indirectly via CRFR1 on presynaptic afferents, thereby enhancing excitatory drive onto CINs; and (iii) electrical coupling among CINs could allow direct effects in a subset of CINs to propagate through the CIN network (Ren, Liu et al. 2021). We added this discussion to the revised manuscript (fourth paragraph of the Discussion).

      (4) The conclusion that prior acute alcohol exposure reduces the ability of subsequent alcohol exposure to suppress CIN activity in the presence of CRF may be a bit overstated. In Figure 6D (no ethanol preexposure), ethanol does not fully suppress CIN firing rate to baseline after CRF exposure. The attenuated effect of CRF on CIN firing rate after ethanol pre-treatment (6E) may just reduce the maximum potential effect that ethanol can have on firing rate after CRF, due to a lowered starting point. It is possible that the lack of significant effect of ethanol after CRF in pre-treated mice is an issue of experimental sensitivity. Related to this point, does pre-treatment with ethanol reduce the later CIN response to acute ethanol application (in the absence of CRF)?

      In the revised manuscript, we have tempered our interpretation in the final Results section and throughout the Discussion to emphasize that ethanol pre-exposure attenuates, rather than abolishes, the CRF-induced increase in CIN firing. We also note the reviewer’s important point that in Figure 6D, ethanol does not fully suppress firing to baseline after CRF exposure, consistent with a partial effect. Regarding the reviewer’s question, our experiments were specifically designed to test interactions between CRF and ethanol, so we did not assess whether ethanol pre-treatment alters subsequent responses to ethanol alone. We now explicitly acknowledge CRF-dependent and CRF-independent effects of ethanol on CIN activity as an important point for future studies to disentangle (sixth paragraph of the Discussion). For example, comparing ethanol responses with and without prior ethanol exposure, in the absence of CRF, could resolve this question.

      (5) More details about the area of the dorsal striatum being examined would be helpful (i.e., a-p axis).

      We now provide more detail regarding the anterior–posterior axis of the dorsal striatum examined. Most recordings and imaging were performed in the posterior dorsomedial striatum (pDMS), corresponding to coronal slices posterior to the crossing of the anterior commissure and anterior to the tail of the striatum (starting around 0.62 mm and ending at −1.3 mm relative to bregma). While our primary focus was on posterior slices, some anterior slices were included to increase the sample size. These details have been added to the Methods (last sentence of the ‘Histology and cell counting’ section and of the ‘Slice electrophysiology’ section).

      Reviewer #2 (Public review):

      Essoh and colleagues present a thorough and elegant study identifying the central amygdala and BNST as key sources of CRF input to the dorsal striatum. Using monosynaptic rabies tracing and electrophysiology, they show direct connections to cholinergic interneurons. The study builds on previous findings that CRF increases CIN firing, extending them by measuring acetylcholine levels in slices and applying optogenetic stimulation of CRF+ fibers. It also uncovers a novel interaction between alcohol and CRF signaling in the striatum, likely to spark significant interest and future research.

      Strengths:

      A key strength is the integration of anatomical and functional approaches to demonstrate these projections and assess their impact on target cells, striatal cholinergic interneurons.

      Weaknesses:

      (1) The nature of the interaction between alcohol and CRF actions on cholinergic neurons remains unclear. Also, further clarification of the ACh sensor used and others is required.

      We have clarified the nature of the interaction between alcohol and CRF signaling in CINs and have provided additional details regarding the acetylcholine sensor used. These issues are addressed in detail in our responses to the specific comments below.

      Reviewer #2 (Recommendations for the authors):

      (1) The interaction between the effects of alcohol and CRF is a novel and important part of this study. When considering possible mechanisms underlying the findings in the discussion, there is no mention of occlusion. Given that incubation with alcohol produced a similar increase in firing of CINs as CRF, occlusion could be a parsimonious explanation for the observed interaction. Have the authors considered blocking the effects of alcohol on CINs with a CRFR1 antagonist? Another experiment that could address occlusion would be to test whether alcohol also increases ACh levels, as CRF did.

      We thank the reviewer for proposing occlusion as a potential mechanism underlying the interaction between alcohol and CRF. We agree that, in principle, alcohol-induced endogenous CRF release could occlude subsequent exogenous CRF-mediated potentiation of CIN firing, and we carefully considered this possibility.

      However, several observations from our data argue against occlusion driven by acute alcohol exposure or withdrawal in this preparation. First, as shown in Fig. 6A, bath application of alcohol transiently reduced CIN firing, and firing recovered to baseline levels after washout without any rebound increase. Second, in Fig. 6D–E, the baseline firing rates under control conditions and following alcohol pretreatment were comparable, indicating that acute alcohol exposure and short-term withdrawal did not produce a sustained increase in CIN excitability. Together, these results suggest that acute withdrawal in slices is less likely to trigger substantial endogenous CRF release capable of occluding subsequent exogenous CRF effects.

      While we and others have previously reported increased spontaneous CIN firing following prolonged in vivo alcohol exposure and extended withdrawal periods (e.g., 21 days), short-term withdrawal (e.g., 1 day) does not robustly alter baseline CIN firing (Ma, Huang et al. 2021, Huang, Chen et al. 2024). Consistent with these prior findings, the absence of a rebound or elevated baseline firing in the present slice experiments discouraged further pursuit of an endogenous CRF occlusion mechanism under acute conditions.

      We also considered experimentally testing occlusion by blocking CRFR1 signaling during alcohol pre-treatment. However, this approach is technically challenging in slice recordings, as CRFR1 antagonists require prolonged incubation (~1 hour) during alcohol exposure. Because it is unclear whether endogenous CRF release is triggered by alcohol incubation itself or by withdrawal, the antagonist would need to remain present throughout both the incubation and withdrawal periods. This leaves insufficient time for complete washout of the CRFR1 antagonist prior to subsequent bath application of exogenous CRF to assess its effects on CIN firing. Consequently, residual antagonist presence would confound the interpretation of the exogenous CRF response.

      Finally, regarding the possibility that alcohol increases acetylcholine release, we did not observe alcohol-induced increases in CIN firing in slices, arguing against elevated ACh signaling under these conditions. Consistent with prior work (Ma, Huang et al. 2021, Huang, Chen et al. 2024), alcohol-induced increases in CIN excitability and cholinergic signaling appear to depend on prolonged in vivo exposure and extended withdrawal rather than acute slice-level manipulations.

      We have now incorporated discussion of occlusion as a potential mechanism (seventh paragraph) and clarified why our data and technical considerations argue against it in the present study. We thank the reviewer for this wonderful suggestion, which we will test in future in vivo studies.

      (2) Retrograde monosynaptic tracing of inputs to CINs. The Results state the finding of labeling "in all previously reported area..." Can the authors report these areas? A list in the text or a bar plot, if there is quantification, will suffice. This information will serve as important validation and replication of previous findings.

      We thank the reviewer for this constructive suggestion. We agree that summarizing the anatomical sources of CIN input provides important validation of our tracing results. In the revised Results, we now list the major input regions observed, including the striatum itself, cortex (e.g., cingulate cortex, motor cortex, somatosensory cortex), thalamus (e.g., parafascicular thalamic nucleus, centrolateral thalamic nucleus), globus pallidus, and midbrain (first paragraph of the Results). Quantitative analysis of relative input strength will be presented in a separate study that expands on these findings. Here, we limit the current manuscript to the functional characterization of CRF and alcohol modulation of CINs.

      (3) Given the difference in connectivity among striatal subregions, it would be important to describe in more detail the injection site in the results and figures. In the figure, for example, you might want to include the AP coordinates, given that it is such a zoomed-in image, it is hard to tell how anterior/posterior the site is. I imagine that the picture is a representative image of the injection site, but maybe having a side image with overlay of injection sites in all the animals used, would help.

      The anterior–posterior (AP) coordinates for representative images have been included in the panels and reiterated more clearly in the revised Results section and figure legends. In the legend for Figure 3B, a list of AP coordinates for each animal used for Figure 3A-3E has been added.

      (4) Figure 1D inset, there seem to be some double-labeled cells in the zoomed in BNST images. The authors might want to comment on this. It seemed far from the injection site. Do D1-MSN so far away show connectivity to CINs?

      Upon closer inspection of the BNST images, we noted that a small number of double-labeled cells were indeed present, consistent with our previous report of a subset of D1R-expressing neurons (~10%) in the BNST, with the majority being D2R-expressing neurons (Lu, Cheng et al. 2021). Given the BNST’s anatomical proximity to the dorsal striatum, it is plausible that some D1R-expressing neurons in this region provide monosynaptic input to CINs, highlighting a potential ventral-to-dorsal connection that merits further study.

      (5) Can the author provide quantification of the onset delay of the optogenetic evoked CRF+ axon responses onto CINs? The claim of monosynaptic connectivity is well supported by the TTX/4AP experiment but additional information on the timing will strengthen that conclusion.

      We thank the reviewer for this insightful suggestion. Quantifying the onset latency of optogenetically evoked CRF<sup>+</sup> axon responses onto CINs provides valuable confirmation of monosynaptic connectivity. To address this, we performed new latency measurements under the same recording conditions as the TTX/4-AP experiments. The average onset latency from the start of the optical stimulation was 5.85 ± 0.37 ms (new Figure 3J), consistent with direct monosynaptic transmission.

      As an additional reference, we analyzed latency data from a separate project in which we optogenetically stimulated cholinergic interneurons and recorded synaptic responses in medium spiny neurons. This circuit, known to involve disynaptic transmission from CINs to MSNs via nAChR-expressing interneurons (Author response image 1) (English, Ibanez-Sandoval et al. 2011), exhibited a significantly longer latency (18.34 ± 0.70 ms; t<sub>(29)</sub> = 10.3, p < 0.001) compared to CRF⁺ CeA/BNST inputs to CINs (5.85 ± 0.37 ms). Together, these results further support that CRF⁺ axons form direct functional synapses onto CINs.

      Author response image 1.

      Latency of disynaptic transmission from CINs to MSNs via interneurons. A) Schematic illustrating optogenetic stimulation of Chrimson-expressing CINs, leading to excitation of nAChR-expressing interneurons that release GABA onto recorded MSNs. B) Sample trace of disynaptic transmission (left) and bar graph summarizing onset latency (right) from light stimulation to synaptic response onset (n = 23 neurons from 3 mice).
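      The latency comparison reported above can be roughly sanity-checked from the summary statistics alone. The sketch below is not the authors' analysis: it assumes the ± values are SEMs, that an unpaired Student's t-test with pooled variance was used, and that the CRF⁺ input group contained 8 cells (a value inferred here only from the reported 29 degrees of freedom and n = 23 for the disynaptic group). The function name is hypothetical.

```python
import math


def pooled_t_from_sem(mean1, sem1, n1, mean2, sem2, n2):
    """Student's t (equal variances assumed) from group means, SEMs, and sizes."""
    # Recover each group's sample variance from its SEM: SD^2 = SEM^2 * n
    var1 = (sem1 ** 2) * n1
    var2 = (sem2 ** 2) * n2
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df


# Disynaptic CIN-to-MSN latency: 18.34 +/- 0.70 ms, n = 23 (reported)
# CRF+ input latency: 5.85 +/- 0.37 ms, n = 8 (ASSUMED, inferred from df = 29)
t_stat, df = pooled_t_from_sem(18.34, 0.70, 23, 5.85, 0.37, 8)
print(f"t({df}) = {t_stat:.2f}")
```

      Under these assumptions the statistic comes out close to the reported t<sub>(29)</sub> = 10.3; a Welch test or a different group size would give a somewhat different value.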

      (6) The ACh sensor reported is "AAV-GRABACh4m" but the reference is for GRAB-ACh3.0. Also, BrainVTA has GRAB-ACh4.3. Is this the vector? Could you please check the name of the construct and report the corresponding reference, as well as clarify the meaning of the additional "m". They have a mutant version of the GRAB-ACH that researchers use for control, and of course, you want to use it as a control, but not for the test experiment.

      GRAB-ACh4m is the correct acetylcholine sensor used in this study. The ACh4 series (including ACh4h, ACh4m, and ACh4l; personal communication with Dr. Yulong Li’s lab) represents an updated generation following GRAB-ACh3.0. Although the ACh4 family has not yet been formally published, these constructs are publicly available through BrainVTA (https://www.brainvta.tech/plus/view.php?aid=2680).

      The suffix “m” does not indicate a mutant control; rather, it denotes a medium-affinity variant within the ACh4 sensor family. Importantly, the mutant (non-responsive) control sensor is only available for GRAB-ACh3.0 (ACh3.0mut) and does not exist for the ACh4 series.

      Our laboratory has previously used GRAB-ACh4m in multiple peer-reviewed publications (Huang, Chen et al. 2024, Gangal, Iannucci et al. 2025, Purvines, Gangal et al. 2025), and its use has also been reported by independent groups in recent preprints (Potjer, Wu et al. 2025, Touponse, Pomrenze et al. 2025). We have now clarified the construct name and its relationship to GRAB-ACh3.0 in the Methods ‘Reagents’ section, and we have corrected the reference accordingly.

      (7) Are CRF-R1+ CINs equally abundant in the DMS and DLS? From the image in Figure 4, it seems that a larger percentage of CINs are CRFR1+ in the DLS than in DMS. Is this true? The authors probably already have this data, or it should be easy to get, and it could be additional information that was not studied before.

      We did not perform a quantitative comparison of CRFR1+ CIN abundance between the DMS and DLS in the present study. While the representative images in Figure 4 may appear to suggest regional differences, these panels were selected to illustrate labeling quality rather than relative density and should not be interpreted as evidence of unequal distribution. We have clarified this point in the revised Discussion (last sentence of the third paragraph) and note that future studies will be needed to systematically evaluate potential regional differences in CRFR1 expression, which could have important implications for dorsal striatal function.

      (8) The manuscript states several times that there are no CRF+ neurons in the dorsal striatum. At the same time, there are reports of the CRF+ neuron in the ventral striatum and its role in learning. Could the authors include mention of the studies by the Lemos group (10.1016/j.biopsych.2024.08.006)

      We have revised the Discussion section to clarify that our findings pertain specifically to the dorsal striatum and now acknowledge the presence and functional relevance of CRF+ neurons in the ventral striatum, citing the Lemos group’s study (fifth paragraph of the Discussion).

      (9) For the histology analysis, please express cell counts as "density", not just number of cells, by providing an area (e.g., "number of cells/µm<sup>2</sup>").

      In the revised manuscript, all histological outcomes have been recalculated as cell density (cells/mm<sup>2</sup>) by normalizing raw cell counts to the measured area of each region of interest (ROI). Figures that previously displayed absolute counts now present densities (cells/mm<sup>2</sup>), with corresponding updates made to figure legends and text. We note one exception in Figure 4B, where the comparison between the total number of CINs and CRFR1+ CINs is best represented as cell counts rather than normalized values, as the counting was conducted in the same area (within the same ROI) of the dorsostriatal subregion.
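      The normalization described above amounts to dividing each raw count by the ROI area expressed in mm<sup>2</sup>. The following is a minimal sketch of that arithmetic only; the function name, the assumption that ROI areas are exported in µm<sup>2</sup>, and the example numbers are all hypothetical and are not taken from the manuscript.

```python
UM2_PER_MM2 = 1_000_000  # 1 mm^2 = 1000 um x 1000 um


def cell_density_per_mm2(cell_count, roi_area_um2):
    """Convert a raw cell count and an ROI area in um^2 to cells per mm^2."""
    if roi_area_um2 <= 0:
        raise ValueError("ROI area must be positive")
    roi_area_mm2 = roi_area_um2 / UM2_PER_MM2
    return cell_count / roi_area_mm2


# Hypothetical example: 12 cells counted in a 500 um x 500 um ROI (250,000 um^2)
density = cell_density_per_mm2(12, 250_000)
print(f"{density:.1f} cells/mm^2")
```

      Counts taken within the same ROI, as in the Figure 4B exception, share a denominator, so comparing them as raw counts or as densities gives the same ratio.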

      (10) Figure 2C, we can see there are some labeled fibers in the striatum cut. Would it be possible to get a better confocal image?

      Figure 2C has been replaced with a higher-quality confocal image captured at the same magnification and scale. The updated image provides improved clarity and resolution, ensuring accurate visualization of labeled CRF+ fibers, but not cell bodies, within the striatum.

      (11) The ACh measurements in the slice are very informative and an important addition. I first thought that these experiments with the GRAB-ACh sensor were performed in ChAT-eGFP mice. After reading more carefully, I realized they were done in wild-type mice. Would you include the wild-type label in the figure as well? The ChAT-eGFP BAC transgenic line was reported to have enhanced ACh packaging and increased ACh release, which could have magnified the signals. So, it is important to highlight that the experiments were done in wild-type mice.

      We now label with ‘WT mice’ and note in the legend that all GRAB-ACh experiments were performed in wild-type mice, not ChAT-eGFP, to avoid confounds in ACh release. We thank the reviewer for this important suggestion.

      Reviewer #3 (Public review):

      The authors demonstrate that CRF neurons in the extended amygdala form GABAergic synapses onto cholinergic interneurons and that CRF can excite these neurons. The evidence is strong; however, the authors fail to make a compelling connection showing that CRF released from these extended amygdala neurons is mediating any of these effects. Further, they show that acute alcohol appears to modulate this action, although the effect size is not particularly robust.

      Strengths:

      This is an exciting connection from the extended amygdala to the striatum that provides a new direction for how these regions can modulate behavior. The work is rigorous and well done.

      Weaknesses:

      (1) While the authors show that opto stim of these neurons can increase firing, this is not shown to be CRFR1 dependent. In addition, the effects of acute ethanol are not particularly robust or rigorously evaluated. Further, the opto stim experiments are conducted in an Ai32 mouse, so it is impossible to determine if that is from CEA and BNST, vs. another population of CRF-containing neurons. This is an important caveat.

      We added recordings with the CRFR1 antagonist antalarmin. Light-evoked increases in CIN firing were abolished under CRFR1 blockade, linking the effect to CRFR1 (Figure 5J, 5K). We also clarify that CRF-Cre;Ai32 does not isolate CeA versus BNST sources, so we temper regional claims and highlight this as a limitation. The acute ethanol effects are modest but consistent; we expanded the discussion of dose and preparation constraints in acute slice physiology and note that in vivo studies will be needed to define the network-level impact.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors could bring some of this data together by examining CRFR1 dependence of optical stimulation-induced increases in firing. Further, the authors have devoted significant effort to exploring how the BNST and CEA project to the CIN, yet their ephys does not explore site-specific infusion of ChR2 into either region. How are we to be sure it is not some other population of CRF neurons mediating this effect? The alcohol data does not appear particularly robust, but I think if the authors wanted to, they could explore other concentrations. Mostly I think it is important to discuss the limitations of acute alcohol on a brain slice.

      We thank the reviewer for these thoughtful comments, which helped us strengthen the mechanistic interpretation of the CRF-CIN interaction. In the revised manuscript, we have addressed each point as follows:

      - CRFR1 dependence of optogenetically evoked responses: We performed new recordings in which optogenetic stimulation of CRF⁺ terminals in the dorsal striatum was conducted in the presence of the CRFR1 antagonist antalarmin. The increase in CIN firing evoked by light stimulation was abolished under CRFR1 blockade, confirming that this effect is mediated through CRFR1 activation (new Figure 5J, 5K, third paragraph of the corresponding Result section). These results directly link the functional effects of CRF⁺ terminal activation to CRFR1 signaling on CINs.

      - CeA vs. BNST projection specificity: The reviewer is correct that CeA and BNST projections were not analyzed separately. Because these pathways were previously uncharacterized, our experiments were designed first to establish the monosynaptic connections between CeA/BNST CRF neurons and striatal CINs. Future studies will explore the specific contribution of each site. However, our data exclude the possibility of other CRF neurons, as we selectively infused Cre-dependent opsins into both CeA and BNST of CRF-Cre mice (Figure 3G-3J).

      - Limitations of acute slice experiments: We have expanded the Discussion (sixth paragraph) to acknowledge that acute slice physiology cannot fully capture the dynamic and network-level effects of ethanol observed in vivo. While this preparation enables mechanistic precision, factors such as washout, diffusion constraints, and the absence of systemic feedback may underestimate ethanol’s impact on CINs. We now explicitly note this limitation and highlight the need for in vivo studies to examine behavioral and circuit-level implications of CRF–alcohol interactions.

      Collectively, these revisions clarify the CRFR1 dependence of CRF<sup>+</sup> terminal effects and reaffirm that both CeA and BNST projections contribute to CIN modulation while addressing the methodological limitations of the slice preparation.

      Reviewer #4 (Public review):

      This manuscript presents a compelling and methodologically rigorous investigation into how corticotropin-releasing factor (CRF) modulates cholinergic interneurons (CINs) in the dorsal striatum (a brain region central to cognitive flexibility and action selection) and how this circuit is disrupted by alcohol exposure. Through an integrated series of anatomical, optogenetic, electrophysiological, and imaging experiments, the authors uncover a previously uncharacterized CRF⁺ projection from the central amygdala (CeA) and bed nucleus of the stria terminalis (BNST) to dorsal striatal CINs.

      Strengths:

      Key strengths of the study include the use of state-of-the-art monosynaptic rabies tracing, CRF-Cre transgenic models, CRFR1 reporter lines, and functional validation of synaptic connectivity and neurotransmitter release. The finding that CRF enhances CIN excitability and acetylcholine (ACh) release via CRFR1, and that this effect is attenuated by acute alcohol exposure and withdrawal, provides important mechanistic insight into how stress and alcohol interact to impair striatal function. These results position CRF signaling in CINs as a novel contributor to alcohol use disorder (AUD) pathophysiology, with implications for relapse vulnerability and cognitive inflexibility associated with chronic alcohol intake. The study is well-structured, with a clear rationale, thorough methodology, and logical progression of results. The discussion effectively contextualizes the findings within broader addiction neuroscience literature and suggests meaningful future directions, including therapeutic targeting of CRFR1 signaling in the dorsal striatum.

      Weaknesses:

      (1) Minor areas for improvement include occasional redundancy in phrasing, slightly overlong descriptions in the abstract and significance sections, and a need for more concise language in some places. Nevertheless, these do not detract from the manuscript's overall quality or impact. Overall, this is a highly valuable contribution to the fields of addiction neuroscience and striatal circuit function, offering novel insights into stress-alcohol interactions at the cellular and circuit level, which requires minor editorial revisions.

      We have streamlined the abstract and significance statement, reduced redundancy, and improved conciseness throughout the text. We appreciate the reviewer’s feedback, which has helped us further strengthen the clarity and readability of the manuscript.

      Reviewer #4 (Recommendations for the authors):

      (1) Lines 29-30: Slightly verbose. Consider: "Alcohol relapse is associated with corticotropin-releasing factor (CRF) signaling and altered reward pathway function, though the precise mechanisms are unclear."

      The sentence has been revised as recommended to improve clarity and conciseness in the introductory section (Lines 31-32).

      (2) Lines 39-43: Good synthesis, but could better emphasize the novelty of identifying a CRF-CIN pathway.

      The abstract has been revised to more clearly emphasize the novelty of identifying a CRF-CIN pathway and its functional significance (Lines 42-43).

      (3) Lines 66-68: Consider integrating clinical relevance more directly, e.g., "AUD affects over 14 million adults in the U.S., with relapse often triggered by stress...".

      The introduction has been revised to more directly emphasize the clinical relevance of alcohol use disorder, including its high prevalence and the role of stress in relapse, thereby underscoring the translational significance of our findings (Lines 68-69).

      (4) Line 83: Repetition of "goal-directed learning, habit formation, and behavioral flexibility" appears multiple times; consider variety.

      We have varied the phrasing in the Introduction to avoid redundancy. Specifically, in place of repeating “goal-directed learning, habit formation, and behavioral flexibility,” we now use alternative terms such as “action selection,” “habitual responding,” and “cognitive flexibility,” depending on the context.

      (5) Lines 107-116: Clarify why both rats and mice were used: do they serve different experimental purposes?

      We now explain that each species was used for complementary experimental purposes. Rats were used for histological validation of CRFR1 expression using the CRFR1-Cre-tdTomato line, which has been extensively characterized in this species. Mice were used for the majority of electrophysiological, optogenetic, and GRAB-ACh sensor experiments due to the availability of well-established transgenic CRF-Cre-driver lines. This division allowed us to leverage the most appropriate tools in each species to address different aspects of the study. We have clarified this rationale in the Methods (first paragraph of the “Animals” section) and Discussion (third paragraph).

      (6) Electrophysiology section: The distinction between acute exposure vs. withdrawal could be further emphasized.

      To better highlight the distinction between acute alcohol exposure and withdrawal, we have clarified the timing and context of each condition within the Results section for Figure 6. Specifically, we now distinguish the immediate suppressive effects of alcohol observed during bath application (acute exposure) from the subsequent changes in CIN firing measured after washout (withdrawal). These revisions clarify the temporal dynamics and functional implications of CRF–alcohol interactions in our experimental design.

      (7) Lines 227-229: Reword for clarity: "Significantly more BNST neurons projected to CINs compared to the CeA...".

      The sentence has been reworded to clarify as recommended (Lines 247-248).

      (8) Lines 373-374: Consider connecting the CRF-CIN circuit to behavioral inflexibility in AUD more directly.

      We have modified the sentence (Lines 390-395) to more explicitly link alcohol-induced dysregulation of the CRF–CIN circuit to behavioral inflexibility in AUD, consistent with the established role of CINs in action selection and cognitive flexibility.

      (9) Lines 387-389: This is an excellent point about stress resilience; consider expanding with examples or potential implications.

      We thank the reviewer for this insightful suggestion. In the revised Discussion (sixth paragraph), we expanded this section to more directly connect alcohol-induced disruption of CRF–CIN signaling with impaired stress resilience and behavioral inflexibility. Specifically, we now note that such dysregulation may compromise stress resilience mechanisms mediated by CRF–cholinergic interactions in the striatum and related corticostriatal circuits. We further discuss how impaired CIN responsiveness could blunt adaptive behavioral adjustments under stress, biasing animals toward habitual or compulsive alcohol seeking. This addition highlights the broader implication that alcohol-induced alterations in CRF–CIN signaling may contribute to relapse vulnerability by undermining adaptive stress coping.

      References

      English, D. F., O. Ibanez-Sandoval, E. Stark, F. Tecuapetla, G. Buzsaki, K. Deisseroth, J. M. Tepper and T. Koos (2011). "GABAergic circuits mediate the reinforcement-related signals of striatal cholinergic interneurons." Nat Neurosci 15(1): 123–130.

      Gangal, H., J. Iannucci, Y. Huang, R. Chen, W. Purvines, W. T. Davis, A. Rivera, G. Johnson, X. Xie, S. Mukherjee, V. Vierkant, K. Mims, K. O'Neill, X. Wang, L. A. Shapiro and J. Wang (2025). "Traumatic brain injury exacerbates alcohol consumption and neuroinflammation with decline in cognition and cholinergic activity." Transl Psychiatry 15(1): 403.

      Huang, Z., R. Chen, M. Ho, X. Xie, H. Gangal, X. Wang and J. Wang (2024). "Dynamic responses of striatal cholinergic interneurons control behavioral flexibility." Sci Adv 10(51): eadn2446.

      Lu, J. Y., Y. F. Cheng, X. Y. Xie, K. Woodson, J. Bonifacio, E. Disney, B. Barbee, X. H. Wang, M. Zaidi and J. Wang (2021). "Whole-Brain Mapping of Direct Inputs to Dopamine D1 and D2 Receptor-Expressing Medium Spiny Neurons in the Posterior Dorsomedial Striatum." Eneuro 8(1).

      Ma, T., Z. Huang, X. Xie, Y. Cheng, X. Zhuang, M. J. Childs, H. Gangal, X. Wang, L. N. Smith, R. J. Smith, Y. Zhou and J. Wang (2021). "Chronic alcohol drinking persistently suppresses thalamostriatal excitation of cholinergic neurons to impair cognitive flexibility." J Clin Invest 132(4): e154969.

      Potjer, E. V., X. Wu, A. N. Kane and J. G. Parker (2025). "Parkinsonian striatal acetylcholine dynamics are refractory to L-DOPA treatment." bioRxiv.

      Purvines, W., H. Gangal, X. Xie, J. Ramos, X. Wang, R. Miranda and J. Wang (2025). "Perinatal and prenatal alcohol exposure impairs striatal cholinergic function and cognitive flexibility in adult offspring." Neuropharmacology 279: 110627.

      Ren, Y., Y. Liu and M. Luo (2021). "Gap Junctions Between Striatal D1 Neurons and Cholinergic Interneurons." Front Cell Neurosci 15: 674399.

      Touponse, G. C., M. B. Pomrenze, T. Yassine, V. Mehta, N. Denomme, Z. Zhang, R. C. Malenka and N. Eshel (2025). "Cholinergic modulation of dopamine release drives effortful behavior." bioRxiv.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper investigates the control signals that drive event model updating during continuous experience. The authors apply predictions from previously published computational models to fMRI data acquired while participants watched naturalistic video stimuli. They first examine the time course of BOLD pattern changes around human-annotated event boundaries, revealing pattern changes preceding the boundary in anterior temporal and then parietal regions, followed by pattern stabilization across many regions. The authors then analyze time courses around boundaries generated by a model that updates event models based on prediction error and another that uses prediction uncertainty. These analyses reveal overlapping but partially distinct dynamics for each boundary type, suggesting that both signals may contribute to event segmentation processes in the brain.

      Strengths:

      (1) The question addressed by this paper is of high interest to researchers working on event cognition, perception, and memory. There has been considerable debate about what kinds of signals drive event boundaries, and this paper directly engages with that debate by comparing prediction error and prediction uncertainty as candidate control signals.

      (2) The authors use computational models that explain significant variance in human boundary judgments, and they report the variance explained clearly in the paper.

      (3) The authors' method of using computational models to generate predictions about when event model updating should occur is a valuable mechanistic alternative to methods like HMM or GSBS, which are data-driven.

      (4) The paper utilizes an analysis framework that characterizes how multivariate BOLD pattern dissimilarity evolves before and after boundaries. This approach offers an advance over previous work focused on just the boundary or post-boundary points.

      We appreciate this reviewer’s recognition of the significance of this research problem, and of the value of the approach taken by this paper.

      Weaknesses:

      (1) While the paper raises the possibility that both prediction error and uncertainty could serve as control signals, it does not offer a strong theoretical rationale for why the brain would benefit from multiple (empirically correlated) signals. What distinct advantages do these signals provide? This may be discussed in the authors' prior modeling work, but is left too implicit in this paper.

      We added a brief discussion in the introduction highlighting the complementary advantages of prediction error and prediction uncertainty, and cited prior theoretical work that elaborates on this point. Specifically, we now note that prediction error can act as a reactive trigger, signaling when the current event model is no longer sufficient (Zacks et al., 2007). In contrast, prediction uncertainty is framed as proactive, allowing the system to prepare for upcoming changes even before they occur (Baldwin & Kosie, 2021; Kuperberg, 2021). Together, this makes clearer why these two signals could each provide complementary benefits for effective event model updating.

      "One potential signal to control event model updating is prediction error, the difference between the system’s prediction and what actually occurs. A transient increase in prediction error is a valid indicator that the current model no longer adequately captures the current activity. Event Segmentation Theory (EST; Zacks et al., 2007) proposes that event models are updated when prediction error increases beyond a threshold, indicating that the current model no longer adequately captures ongoing activity. A related but computationally distinct proposal is that prediction uncertainty (also termed "unpredictability") can serve as a control signal (Baldwin & Kosie, 2021). The advantage of relying on prediction uncertainty to detect event boundaries is that it is inherently proactive: the cognitive system can start looking for cues about what might come next before the next event starts (Baldwin & Kosie, 2021; Kuperberg, 2021)."

      (2) Boundaries derived from prediction error and uncertainty are correlated for the naturalistic stimuli. This raises some concerns about how well their distinct contributions to brain activity can be separated. The authors should consider whether they can leverage timepoints where the models make different predictions to make a stronger case for brain regions that are responsive to one vs the other.

      We addressed this concern by adding an analysis that explicitly tests the unique contributions of prediction error– and prediction uncertainty–driven boundaries to neural pattern shifts. In the revised manuscript, we describe how we fit a combined FIR model that included both boundary types as predictors and then compared this model against versions with only one predictor. This allowed us to identify the variance explained by each boundary type over and above the other. The results revealed two partially dissociable sets of brain regions sensitive to error- versus uncertainty-driven boundaries (see Figure S1), strengthening our argument that these signals make distinct contributions.

      "To account for the correlation between uncertainty-driven boundaries and error-driven boundaries, we also fitted a FIR model that predicted pattern dissimilarity from both types of boundaries (combined FIR) for each parcel. Then, we performed two likelihood ratio tests: combined FIR to error FIR, which measures the unique contribution of uncertainty boundaries to pattern dissimilarity, and combined FIR to uncertainty FIR, which measures the unique contribution of error boundaries to pattern dissimilarity. The analysis also revealed two dissociable sets of brain regions associated with each boundary type (see Figure S1)."
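
The nested-model comparison described above can be sketched as follows. This is a minimal illustration, assuming ordinary least-squares fits with Gaussian errors; the models actually fitted in the study may differ. The function name `likelihood_ratio` and all numbers are hypothetical and serve only to show the arithmetic.

```python
import math


def likelihood_ratio(rss_reduced, rss_full, n_obs, k_reduced, k_full):
    """Likelihood-ratio statistic for two nested Gaussian linear models.

    rss_*: residual sums of squares of the fitted models.
    k_*:   number of fitted coefficients in each model.
    Under the reduced model, the statistic is approximately chi-square
    distributed with (k_full - k_reduced) degrees of freedom.
    """
    if k_full <= k_reduced:
        raise ValueError("the full model must have more coefficients")
    if rss_full > rss_reduced:
        raise ValueError("a nested full model cannot fit worse")
    # For Gaussian errors, 2 * (logL_full - logL_reduced) = n * ln(RSS_r / RSS_f).
    stat = n_obs * math.log(rss_reduced / rss_full)
    return stat, k_full - k_reduced


# Hypothetical values for one parcel: combined FIR vs. error-only FIR.
# A large statistic would indicate a unique contribution of uncertainty boundaries.
stat, df = likelihood_ratio(
    rss_reduced=520.0, rss_full=480.0, n_obs=1000, k_reduced=21, k_full=41
)
print(f"LR statistic = {stat:.2f} on {df} degrees of freedom")
```

A p-value could then be read from the chi-square survival function, for example with `scipy.stats.chi2.sf(stat, df)`. Swapping the reduced model for the uncertainty-only FIR gives the corresponding test for error boundaries.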

      (3) The authors refer to a baseline measure of pattern dissimilarity, which their dissimilarity measure of interest is relative to, but it's not clear how this baseline is computed. Since the interpretation of increases or decreases in dissimilarity depends on this reference point, more clarity is needed.

      We clarified how the FIR baseline is estimated in the methods section. Specifically, we now explain that the FIR coefficients should be interpreted relative to a reference level, which reflects the expected dissimilarity when timepoints are far from an event boundary. This makes it clear what serves as the comparison point for observed increases or decreases in dissimilarity.

      "The coefficients from the FIR model indicate changes relative to baseline, which can be conceptualized as the expected value when far from event boundaries."
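
One way to see why the baseline corresponds to timepoints far from boundaries is to build a toy FIR design matrix. The sketch below is an assumption-laden illustration, not the study's code: the helper `fir_design`, the window of 3 samples, and the single boundary at sample 10 are all hypothetical.

```python
def fir_design(n_timepoints, boundaries, window):
    """Return one row per timepoint with 2 * window + 1 indicator columns.

    Column j is 1 when the timepoint lies (j - window) samples from a boundary,
    so column `window` marks the boundary itself.
    """
    n_cols = 2 * window + 1
    rows = [[0] * n_cols for _ in range(n_timepoints)]
    for b in boundaries:
        for lag in range(-window, window + 1):
            t = b + lag
            if 0 <= t < n_timepoints:
                rows[t][lag + window] = 1
    return rows


# Hypothetical run: 30 samples, one boundary at sample 10, window of +/- 3 samples.
design = fir_design(n_timepoints=30, boundaries=[10], window=3)

# Rows with no active indicator are modeled by the intercept alone,
# so the intercept estimates dissimilarity far from any boundary.
far_from_boundary = [t for t, row in enumerate(design) if not any(row)]
print(far_from_boundary)
```

In a regression with an intercept, each FIR coefficient is then the difference between dissimilarity at that lag and the intercept-only level.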

      (4) The authors report an average event length of ~20 seconds, and they also look at +20 and -20 seconds around each event boundary. Thus, it's unclear how often pre- and post-boundary timepoints are part of adjacent events. This complicates the interpretations of the reported time courses.

      This point is related to Reviewer 2's comment and is addressed below.

      (5) The authors describe a sequence of neural pattern shifts during each type of boundary, but offer little setup of what pattern shifts we might expect or why. They also offer little discussion of what cognitive processes these shifts might reflect. The paper would benefit from a more thorough setup for the neural results and a discussion that comments on how the results inform our understanding of what these brain regions contribute to event models.

      We thank the reviewer for this advice on how better to set the context for the different potential outcomes of the study. We expanded both the introduction and discussion to better set up expectations for neural pattern shifts and to interpret what these shifts may reflect. In the introduction, we now describe prior findings showing that sensory regions tend to update more quickly than higher-order multimodal regions (Baldassano et al., 2017; Geerligs et al., 2021, 2022), and we highlight that it remains unclear whether higher-order updates precede or follow those in lower-order regions. We also note that our analytic approach is well-suited to address this open question. In the discussion, we then interpret our results in light of this framework. Specifically, we describe how we observed early shifts in higher-order areas such as anterior temporal and prefrontal cortex, followed by shifts in parietal and dorsal attention regions closer to event boundaries. This pattern runs counter to the traditional bottom-up temporal hierarchy view and instead supports a model of top-down updating, where high-level representations are updated first and subsequently influence lower-level processing (Friston, 2005; Kuperberg, 2021). To make this interpretation concrete, we added an example: in a narrative where a goal is reached midway—for instance, a mystery solved before the story formally ends—higher-order regions may update the event representation at that point, and this updated model then cascades down to shape processing in lower-level regions. Finally, we note that the widespread stabilization of neural patterns after boundaries may signal the establishment of a new event model.

      Excerpt from Introduction:

      “More recently, multivariate approaches have provided insights into neural representations during event segmentation. One prominent approach uses hidden Markov models (HMMs) to detect moments when the brain switches from one stable activity pattern to another (Baldassano et al., 2017) during movie viewing; these periods of relative stability were referred to as "neural states" to distinguish them from subjectively perceived events. Sensory regions like visual and auditory cortex showed faster transitions between neural states. Multi-modal regions like the posterior medial cortex, angular gyrus, and intraparietal sulcus showed slower neural state shifts, and these shifts aligned with subjectively reported event boundaries. Geerligs et al. (2021, 2022) employed a different analytical approach called Greedy State Boundary Search (GSBS) to identify neural state boundaries. Their findings echoed the HMM results: short-lived neural states were observed in early sensory areas (visual, auditory, and somatosensory cortex), while longer-lasting states appeared in multi-modal regions, including the angular gyrus, posterior middle/inferior temporal cortex, precuneus, anterior temporal pole, and anterior insula. Particularly prolonged states were found in higher-order regions such as lateral and medial prefrontal cortex.

      The previous evidence about evoked responses at event boundaries indicates that these are dynamic phenomena evolving over many seconds, with different brain areas showing different dynamics (Ben-Yakov & Henson, 2018; Burunat et al., 2024; Kurby & Zacks, 2018; Speer et al., 2007; Zacks, 2010). Less is known about the dynamics of pattern shifts at event boundaries (e.g., whether shifts observed in higher-order regions precede or follow shifts observed in lower-level regions), because the HMM and GSBS analysis methods do not directly provide moment-by-moment measures of pattern shifts. Both the spatial and temporal aspects of evoked responses and pattern shifts at event boundaries have the potential to provide evidence about two potential control processes (error-driven and uncertainty-driven) for event model updating.”

      Excerpt from Discussion:

      “We first characterized the neural signatures of human event segmentation by examining both univariate activity changes and multivariate pattern changes around subjectively identified event boundaries. Using multivariate pattern dissimilarity, we observed a structured progression of neural reconfiguration surrounding human-identified event boundaries. The largest pattern shifts were observed near event boundaries (~4.5 s before) in dorsal attention and parietal regions; these correspond with regions identified by Geerligs et al. (2022) as shifting their patterns on a fast to intermediate timescale. We also observed smaller pattern shifts roughly 12 seconds prior to event boundaries in higher-order regions within anterior temporal cortex and prefrontal cortex, and these are slow-changing regions identified by Geerligs et al. (2022). This is puzzling. One prevalent proposal, based on the idea of a cortical hierarchy of increasing temporal receptive windows (TRWs), suggests that higher-order regions should update representations after lower-order regions do (Chang et al., 2021). In this view, areas with shorter TRWs (e.g., word-level processors) pass information upward, where it is integrated into progressively larger narrative units (phrases, sentences, events). This proposal predicts neural shifts in higher-order regions to follow those in lower-order regions. By contrast, our findings indicate the opposite sequence. Our findings suggest that the brain might engage in top-down event representation updating, with changes in coarser-grain representations propagating downward to influence finer-grain representations (Friston, 2005; Kuperberg, 2021). For example, in a narrative where the main goal is achieved midway, such as a detective solving a mystery before the story formally ends, higher-order regions might update the overarching event representation at that point, and this updated model could then cascade down to reconfigure how lower-level regions process the remaining sensory and contextual details. In the period after a boundary (around +12 seconds), we found widespread stabilization of neural patterns across the brain, suggesting the establishment of a new event model. Future work could focus on understanding the mechanisms behind the temporal progression of neural pattern changes around event boundaries.”

      Reviewer #2 (Public review):

      Summary:

      Tan et al. examined how multivoxel patterns shift in time windows surrounding event boundaries caused by both prediction errors and prediction uncertainty. They observed that some regions of the brain show earlier pattern shifts than others, followed by periods of increased stability. The authors combine their recent computational model to estimate event boundaries that are based on prediction error vs. uncertainty and use this to examine the moment-to-moment dynamics of pattern changes. I believe this is a meaningful contribution that will be of interest to memory, attention, and complex cognition research.

      Strengths:

      The authors have shown exceptional transparency in terms of sharing their data, code, and stimuli, which is beneficial to the field for future examinations and to the reproduction of findings. The manuscript is well written with clear figures. The study starts from a strong theoretical background to understand how the brain represents events and has used a well-curated set of stimuli. Overall, the authors extend the event segmentation theory beyond prediction error to include prediction uncertainty, which is an important theoretical shift that has implications in episodic memory encoding, the use of semantic and schematic knowledge, and attentional processing.

      We thank the reviewer for their support for our use of open science practices, and for their appreciation of the importance of incorporating prediction uncertainty into models of event comprehension.

      Weaknesses:

      The data presented is limited to the cortex, and subcortical contributions would be interesting to explore. Further, the temporal window around event boundaries of 20 seconds is approximately the length of the average event (21.4 seconds), and many of the observed pattern effects occur relatively distal from event boundaries themselves, which makes the link to the theoretical background challenging. Finally, while multivariate pattern shifts were examined at event boundaries related to either prediction error or prediction uncertainty, there was no exploration of univariate activity differences between these two different types of boundaries, which would be valuable.

      The fact that we observed neural pattern shifts well before boundaries was indeed unexpected, and we now offer a more extensive interpretation in the discussion section. Specifically, we added text noting that shifts emerged in higher-order anterior temporal and prefrontal regions roughly 12 seconds before boundaries, whereas shifts occurred in lower-level dorsal attention and parietal regions closer to boundaries. This sequence contrasts with the traditional bottom-up temporal hierarchy view and instead suggests a possible top-down updating mechanism, in which higher-order representations reorganize first and propagate changes to lower-level areas (Friston, 2005; Kuperberg, 2021). (See excerpt for Reviewer 1’s comment #5.)

      With respect to univariate activity, we did not find strong differences between error-driven and uncertainty-driven boundaries. This makes the multivariate analyses particularly informative for detecting differences in neural pattern dynamics. To support further exploration, we have also shared the temporal progression of univariate BOLD responses on OpenNeuro (BOLD_coefficients_brain_animation_pe_SEM_bold.html and BOLD_coefficients_brain_animation_uncertainty_SEM_bold.html in the derivatives/figures/brain_maps_and_timecourses/ directory; https://doi.org/10.18112/openneuro.ds005551.v1.0.4) for interested researchers.

      Reviewer #3 (Public review):

      Summary:

      The aim of this study was to investigate the temporal progression of the neural response to event boundaries in relation to uncertainty and error. Specifically, the authors asked (1) how neural activity changes before and after event boundaries, (2) if uncertainty and error both contribute to explaining the occurrence of event boundaries, and (3) if uncertainty and error have unique contributions to explaining the temporal progression of neural activity.

      Strengths:

      One strength of this paper is that it builds on an already validated computational model. It relies on straightforward and interpretable analysis techniques to answer the main question, with a smart combination of pattern similarity metrics and FIR. This combination of methods may also be an inspiration to other researchers in the field working on similar questions. The paper is well written and easy to follow. The paper convincingly shows that (1) there is a temporal progression of neural activity change before and after an event boundary, and (2) event boundaries are predicted best by the combination of uncertainty and error signals.

      We thank the reviewer for their thoughtful and supportive comments, particularly regarding the use of the computational model and the analysis approaches.

      Weaknesses:

      (1) The current analysis of the neural data does not convincingly show that uncertainty and prediction error both contribute to the neural responses. As both terms are modelled in separate FIR models, it may be that the responses we see for both are mostly driven by shared variance. Given that the correlation between the two is very high (r=0.49), this seems likely. The strong overlap in the neural responses elicited by both, as shown in Figure 6, also suggests that what we see may mainly be shared variance. To improve the interpretability of these effects, I think it is essential to know whether uncertainty and error explain similar or unique parts of the variance. The observation that they have distinct temporal profiles is suggestive of some dissociation, but not as convincing as adding them both to a single model.

      We appreciate this point. It is closely related to Reviewer 1's comment 2; please refer to our response above.

      (2) The results for uncertainty and error show that uncertainty has strong effects before or at boundary onset, while error is related to more stabilization after boundary onset. This makes me wonder about the temporal contribution of each of these. Could it be the case that increases in uncertainty are early indicators of a boundary, and errors tend to occur later?

      We share the intuition that increases in uncertainty may be early indicators of a boundary, with errors tending to occur later. If that were the case, we would expect a lag between prediction uncertainty and prediction error. We examined the lagged correlation between prediction uncertainty and prediction error, and the optimal lag was 0 for both the uncertainty-driven and error-driven models. This indicates that when prediction uncertainty rises, prediction error rises at the same time.

      Author response image 1.
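
The lagged-correlation check described in the response to comment (2) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `pearson` and `best_lag` are hypothetical, the two series are synthetic, and the synthetic error series is deliberately delayed by two samples so that a non-zero optimal lag is visible. It does not reproduce the reported result of an optimal lag of 0.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_lag(uncertainty, error, max_lag):
    """Return the lag (in samples) with the highest correlation, plus all scores.

    A positive lag means that error follows uncertainty by that many samples.
    """
    n = len(uncertainty)
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = uncertainty[: n - lag], error[lag:]
        else:
            x, y = uncertainty[-lag:], error[: n + lag]
        scores[lag] = pearson(x, y)
    return max(scores, key=scores.get), scores


# Synthetic, non-periodic "uncertainty" series and an "error" series delayed by 2.
uncertainty = [math.sin(0.9 * t) + 0.5 * math.sin(0.37 * t) for t in range(60)]
error = [0.0, 0.0] + uncertainty[:-2]

lag, _ = best_lag(uncertainty, error, max_lag=5)
print("optimal lag:", lag)
```

If errors systematically trailed uncertainty in real data, the optimal lag from such an analysis would be positive; a peak at 0 indicates that the two signals rise together.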

      (3) Given that there is a 24-second period during which the neural responses are shaped by event boundaries, it would be important to know more about the average distance between boundaries and the variability of this distance. This will help establish whether the FIR model can properly capture a return to baseline.

      We have added details about the distribution of event lengths. Specifically, we now report that the mean length of subjectively identified events was 21.4 seconds (median 22.2 s, SD 16.1 s). For model-derived boundaries, the average event lengths were 28.96 seconds for the uncertainty-driven model and 24.7 seconds for the error-driven model.

      "For each activity, a separate group of 30 participants had previously segmented each movie to identify fine-grained event boundaries (Bezdek et al., 2022). The mean event length was 21.4 s (median 22.2 s, SD 16.1 s). Mean event lengths for the uncertainty-driven and error-driven models were 28.96 s and 24.7 s, respectively (Nguyen et al., 2024)."
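
Summary statistics of this kind can be derived directly from boundary times. The sketch below assumes boundaries are given as onset times in seconds; the helper `event_lengths` and the boundary values are hypothetical and are not the study's data.

```python
import statistics


def event_lengths(boundary_times):
    """Durations between consecutive boundaries, in the same units as the input."""
    times = sorted(boundary_times)
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical boundary onsets (seconds) for one movie.
boundaries_s = [0.0, 12.5, 40.0, 47.0, 90.5]
lengths = event_lengths(boundaries_s)

mean_len = statistics.mean(lengths)
median_len = statistics.median(lengths)
sd_len = statistics.stdev(lengths)  # sample standard deviation

# Events shorter than the 20 s half-window overlap with a neighbouring
# boundary's FIR window, which bears on whether a return to baseline is observable.
n_short = sum(length < 20.0 for length in lengths)
print(mean_len, median_len, round(sd_len, 2), n_short)
```

Reporting the share of events shorter than the FIR half-window, alongside the mean and SD, gives a direct sense of how often pre- and post-boundary timepoints fall within adjacent events.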

      (4) Given that there is an early onset and long-lasting response of the brain to these event boundaries, I wonder what causes this. Is it the case that uncertainty or errors already increase at 12 seconds before the boundaries occur? Or are there other markers in the movie that the brain can use to foreshadow an event boundary? And if uncertainty or errors do already increase 12 seconds before an event boundary, do you see a similar neural response at moments with similar levels of error or uncertainty, which are not followed by a boundary? This would reveal whether the neural activity patterns are specific to event boundaries or whether these are general markers of error and uncertainty.

      We appreciate this point; it is similar to reviewer 2’s comment 2. Please see our response to that comment above.

      (5) It is known that different brain regions have different delays of their BOLD response. Could these delays contribute to the propagation of the neural activity across different brain areas in this study?

      Our analyses use ±20 s FIR windows, and the key effects we report include shifts ~12 s before boundaries in higher-order cortex and ~4.5 s before boundaries in dorsal attention/parietal areas. Region-dependent BOLD delays (~1–2 s; Taylor et al., 2018) are much smaller than the temporal structure we observe, making it unlikely that HRF lag alone explains our multi-second, region-specific progression.

      (6) In the FIR plots, timepoints -12, 0, and 12 are shown. These long intervals preclude an understanding of the full temporal progression of these effects.

      For page length purposes, we did not include all timepoints. We uploaded a brain animation of all timepoints and coefficients for each parcel in Openneuro (PATTERN_coefficients_brain_animation_human_fine_pattern.html and PATTERN_coefficients_lines_human_fine.html in the derivatives/figures/brain_maps_and_timecourses/ directory; https://doi.org/10.18112/openneuro.ds005551.v1.0.4) for interested researchers.

      References

      Taylor, A. J., Kim, J. H., & Ress, D. (2018). Characterization of the hemodynamic response function across the majority of human cerebral cortex. NeuroImage, 173, 322–331. https://doi.org/10.1016/j.neuroimage.2018.02.061

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      *Reviewer #1 (Evidence, reproducibility and clarity (Required): *

      *Using genetics and microscopy approaches, Cabral et al. investigate how fission yeast regulates its length and width in response to osmotic, oxidative, or low glucose stress. Miller et al. have recently found that the cell cycle regulators Cdc25, Cdc13 and Cdr2 integrate information about cell volume, time and cell surface area into the cellular decision when to divide. Cabral et al. now build on this work and test how disruption of these regulators affects cell size adaptation. They find that each stress condition shows a distinct dependence on the individual regulators, suggesting that the complex size control network enables optimized size adaptation for each condition. Overall, the manuscript is clear and the detailed methods ensure that the experiments can be replicated.

      Major comments:

      1.) It would be much easier to follow the authors' conclusions, if in addition to surface area to volume ratio, length and width, they would also plot cell volume at division in Figs. 1-4.*

      AUTHOR RESPONSE: Due to space constraints in the main (and supplemental) figures, we focused on SA:Vol ratio together with cell length and width, which directly define cell geometry in rod-shaped fission yeast. Surface area and volume are derived from these measurements and can be misleading when considered alone, as similar surface area or volume values can arise from distinct combinations of length and width. The SA:Vol ratio therefore serves as a robust integrative metric for capturing coordinated changes in length and width that reshape cell geometry. We would be happy to include individual surface area and volume plots if requested.

      2.) To me, it seems that maybe even more than upon osmotic stress, the cdc13-2x strain differs qualitatively from WT in low glucose conditions, where the increased SA-V ratio is almost completely abolished.

      AUTHOR RESPONSE: We agree with the reviewer and have revised the manuscript text to point out this difference. The newly added text states: “Under low glucose, cdc13-2x cells also showed a WT-like response, decreasing length and increasing in SA:Vol ratio (Figures 3B-D). However, this SA:Vol increase was reduced compared to WT (1% vs 8.5%; Figures 1D and 3B), suggesting impaired geometric remodeling under glucose limitation.”

      3.) It is not entirely clear to me why two copies of Cdc13 would qualitatively affect the responses. Shouldn't the extra copy behave similarly to the endogenous one and therefore only lead to quantitative changes? Maybe the authors can discuss this more clearly or even test a strain in which Cdc13 function is qualitatively disrupted.

      AUTHOR RESPONSE: Increased Cdc13 protein concentration in cdc13-2x cells disrupts the typical time-scaling of Cdc13 protein. Consistent with this, cdc13-2x cells enter mitosis at a smaller cell size. We have modified the text to clarify this point. The new text states: “To assess the role of the Cdc13 time-sensing pathway, we disrupted Cdc13 protein abundance by creating a cdc13-2x strain carrying an additional copy of cdc13 integrated at an exogenous locus. cdc13-2x cells divided at a smaller size than WT, reflecting accelerated mitotic entry upon disruption of typical time-scaling of Cdc13 protein (Figure S1A).”

      4.) I don't see why the authors come to the conclusion that under osmotic stress cells would maximize cell volume. It leads to a decreased cell length, doesn't it?

      AUTHOR RESPONSE: WT cells under osmotic stress do decrease in length, but this is accompanied by an increase in cell width. Because width contributes disproportionately to cell volume in rod-shaped cells, this change results in a modest but reproducible reduction in the SA:Vol ratio relative to WT cells in control medium (Figure 1D). We note that the degree of this change under osmotic stress is small (-0.4%), although statistically significant (p * Likewise, in Figure 2B, they interpret tiny changes in the SA/V. By my estimation, the difference between control and osmotic stress is only 2% (1.195/1.17), less than the wild-type case, which appears to be twice that (which is still pretty modest). The small amplitude of these changes is obscured by the fact that the graphs do not have a baseline at zero, which, as a matter of good data-presentation practice, they should.

      *

      AUTHOR RESPONSE: We appreciate the reviewer’s distinction between statistical and biological significance and agree that this is an important point to clarify. We now note in the revised text that changes in SA:Vol ratio under osmotic stress are numerically small and should not be overinterpreted. Our revised text now states: “Under oxidative and osmotic stress, the SA:Vol ratio decreased, indicating greater cell volume expansion relative to surface area (Figure 1D). However, we note that the reduction in SA:Vol under osmotic stress, while statistically significant, was modest in magnitude (−0.4%).”

      Although small in absolute terms, even subtle geometric changes can be biologically meaningful in fission yeast due to the small size of these cells, where minor shifts in length or width translate into measurable differences in membrane area relative to cytoplasmic volume. Importantly, in Figure 2B, the key observation is not the magnitude of the change but its direction: cdc25-degron-DaMP cells exhibit a ~2% increase in SA:Vol ratio under osmotic stress, in contrast to the decrease observed in WT cells under the same condition. This opposite response reflects altered cell geometry and is supported by corresponding changes in cell length and width. We have revised the Results text to emphasize both the modest magnitude and the directional nature of these effects: “Under osmotic stress, cdc25-degron-DaMP cells exhibited a ~2% increase in SA:Vol ratio, opposite to the modest decrease observed in WT cells. This increase arose from increased cell length and reduced width (Figures 2B-D).”

      Regarding data presentation, because SA:Vol ratios vary over a narrow numerical range, setting the y-axis minimum to zero would compress the data and obscure all detectable differences. Instead, we have modified our SA:Vol ratio graphs in Figs. 1-4 to have consistent axis scaling across panels to accurately convey relative changes while maintaining visual clarity. We are happy to provide full data tables and statistical outputs upon request.

      * I am also concerned about the use of manual measurement of width at a single point along the cell. This approach is very sensitive to the choice of width point and to non-cylindrical geometries, several of which are evident in the images presented. MATLAB will return the ??? as well as the length from a mask, but even better, one can more accurately calculate the surface area and volume by assuming rotational symmetry of the mask. Given that surface area and volume calculation need to be redone anyway, as discussed below, I encourage the authors to calculate them directly from the mask, instead of using the cylindrical assumption.*

      AUTHOR RESPONSE: In initial experiments to calculate surface area and volume of fission yeast cells for prior work (Miller et al., 2023, Current Biology) we found that automated width measurements by MATLAB or ImageJ were inaccurate for a subset of cells leading to noisy cell surface area and volume values. Measuring cell width by hand and assuming that each cell in a given strain had the same cell radius (average of population) for calculation of cell surface area and volume gave more consistent results and recapitulated established conclusions regarding size control mechanisms.

      In this previous work and the current study, abnormally skinny or wide regions of a cell were avoided when drawing a line to measure the cell width by hand. For each strain and condition, an average cell width was determined per independent experiment and used for surface area and volume calculations. Additionally, previous analysis demonstrated that this approach yields results consistent with a rotation method derived directly from cell masks, which does not assume a cylindrical cell shape (Facchetti et al., 2019, Current Biology; Miller et al., 2023, Current Biology).

      To test the validity of our size measurements and confirm the robustness of our results in this study we compared the surface area and volume of cells by this rotation method. We have added this additional information to our revised methods section and also added SA:Vol ratio graphs generated from the rotation size measurement to our revised Figure S1 E-J. Importantly, both approaches used to measure cell size gave consistent results and supported the same conclusions.*

      The authors also need to be more careful about their claims about size-dependent scaling. The concentrations of both Cdc13 and Cdc25 scale with size (perhaps indirectly, in the case of Cdc13), but that of Cdr2 does not. Cdr2 activity has been proposed to scale with size, and its density at cortical nodes has been reported to scale with size, although that claim has been challenged.*

      AUTHOR RESPONSE: We have modified text in the Introduction and Results to address this point. Our revised text in the introduction states: “Recent work has shown that Cdk1 activation integrates size- and time-dependent inputs: the Wee1-inhibitory kinase Cdr2 cortical node density scales with cell surface area (Pan et al., 2014; Facchetti et al., 2019); Cdc25 nuclear accumulation scales with cell volume; and cyclin Cdc13 accumulates over time in the nucleus (Miller et al., 2023) (Figure 1B).” Our revised text in the results section states: “Cdr2 functions as a cortical scaffold that regulates Wee1 activity in relation to cell size, with Cdr2 nodal density reported to scale with cell surface area, enforcing a surface area threshold for mitotic entry (Pan et al., 2014; Allard et al., 2018; Facchetti et al., 2019; Sayyad and Pollard, 2022).”*

      Even taking the authors' approach at face value, there are observations that do not seem to make sense, which led me to realize that the wrong formulae were used to calculate surface area and volume.

      In Figure 1E,F, the KCl-treated cells get shorter and wider; surely, that should result in a lower SA/V ratio. However, as noted above, in Figure 1D, they are shown to have a similar ratio. As a sanity check, I eye-balled the numbers off of the figure (control: 14 µm x 3.6 µm and KCl: 11 µm x 3.8 µm) and calculated their surface area and volume using the formula for a capsule (i.e., a cylinder with hemispheric ends).

      SA = the surface area of the two hemispheres + the surface area of the cylinder in between = 4*pi*(width/2)^2 + pi*width*(length-width), where the length-width term gives the side length of the capsule (length without the hemispheres) from the full length of the capsule (length including the hemispheres).

      V = the volume of the two hemispheres + the volume of the cylinder in between = 4/3*pi*(width/2)^3 + pi*(width/2)^2*(length-width).
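The reviewer's two capsule formulas can be written out as a short Python sketch. This is only an illustration of the arithmetic in the comment above, not the authors' analysis pipeline: the function names are hypothetical, and the inputs are the dimensions the reviewer read off the figure by eye (in µm), not measured data.

```python
import math


def capsule_surface_area(length, width):
    """Surface area of a cylinder with hemispherical ends.

    `length` is the full tip-to-tip length and `width` the diameter,
    as in the reviewer's formula.
    """
    radius = width / 2
    side_length = length - width  # cylindrical part, without the two caps
    return 4 * math.pi * radius**2 + math.pi * width * side_length


def capsule_volume(length, width):
    """Volume of the same capsule: two hemispheres plus the cylinder."""
    radius = width / 2
    side_length = length - width
    return (4 / 3) * math.pi * radius**3 + math.pi * radius**2 * side_length


def sa_to_vol(length, width):
    """Surface-area-to-volume ratio (per µm when inputs are in µm)."""
    return capsule_surface_area(length, width) / capsule_volume(length, width)


# Eyeballed dimensions from the reviewer's comment (assumed, not measured)
control_ratio = sa_to_vol(14.0, 3.6)
kcl_ratio = sa_to_vol(11.0, 3.8)
print(f"control SA/V = {control_ratio:.3f}, KCl SA/V = {kcl_ratio:.3f}")
```

With these inputs the ratio is lower for the shorter, wider KCl-like cell than for the control-like cell, which is the direction the reviewer expects.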

      I got SA/V ratios of around 2, which are way off from what is presented in Figure 1D, but my calculated ratio goes down in KCl, as expected, but not as reported.

      To make sure I was not doing something wrong, I was going to repeat my calculations with the formulae in Table 1, which made me realize both are incorrect. The stated formula for the cell surface area, 2*pi*RL, only represents the surface area of the cylindrical side of the cell, not its hemispherical ends. And it is not even the correct formula for the surface area of the side, because that calls for L to be the length of the side (without the hemispherical ends), not the length of the cell (which includes the hemispherical ends). L here is stated to be cell length (which is what is normally measured in the field, and which is consistent with the reported length of control cells in Figure 1E being 14 µm). The formula for the volume of a capsule in the form used in Table 1 (volume of a cylinder of length L - the volume excluded from the hemispherical ends) is pi*R^2*L - (8-(4/3*pi))*R^3.

      Given these problems, I think I spent too much time thinking about the rest of the paper, because all of the calculations, and perhaps their interpretations, need to be redone.*

      AUTHOR RESPONSE: The surface area and volume equations for a cylinder with hemispherical ends used in our study and listed in our table are correct and widely used in other work with fission yeast cells (Navarro and Nurse, 2012; Pan et al., 2014; Facchetti et al., 2019; BayBay et al., 2020; and Miller et al., 2023). We write our equations with variables for cell length and radius because these are biologically relevant and measured parameters for fission yeast cells. Cell length (L) refers to the total tip-to-tip length of the cell, including the hemispherical ends, and radius (R) refers to half the measured cell width. We have revised the Methods section to clarify this definition and avoid ambiguity (please see the Methods section “Cell geometry measurements”).
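For reference, the algebra linking the two notations can be written out as a short derivation. It assumes only the definitions given in the response (L is the tip-to-tip length including both hemispherical ends, R is half the cell width), so the cylindrical section has length L - 2R. The surface-area line reduces to the 2*pi*RL form that the reviewer quotes from Table 1; the volume line is derived the same way here and is not copied from the manuscript's table.

```latex
\begin{align}
SA &= \underbrace{4\pi R^{2}}_{\text{two hemispheres}}
      + \underbrace{2\pi R\,(L - 2R)}_{\text{cylindrical side}}
    = 2\pi R L \\
V  &= \underbrace{\tfrac{4}{3}\pi R^{3}}_{\text{two hemispheres}}
      + \underbrace{\pi R^{2}\,(L - 2R)}_{\text{cylinder}}
    = \pi R^{2} L - \tfrac{2}{3}\pi R^{3}
\end{align}
```

Written in terms of width and length, these are the same expressions as the reviewer's capsule formulas, with width = 2R and length = L.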

      Additionally, SA and Vol calculations were performed using the length of each individual cell and the average cell radius of the population. We did not use the mean cell length of the population for our calculations, as the reviewer assumed in their “sanity check” above. Please see the Methods section “Cell geometry measurements”. We hope that these clarifications and text revisions improve transparency and reproducibility.

      * Minor Points:

      Strains should be identified by strain number in the text and figure legends.*

      AUTHOR RESPONSE: For clarity and readability, we refer to strains by genotype in the main text and figure legends, which we believe is more informative for readers than strain numbers. All strain numbers corresponding to each genotype are provided in Table S1, ensuring traceability and reproducibility without compromising clarity in data presentation.*

      In the Introduction, "Most cell control their size" should be "Most eukaryotic cell control their size".*

      • *

      AUTHOR RESPONSE: The text has been corrected as suggested.*

      Reviewer #2 (Significance (Required)):

      Nothing to add.*

      *Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary This manuscript reports that fission yeast cells exhibit distinct cell size and geometry when exposed to osmotic, oxidative, or low-glucose stress. Based on quantitative measurements of cell length and width, the authors propose that different stress conditions trigger specific 'geometric adaptation' patterns, suggesting that cell size homeostasis is flexibly modulated depending on environmental cues. The study provides phenotypic evidence that multiple environmental stresses lead to distinct outcomes in the balance between cell surface area and volume, which the authors interpret as stress-specific modes of size control.

      Major comments 1) The authors define the 48-hour time point as the 'long-term response', but no justification is provided for why 48 hours represents a physiologically relevant adaptation phase. It is unclear whether the size-control mode has stabilized by that time, or whether it may continue to change afterward. At minimum, the authors should provide a rationale (e.g., growth recovery dynamics, transcriptional adaptation plateau, or pilot time-course observations) to demonstrate that 48 hours corresponds to the steady-state adaptive phase rather than an arbitrarily selected time point.*

      AUTHOR RESPONSE: We thank the reviewer for this important point and agree that the definition of the long-term response should be clarified. We have addressed this with new experiments and revised text. We now incorporate growth curve data and doubling time analyses for all yeast strains grown under control and stress conditions (See new Figure S3). These analyses show that following an initial transient stress-induced cell cycle delay, growth rates stabilize well before 48 hours. Notably, the slowest growth rate observed was in 1M KCl, with a doubling time of ~4 hours across all yeast strains tested. Thus, by 48 hours, cells in this condition have undergone more than 12 generations of growth, while cells in all other conditions with shorter doubling times have undergone even more divisions. So by allowing cells to grow for 48 hours prior to imaging, we are capturing cells that have resumed sustained cell cycle progression following transient stress-induced cell cycle delays. Because cell size control is tightly linked to the cell cycle, we define 48 hours as a physiologically relevant time point where cells have adapted to stress conditions.
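The generation count in this response follows from dividing the incubation time by the doubling time. A minimal Python sketch of that arithmetic is given below; the function names are hypothetical, the 48 h and ~4 h values are taken from the response, and the calculation assumes steady exponential growth for the whole period (it ignores the initial transient delay the authors mention).

```python
def generations_elapsed(elapsed_hours, doubling_time_hours):
    """Number of population doublings, assuming constant exponential growth."""
    return elapsed_hours / doubling_time_hours


def fold_increase(generations):
    """Fold change in cell number after the given number of doublings."""
    return 2 ** generations


# Slowest condition in the response: 1 M KCl, doubling time of about 4 h
gens = generations_elapsed(48, 4)
print(f"{gens:.0f} doublings, about {fold_increase(gens):.0f}-fold expansion")
```

Conditions with shorter doubling times give proportionally more doublings over the same 48 h.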

      Our revised Methods now state: “Cultures were incubated at 25°C while shaking at 180 rpm for 48 h prior to imaging. This time point was chosen to ensure that cells had progressed beyond the initial transient stress response and reached a stable, condition-specific growth state, as confirmed by growth curve and doubling time analyses showing stabilization well before 48 h (Figure S3), including in the slowest growing condition (1 M KCl; doubling time ~4 h).”

      * 2*)Related to the above comment, the authors propose that different stresses lead to distinct cell size adaptations, yet the rationale for the chosen stress intensities and exposure times is insufficiently described. It remains unclear whether the osmotic, oxidative, and low-glucose conditions used here induce comparable levels of cellular stress. Dose-response and time-course analyses would greatly strengthen the conclusions. Without such analyses, it is difficult to support the interpretation that geometry modulation represents a direct adaptive response.

      AUTHOR RESPONSE: * *We selected the specific stress conditions based on previously published work showing that these doses elicit robust responses while preserving overall cell viability and the capacity for recovery. We note that osmotic, oxidative, and low glucose conditions perturb fundamentally different cellular systems (turgor pressure and cell wall mechanics, redox balance, and metabolism etc.) and therefore do not generate directly comparable levels of cellular stress in a quantitative sense. Our goal was not to equalize stress intensity across conditions, but to examine how cells change their geometry in response to distinct classes of stressors.

      We have clarified the rationale for specific stress conditions in the revised Methods: “These stress intensities were selected based on prior studies demonstrating robust cellular responses while preserving cell viability and the capacity for recovery (Fantes and Nurse, 1977; Shiozaki and Russell, 1995; Degols et al., 1996; López-Avilés et al., 2008; Sansó et al., 2008; Saitoh et al., 2015; Salat-Canela et al., 2021; Bertaux et al., 2023).”

      * 3) The authors describe stress-induced size changes as an 'adaptive' response. While this is an appealing hypothesis, the presented data do not demonstrate that the change in cell size itself confers a fitness advantage. Evidence showing that blocking the size change reduces stress survival, or that the altered size improves growth recovery, would be required to support this claim. Without such data, the use of the term 'geometric adaptation' seems overstated.*

      AUTHOR RESPONSE: We have revised the text to remove the term “adaptive” and now describe stress-induced size changes in descriptive terms. As discussed further in response to Comment 4, new growth curve and doubling time analyses show that defects in surface area or volume expansion do not uniformly impair growth or survival over the stress exposure examined here, reinforcing the decision to avoid fitness-based language.*

      4) The authors conclude that mutants exhibit no major defects in growth or viability during 48-hour stress exposure based on comparable septation index values (Fig. S2). However, septation index alone does not fully capture growth performance or cell-cycle progression and is not sufficient to support claims regarding fitness or robustness of proliferation. If the authors intend to make statements about 'growth', 'viability', or 'cell-cycle progression', additional quantitative measures (e.g., growth curves, doubling time, colony-forming units, or microcolony growth measurements) would be necessary. Alternatively, the claims should be toned down to align with the measurements currently provided.*

      AUTHOR RESPONSE: We have addressed this concern with new experiments and revised text. In addition to septation index measurements (now analyzed using chi-square tests of proportions; Figure S2), we performed growth curve experiments and doubling time analyses for all genotypes under control and stress conditions (new Figure S3). These additional data show that growth rates are largely comparable across genotypes in control, oxidative, and low-glucose conditions, with more pronounced genotype-dependent differences emerging under osmotic stress. Defects in surface area or volume expansion did not uniformly correspond to impaired population growth, indicating that geometric remodeling is not strictly required for proliferation over the 48-hour stress exposure examined here. We have refined our conclusion to emphasize that defects in surface area or volume expansion do not uniformly impair growth or survival. See revised Results text under the heading “Defects in surface area or volume expansion do not uniformly compromise growth or survival”.*

      5) Related to the above comment, the manuscript does not adequately rule out the possibility that the decreased division size simply results from slower growth or delayed cell-cycle progression rather than a shift in the size-control mechanism. Measurements and normalizations of growth rate are required; without them, the interpretation remains speculative.*

      AUTHOR RESPONSE: We agree that changes in growth rate or altered cell cycle timing are important to consider. We have revised our text: “Changes in growth rate or cell cycle progression under stress may influence division size by altering mitotic regulator accumulation. Future studies measuring mitotic regulator dynamics alongside growth rates will be needed to distinguish direct changes in size control mechanisms from growth- or timing-dependent effects.”

      * 6) Regarding the phenotypes of wee1-2x cells, it is interesting that they increase the SA:Vol ratio under all stress conditions and show phenotypes distinct from cdr2Δ cells. From these observations, the authors claim that Cdr2 and Wee1 function as a surface-area-sensing module that complements the volume-sensing and time-sensing pathways to maintain geometric homeostasis. To support this interpretation, the authors could consider additional experiments, such as analyzing cdr2Δ + wee1-2x cells under the same stress conditions. Such data would test whether increased Wee1 can rescue or modify the cdr2Δ phenotype, providing functional evidence for the proposed Cdr2-Wee1-Cdk1 regulatory relationship. Measurements of cell length, width, SA:Vol ratio, and, if feasible, Cdk1 activity markers in the strain would greatly strengthen the mechanistic claims.*

      AUTHOR RESPONSE: We thank the reviewer for this insightful suggestion. While analysis of a cdr2Δ wee1-2x strain could provide additional mechanistic detail, such experiments address a distinct question beyond the scope of our current study, which focuses on how cell geometry changes under different stress conditions in cells with perturbed surface area-, volume-, or time-sensing pathways. Our conclusions regarding a surface area-sensing role for Cdr2-Wee1 signaling are based on previous studies (Pan et al., 2014; Facchetti et al., 2019; Miller et al., 2023) and the cell geometry phenotypes we observe in cdr2Δ and wee1-2x cells under stress conditions. *

      Minor comments 1) The manuscript focuses on adaptation through changes in the surface-to-volume ratio; however, only the ratio is shown. Presenting the underlying values of surface area and volume would clarify which geometric parameter primarily contributes to the observed changes.*

      AUTHOR RESPONSE: Please see our response to Reviewer 1 major comment 1.*

      *2) Statistical analysis for Fig.S2 should be provided.

      AUTHOR RESPONSE: We have completed this. See revised Figure S2 and methods.*

      3) The paper by Kellogg and Levin 2022 is missing from the reference list.*

      AUTHOR RESPONSE: Thank you for catching this. This reference has now been added. *

      **Referees cross-commenting**

      After reading the other reviewers' reports, I recognize that focal points differ, but they appear sequential rather than contradictory.

      Reviewer 2 raises concerns regarding the surface area/volume calculations, which, if incorrect, would influence many of the quantitative conclusions. I agree that confirming the validity of these calculations (and recalculating if necessary) should be the top priority before evaluating the biological interpretations.

      Reviewer 1 raises more mechanistic biological questions. These are certainly important, but in my view they depend on the robustness of the quantitative analysis highlighted by Reviewer 2.

      Therefore, I regard the reports as complementary rather than conflicting. Once the analytical issue pointed out by Reviewer 2 is resolved, the field will be in a better position to assess the significance of the mechanistic points raised by Reviewer 1 (as well as those in my own report).

      Reviewer #3 (Significance (Required)):

      General assessment One of the major strengths of this manuscript is its quantitative, side-by-side comparison of multiple environmental stresses under a unified experimental and analytical framework. The authors provide well-controlled morphometric measurements, allowing direct comparison of geometry changes that would otherwise be difficult to evaluate across studies. The observation that different stress types generate distinct geometric outcomes is particularly intriguing and has the potential to stimulate new conceptual thinking in the field of size control. However, the strength of the conceptual conclusion is currently limited by several aspects of the experimental design and interpretation. In particular, it remains unclear whether the observed geometry changes represent active adaptive responses rather than non-specific consequences of prolonged or strong stress exposure. Demonstrating whether geometry remodeling provides a fitness advantage, clarifying whether the changes reach a steady state rather than reflecting slow drift over time, or identifying upstream stress pathways that govern the response would substantially strengthen the conceptual advance. Even if additional mechanistic or fitness-related data cannot be added, refining the interpretation so that it remains aligned with the present evidence will enhance the clarity and impact of the study.

      Advance Previous studies, including the 2023 publication by the James B. Moseley group, established that fission yeast integrates distinct size-control pathways related to surface area, volume, and time under normal growth conditions. The present manuscript extends this line of work to stressed environments and argues that each stress condition elicits a distinct size-control pattern. To our knowledge, a systematic comparison of cell geometry across multiple stress types in the context of size-control pathways has not been reported, and this represents a potentially valuable conceptual advance. The advance is primarily phenomenological and conceptual rather than mechanistic: the work presents new correlations between stress types and geometry but does not yet elucidate the pathways governing these responses or demonstrate a functional advantage. With additional evidence, or with qualifiers ensuring that claims match the current data, the study could make an important contribution to understanding how cells integrate environmental cues into size-control strategies.

      Audience Although the primary audience consists of researchers in the fields of cell growth, cell-cycle control, and stress responses in yeast, the conceptual contribution may interest broader fields such as growth homeostasis, metabolic adaptation, and pathological cell size changes in higher eukaryotes. Beyond yeast biology, the modular view of size regulation proposed here may inspire new investigations in stem cell biology, cancer research, and biotechnology where environmental adaptation and cell size are closely linked.

      Expertise: nuclear morphology; cell morphology; cell growth; cell cycle; cytoskeleton*

    1. 8.1. Sources of Social Media Data. Social media platforms collect various types of data on their users. Some data is directly provided to the platform by the users. Platforms may ask users for information like: email address, name, profile picture, interests, and friends. Platforms also collect information on how users interact with the site. They might collect information like (they don’t necessarily collect all this, but they might): when users are logged on and logged off, who users interact with, what users click on, what posts users pause over, where users are located, and what users send in direct messages to each other. Online advertisers can see what pages their ads are being requested on, and track users across those sites. So, if an advertiser sees their ad is being displayed on an Amazon page for shoes, then the advertiser can start showing shoe ads to that same user when they go to another website. Additionally, social media might collect information about non-users, such as when a user posts a picture of themselves with a friend who doesn’t have an account, or a user shares their phone contact list with a social media site, some of whom don’t have accounts (Facebook does this). Social media platforms then use “data mining” to search through all this data to try to learn more about their users, find patterns of behavior, and in the end, make more money.

      This section made me realize how much data social media platforms collect, even beyond what we intentionally share. I used to think they only stored basic info like my name or email, but they also track behaviors like what I click on, how long I look at posts, and even where I go online. It feels a little uncomfortable because many of these things happen without us noticing. It shows that our online actions can reveal a lot about us, not just what we directly say. This makes me think we should be more careful about privacy and what platforms are allowed to collect.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer 1

      Minor

      The main substance of my previous comment I suppose targeted a deeper issue - namely whether such a result is reflecting a resolution to a 'neural prediction' puzzle or a 'perceptual prediction' puzzle. Of course, these results tell us a great deal about a potential resolution for how dampening and sharpening might co-exist in the brain - but in the absence of corresponding perceptual effects (or a lack of correlation between neural and perceptual variables - as outlined in this revision) I do wonder if any claims about implications for perception might need moderation or caveating. To be honest, I don't think the authors *need* to make any more changes along these lines for this paper to be acceptable - it is more an issue they might wish to consider themselves when contextualizing their findings.

      Thank you for the thoughtful comment. We have now added a caveat to the relevant section of the discussion to make it clearer that we are discussing neural results, not perceptual results (p.20, lines 378-379).

      I am also happy with the changes that the authors have made justifying which claims can and cannot be made based on a statistical decoding test against 'chance' in a single condition using t-tests. I was perhaps a little unclear when I spoke about 'comparisons against 0' in my original review, when the key issue (as the authors have intuited!) is about comparisons against 'chance' (where e.g., 0% decoding above chance is the same thing as 'chance'!). The authors are of course correct in the amendment they have made on p.29 to make clear this is a 'fixed effects analysis' - though I still worry this could be a little cryptic for the average reader. I am not suggesting that the authors run more analyses, or revise any conclusions, but I think it would be more transparent if a note was added along the lines of "while the fixed effects approach (one-sample t-test) enables us to establish whether some consistent informative patterns are detectable in these particular subjects, the results from our paired t-tests support inference to the wider population".

      This sentence has been added for increased transparency (p. 27, lines 544-547).

      Reviewer 3

      Major

      (1) In the previous round of comments, I noted that: "I am not fully convinced that Figures 3A/B and the associated results support the idea that early learning stages result in dampening and later stages in sharpening. The inference made requires, in my opinion, not only a significant effect in one time bin and the absence of an effect in other bins. Instead, to reliably make this inference one would need a contrast showing a difference in decoding accuracy between bins, or ideally an analysis not contingent on seemingly arbitrary binning of data, but a decrease (or increase) in the slope of the decoding accuracy across trials. Moreover, the decoding analyses seem to be at the edge of SNR, hence making any interpretation that depends on the absence of an effect in some bins yet more problematic and implausible". The authors responded: "we fitted a logarithmic model to quantify the change of the decoding benefit over trials, then found the trial index for which the change of the logarithmic fit was < 0.1%. Given the results of this analysis and to ensure a sufficient number of trials, we focused our further analyses on bins 1-2". However, I do not see how this new analysis addresses the concern that the conclusion highlights differences in decoding performance between bins 1 and 2, yet no contrast between these bins is performed. While I appreciate the addition of the new model, in my current understanding it does not solve the problem I raised. I still believe that if the authors wish to conclude that an effect differs between two bins they must contrast these directly and/or use a different appropriate analysis approach.

      Relatedly, the logarithmic model fitting and how it justifies the focus on analysis bin 1-2 needs to be explained better, especially the rationale of the analysis, the choice of parameters (e.g., why logarithmic, why change of logarithmic fit < 0.1% as criterion, etc), and why certain inferences follow from this analysis. Also, the reporting of the associated results seems rather sparse in the current iteration of the manuscript.

      We thank the reviewer for this important point. Following your suggestion, we conducted additional post-hoc tests directly comparing the first and second bins. We found significant differences between bins in the invalid trials, but not the valid trials, suggesting that sharpening/dampening effects are condition specific. This is discussed in the manuscript on p.14, lines 268-271; p.15, 280-284; p.20, lines 382-386.
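      As a rough illustration of the kind of direct between-bin contrast discussed above, a paired t statistic on per-participant decoding accuracies could be computed as sketched below. This is not the authors' analysis code: the function name `paired_t` is hypothetical, and the accuracy values are invented purely so the example runs.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two matched samples."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    # Mean within-participant difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Invented per-participant decoding accuracies (illustration only, not real data).
bin1 = [0.54, 0.52, 0.56, 0.51, 0.55, 0.53]
bin2 = [0.51, 0.50, 0.52, 0.50, 0.53, 0.50]

t_stat, dof = paired_t(bin1, bin2)
print(f"t({dof}) = {t_stat:.2f}")
```

      The point of the paired form is that each participant serves as their own control, so the test targets the bin 1 versus bin 2 difference directly instead of comparing each bin against chance separately.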

      A logarithmic analysis was chosen as learning is usually found to be a nonlinear process; learning effects occur rapidly before stabilising relatively early, as seen in Fig. 2D. This is consistent with other research which found that logarithmic fits efficiently describe learning curves in statistical learning (Kang et al., 2023; Siegelman et al., 2018; Choi et al., 2020). By utilising a change of logarithmic fit at <0.1% as a criterion, it is ensured that virtually zero learning took place after that point, allowing us to focus our analysis on learning effects as they developed and providing a more accurate model of representational change. This is explained in the manuscript on p.13, lines 250-251; p.27-28, lines 557-563.
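      To make the criterion above more concrete, the following is a minimal sketch of fitting a logarithmic curve and locating the trial at which the fitted curve changes by less than 0.1% from one trial to the next. It rests on assumptions that the response does not spell out: the model form `a + b * ln(trial)`, the definition of "change" as relative trial-to-trial change in the fitted value, and the synthetic input values are all illustrative, and the function names are hypothetical.

```python
import math


def fit_log_curve(trials, values):
    """Ordinary least-squares fit of values ~ a + b * ln(trial)."""
    xs = [math.log(t) for t in trials]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def relative_change(a, b, t):
    """Relative change of the fitted curve between trial t - 1 and trial t."""
    prev = a + b * math.log(t - 1)
    curr = a + b * math.log(t)
    return abs(curr - prev) / abs(prev)


def plateau_trial(a, b, max_trial, threshold=0.001):
    """First trial (from 2 onward) whose relative change falls below threshold."""
    for t in range(2, max_trial + 1):
        if relative_change(a, b, t) < threshold:
            return t
    return None


# Synthetic learning curve (assumed shape, not data from the study).
trials = list(range(1, 201))
values = [0.5 + 0.02 * math.log(t) for t in trials]

a, b = fit_log_curve(trials, values)
cutoff = plateau_trial(a, b, max_trial=200)
print(f"a = {a:.3f}, b = {b:.3f}, plateau at trial {cutoff}")
```

      Note that the cutoff depends on how "change" is defined (relative versus absolute, per trial versus per bin), which is one reason the reviewer's request for an explicit rationale matters.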

      (2) A critical point the authors raise is that they investigate the buildup of expectations during training. They go on to show that the dampening effect disappears quickly, concluding: "the decoding benefit of invalid predictions [...] disappeared after approximately 15 minutes (or 50 trials per condition)". Maybe the authors can correct me, but my best understanding is as follows: Each bin has 50 trials per condition. The 2:1 condition has 4 leading images, this would mean ~12 trials per leading stimulus, 25% of which are unexpected, so ~9 expected trials per pair. Bin 1 represents the first time the participants see the associations. Therefore, the conclusion is that participants learn the associations so rapidly that ~9 expected trials per pair suffice to not only learn the expectations (in a probabilistic context) but learn them sufficiently well such that they result in a significant decoding difference in that same bin. If so, this would seem surprisingly fast, given that participants learn by means of incidental statistical learning (i.e. they were not informed about the statistical regularities). I acknowledge that we do not know how quickly the dampening/sharpening effects develop, however surprising results should be accompanied with a critical evaluation and exceptionally strong evidence (see point 1). Consider for example the following alternative account to explain these results. Category pairs were fixed across and within participants, i.e. the same leading image categories always predicted the same trailing image categories for all participants. Some category pairings will necessarily result in a larger representational overlap (i.e., visual similarity, etc.) and hence differences in decoding accuracy due to adaptation and related effects. For example, house → barn will result in a different decoding performance compared to coffee cup → barn, simply due to the larger visual and semantic similarity between house and barn compared to coffee cup and barn. These effects should occur upon first stimulus presentation, independent of statistical learning, and may attenuate over time e.g., due to increasing familiarity with the categories (i.e., an overall attenuation leading to smaller between condition differences) or pairs.

      We apologise for the confusion; there are 50 expected trials per bin per condition. The trial breakdown is as follows. Each participant completed 1728 trials, split equally across 3 mappings (two 2:1 maps and one 1:2 map), giving 1152 trials in the 2:1 mapping. Stimuli were expected in 75% of trials (864), leaving 216 per bin, and 54 per leading image in each bin. We have clarified this in the manuscript (p.14, line 267; p.15, line 280). This is in line with similar studies in the field (e.g. Han et al., 2019).
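      The trial breakdown above can be checked with a few lines of arithmetic. The sketch below only reproduces the stated numbers; the number of bins (4) is not stated explicitly in the response and is an assumption inferred from 864 / 216, and the number of leading images (4) is taken from the reviewer's description of the 2:1 condition.

```python
total_trials = 1728
n_mappings = 3             # two 2:1 maps and one 1:2 map
n_two_to_one_maps = 2
p_expected = 0.75          # stimuli were expected in 75% of trials
n_bins = 4                 # assumption: inferred from 864 / 216
n_leading_images = 4       # per the reviewer's description of the 2:1 condition

trials_per_mapping = total_trials // n_mappings
trials_two_to_one = trials_per_mapping * n_two_to_one_maps
expected_two_to_one = round(trials_two_to_one * p_expected)
expected_per_bin = expected_two_to_one // n_bins
expected_per_leading_image = expected_per_bin // n_leading_images

print(trials_two_to_one, expected_two_to_one,
      expected_per_bin, expected_per_leading_image)
```

      Under these assumptions the figures in the response (1152, 864, 216, and 54) follow directly from the total trial count.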

      (3) In response to my previous comment, why the authors think their study may have found different results compared to multiple previous studies (e.g. Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011), particularly the sharpening to dampening switch, the authors emphasize the use of non-repeated stimuli (no repetition suppression and no familiarity confound) in their design. However, I fail to see how familiarity or RS could account for the absence of sharpening/dampening inversion in previous studies.

      First, if the authors' argument is about stimulus novelty and familiarity as described by Feuerriegel et al., 2021, I believe this point does not apply to the cited studies. Feuerriegel et al., 2021 note: "Relative stimulus novelty can be an important confound in situations where expected stimulus identities are presented often within an experiment, but neutral or surprising stimuli are presented only rarely", which indeed is a critical confound. However, none of the studies (Han et al., 2019; Richter et al., 2018; Kumar et al., 2017; Meyer and Olson, 2011) contained this confound, because all stimuli served as expected and unexpected stimuli, with the expectation status solely determined by the preceding cue. Thus, participants were equally familiar with the images across expectation conditions.

      Second, for a similar reason the authors' argument for RS accounting for the different results does not hold either in my opinion. Again, as Feuerriegel et al. 2021 correctly point out: "Adaptation-related effects can mimic ES when the expected stimuli are a repetition of the last-seen stimulus or have been encountered more recently than stimuli in neutral expectation conditions." However, it is critical to consider the precise design of previous studies. Take again the example of Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011. To my knowledge none of these studies contained manipulations that would result in a more frequent or recent repetition of any specific stimulus in the expected compared to unexpected condition. The crucial manipulation in all these previous studies is not that a single stimulus or stimulus feature (which could be subject to familiarity or RS) determines the expectation status, but rather the transitional probability (i.e. cue-stimulus pairing) of a particular stimulus given the cue. Therefore, unless I am missing something critical, simple RS seems unlikely to differ between expectation conditions in the previous studies and hence seems implausible to account for differences in results compared to the current study.

      Moreover, studies cited by the authors (e.g. Todorovic & de Lange, 2012) showed that RS and ES are separable in time, again making me wonder how avoiding stimulus repetition should account for the difference in the present study compared to previous ones. I am happy to be corrected in my understanding, but with the currently provided arguments by the authors I do not see how RS and familiarity can account for the discrepancy in results.

      The reviewer is correct in that the studies cited (Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011) ensure that participants are equally familiar with the images across expectation conditions. Where the present study differs is that participants are not familiar with individual exemplars at all. Han et al., 2019 used a pool of 30 individual images, and subjects underwent exposure sessions lasting two hours each daily for 34 days prior to testing. Kumar et al., 2017 used a pool of 12 images with subjects being exposed to each sequential pair 816 times over the course of the training period. Meyer & Olson, 2011 used pure tones at five different pitch levels. While familiarity of stimuli across conditions was controlled for in these studies in the sense that familiarity was constant across conditions, novelty was not controlled for. The present study uses a pool of ~3500 images, which are unrepeated across trials.

      Feuerriegel et al., 2021 also point out: “There are also effects of adaptation that are dependent on the recent stimulation history extending beyond the last encountered stimulus and long-lag repetition effects that occur when the first and second presentation of a stimulus is separated by tens or even hundreds of intervening images”. Bearing this in mind, and given the very small pool of stimuli being used by Han et al., 2019; Kumar et al., 2017; Meyer and Olson, 2011, it stands to reason that these studies may still have built-in but unaccounted for effects relating to the repetition of exemplars. Thus, our avoidance of those possible confounds, in addition to foregoing any prior training, may elicit differing results. Furthermore, as pointed out by Walsh et al. 2020, methodological heterogeneity (such as subject training) can produce contrasting results as PP makes divergent predictions regarding the properties of prediction error given different permutations of variables such as training, transitional probabilities, and conditional probabilities. In our case, the use of differing methodology was intentional. These issues have been discussed in more detail in the manuscript (p.5, lines 112-115; p.19, lines 368-377; p.20, lines 378-379).

      Minor

      (1) The authors note in their reply to my previous questions that: "As mentioned above, we opted to target our ERP analyses on Oz due to controversies in the literature regarding univariate effects of ES (Feuerriegel et al., 2021)". This might be a lack of understanding on my side, but how are concerns about the reliability of ES, as outlined by Feuerriegel et al. (2021), an argument for restricting analyses to 1 EEG channel (Oz)? Could one not argue equally well that precisely because of these concerns we should be less selective and instead average across multiple (occipital) channels to improve the reliability of results?

      The reviewer is correct in suggesting that a cluster of occipital electrodes may be more reliable than reporting one single electrode. We have amended the analysis to examine electrodes Oz, O1, and O2 (p.9, lines 187-188; p.11, lines 197-201).

      (2) The authors provide a github link for the dataset and code. However, I doubt that github is a suitable location to share EEG data (which at present I also cannot find linked in the github repo). Do the authors plan to share the EEG data and if so where?

      Thank you for bringing this to my attention. EEG data has now been uploaded at osf.io/x7ydf and linked to the github repository (p.28, lines 569-570).

      (3) The figure text could benefit from additional information; e.g. Fig.1C and Fig.3 do not clarify what the asterisk indicates; p < ? with or without multiple comparison correction?

      Thank you for pointing out this oversight; the figure texts have been amended (p. 9, line 168; p.16, line 289).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      We sincerely appreciate the feedback, attention to detail and timeliness of the referees for our manuscript. Below, we provide a point-by-point response to all comments from the referees, detailing the changes we have already made, and those that are in progress. Referee's comments will appear in bolded text, while our responses will be unbolded. Any text quoted directly from the manuscript will be italicised and contained within "quotation marks". Additionally, we have grouped all comments into four categories (structural changes, minor text changes, experimental changes, figure changes), comments are numbered 1-n in each of these categories. Please note: this response to reviewer's comments included some images that cannot be embedded in this text-only section.

      1. General Statements

      We appreciate the overall highly positive and enthusiastic comments from all reviewers, who clearly appreciated the technical difficulty of this study, and noted amongst other things that this study represents "a major contribution to the future advancement of oocyst-sporozoite biology" and the development of the segmentation score for oocysts as a "major advance[ment]". We apologise for the omission of line numbers on the document sent to reviewers; we removed these for the bioRxiv submission without considering that this PDF would be transferred across to Review Commons.

      We have responded to all reviewers' comments through a variety of text changes, experimental inclusions, or direct query responses. Significant changes to the manuscript since initial submission are as follows:

      1. Refinement of rhoptry biogenesis model: Reviewers requested more detail around the content of the AORs, which we had previously suggested were a vehicle for rhoptry biogenesis as we saw they carried the rhoptry neck protein RON4. We first attempted to address this using antibodies against rhoptry bulb proteins but were unsuccessful. We then developed a *P. berghei* line where the rhoptry bulb protein RhopH3 was GFP-tagged. Using this parasite line, we observed that the earliest rhoptry-like structure, which we had previously interpreted as an AOR, contained RhopH3. By contrast, RhopH3 was absent from AORs. Reflecting these observations, we have renamed this initial structure the 'pre-rhoptry' and suggested a model for rhoptry biogenesis where rhoptry neck cargo are trafficked via the AOR but rhoptry bulb cargo are trafficked by small vesicles that move along the rootlet fibre (previously observed by EM).
      2. Measurement of rhoptry neck vs bulb: While not directly suggested by the reviewers, we have also included an analysis that estimates the proportion of the sporozoite rhoptry that represents the rhoptry neck. By contrast to merozoites, which we show are overwhelmingly represented by the rhoptry bulb, the vast majority of the sporozoite rhoptry represents the rhoptry neck.
      3. Measurement of subpellicular microtubules: One reviewer asked if we could measure the length of subpellicular microtubules where we had previously observed that they were longer on one side of the sporozoite than the other. We have now provided absolute and relative (% sporozoite length) length measurements for these subpellicular microtubules and also calculated the proportion of the microtubule that is polyglutamylated.
      4. More detailed analysis of RON11cKD rhoptries: Multiple comments suggested a more detailed analysis of the rhoptries that were formed/not formed in RON11cKD. We have included an updated analysis that shows the relative position of these rhoptries in sporozoites.

      2. Point-by-point description of the revisions

      Reviewer #1

      Minor text changes (Reviewer #1)

      1. __Text on page 12 could be condensed to highlight the new data of ron4 staining of the AOR. __

      We agree with the reviewer that it is a reasonable suggestion. After obtaining additional data on the contents of the AOR (as described in General Statements #1), this section has been significantly rewritten to highlight these findings.

      2. __Add reference on page 3 after 'disrupted parasites' __

      This sentence has been rewritten slightly with some references included and now reads: "Most data on these processes comes from electron microscopy studies 6-8, with relatively few functional reports on gene deleted or disrupted parasites9-11."

      3. __Change 'the basal complex at the leading edge' - this seems counterintuitive __

      This change has been made.

      4. __Change 'mechanisms underlying SG are poorly' - what mechanisms? of invasion or infection? __

      This was supposed to read "SG invasion" and has now been fixed.

      5. __On page 4: 'handful of proteins' __

      This error has been corrected.

      6. __What are the 'three microtubule spindle structures'? __

      The three microtubule spindle structures (hemispindle, mitotic spindle, and interpolar spindle) are now listed explicitly in the text.

      7. __On page 5: 'little is known' - please describe what is known, also in other stages. At the end of the paper I would like to know what is the key difference to rhoptry function in other stages? __

      The following sentence already detailed that we had recently used U-ExM to visualise rhoptry biogenesis in blood-stage parasites, but the following two sentences have been added to provide extra detail on these findings: "In that study, we defined the timing of rhoptry biogenesis showing that it begun prior to cytokinesis and completed approximate coincident with the final round of mitosis. Additionally, we observed that rhoptry duplication and inheritance was coupled with centriolar plaque duplication and nuclear fission."

      8. __change 'rhoptries golgi-derived, made de novo' __

      This has been fixed.

      9. __change 'new understand to' __

      This change has been made.

      10. __'rhoptry malformations' seem to be similar in sporozoites and merozoites. Is that surprising/new? __

      We assume this is in reference to the mention of "rhoptry malformations" in the abstract. In the RON11 merozoite study (PMID:39292724) the authors noted no gross rhoptry malformations, only that one was not formed/missing. The abstract sentence has been changed to the following to better reflect this nuance: "We show that stage-specific disruption of RON11 leads to a formation of sporozoites that only contain half the number of rhoptries of controls like in merozoites, however unlike in merozoites the majority of rhoptries appear grossly malformed."

      11. __What is known about crossing the basal lamina. Were rhoptries thought to be involved in this process? Or is it proteins on the surface or in other secretory organelles? __

      We are unaware of any studies that specifically look at sporozoites crossing the SG basal lamina. A review, although now ~15 years old, stated that "No information is available as to how the sporozoites traverse the basal lamina" (PMID:19608457) and we are not aware of any further information since then. To try and better define our understanding of rhoptry secretion during SG invasion, we have added the following sentence:

      "It is currently unclear precisely when during these steps of SG invasion rhoptry proteins are required, but rhoptry secretion is thought to begin before in the haemolymph before SG invasion16."

      12. __On page change/specify: 'wide range of parasite structures' __

      The structures observed have been listed: centriolar plaque, rhoptry, apical polar rings, rootlet fibre, basal complex, apicoplast.

      13. __On page 7: is Airyscan2 a particular method or a specific microscope? __

      Airyscan2 is a detector setup on Zeiss LSM microscopes. This was already detailed in the materials and methods section, but figure legends have been clarified to read: "...imaged by an LSM900 microscopy with an Airyscan2 detector".

      14. __how large is RON11? __

      RON11 is 112 kDa in *P. berghei*, as noted in the text.

      15. __There is no causal link between ookinete invasion and oocyst developmental asynchrony __

      We have deleted the sentence that implied that ookinete invasion was responsible for oocyst asynchrony. This section now simply states that "Development of each oocyst within a midgut is asynchronous..."

      16. __First sentence of page 24 appears to contradict what is written in results__ __I don't understand the first two sentences in the paragraph titled Comparison between Plasmodium spp __

      This sentence was worded confusingly, making it appear contradictory when that was not the intention. The sentence has been changed to more clearly support what is written in the discussion and now reads: "Our extensive analysis only found one additional ultrastructural difference between Plasmodium spp."

      __On page 25 or before the vast number of electron microscopy studies should be discussed and compared with the authors new data. __

      It is not entirely clear which new data should be specifically discussed based on this comment. However, we have added a new paragraph that broadly compares MoTissU-ExM and our findings with other imaging methods previously used on mosquito-stage malaria parasites:

      "Comparison of MoTissU-ExM and other imaging modalities

      Prior to the development of MoTissU-ExM, imaging of mosquito-stage malaria parasites in situ had been performed using electron microscopy7,8,11,28, conventional immunofluorescence assays (IFA)10, and live-cell microscopy25. MoTissU-ExM offers significant advantages over electron microscopy techniques, especially volume electron microscopy, in terms of accessibility, throughput, and detection of multiple targets. While we have benchmarked many of our observations against previous electron microscopy studies, the intracellular detail that can be observed by MoTissU-ExM is not as clear as electron microscopy. For example, previous electron microscopy studies have observed Golgi-derived vesicles trafficking along the rootlet fibre8 and distinguished the apical polar rings44; both of which we could not observe using MoTissU-ExM. Compared to conventional IFA, MoTissU-ExM dramatically improves the number and detail of parasite structures/organelles that can be visualised while maintaining the flexibility of target detection. By contrast, it can be difficult or impossible to reliably quantify fluorescence intensity in samples prepared by expansion microscopy, something that is routine for conventional IFA. For studying temporally complex processes, live-cell microscopy is the 'gold-standard' and there are some processes that fundamentally cannot be studied or observed in fixed cells. We attempt to increase the utility of MoTissU-ExM in discerning temporal relationships through the development of the segmentation score but note that this cannot be applied to the majority of oocyst development. Collectively, MoTissU-ExM offers some benefits over these previously applied techniques but does not replace them and instead serves as a novel and complementary tool in studying the cell biology of mosquito-stage malaria parasites."

      __First sentence on page 27: there are many studies on parasite proteins involved in salivary gland invasion that could be mentioned/discussed. __

      The sentence in question is "To the best of our knowledge, the ability of sporozoites to cross the basal lamina and accumulate in the SG intercellular space has never previously been reported."

      This sentence has now been changed to read as follows: "While numerous studies have characterized proteins whose disruption inhibited SG invasion9,10,15,59-63, to the best of our knowledge the ability of sporozoites to cross the basal lamina and accumulate in the SG intercellular space has never previously been reported."

      __On page 10 I suggest to qualify the statement 'oocyst development has typcially been inferred by'. There seem a few studies that show that size doesn't reflect maturation. __

      In our opinion, this statement is already qualified in the following sentence which reads: "Recent studies have shown that while oocysts increase in size initially, their size eventually plateaus (11 days pot infection (dpi) in P. falciparum4)."

      __On page 16 the authors state that different rhoptries might have different function. This is an interesting hypothesis/result that could be mentioned in the abstract. __

      The abstract already contains the following statement: "...and provide the first evidence that rhoptry pairs are specialised for different invasion events." We see this as an equivalent statement.


      Experimental changes (Reviewer #1)

      1. On page 19: do the parasites with the RON11 knockout only have the cytoplasmic or only the apical rhoptries?

      The answer to this is not completely clear. We have added the following data to Figures 6 and 8 where we quantify the proportion of rhoptries that are either apical or cytoplasmic: In both wildtype parasites and RON11ctrl parasites, oocyst spz rhoptries are roughly 50:50 apical:cytoplasmic (with a small but consistent majority apical), while almost all rhoptries are found at the apical end (>90%) in SG spz. Presumably, after the initial apical rhoptries are 'used up' during SG invasion, the rhoptries that were previously cytoplasmic take their place. In RON11cKD the ratio of apical:cytoplasmic rhoptries is fairly similar to control oocyst spz. In RON11cKD SG spz, the proportion of cytoplasmic rhoptries decreases but not to the same extent as in wildtype or RON11ctrl. From this, we infer that the two rhoptries that are lost/not made in RON11cKD sporozoites are likely a combination of both the apical and cytoplasmic rhoptries we find in control sporozoites.

      __in panel G: Are the dense granules not micronemes? What are the dark lines? Rhoptries?? __

      We have labelled all of Figure 1 more clearly to point out that the 'dark lines' are indeed rhoptries. Additionally, we have renamed the 'protein-dense granules' to 'protein-rich granules', as the original name could be read as suggesting that these structures are dense granules, the secretory organelle. At this stage we simply do not know what all of these granules are. The observation that some but not all of these granules contain CSP (Supplementary Figure 2) suggests that they may represent heterogeneous structures. It is indeed possible that some are micronemes, however, we think it is unlikely that they are all micronemes for a number of reasons: (1) micronemes are not nearly this protein dense in other Plasmodium lifecycle stages, (2) some of them carry CSP which has not been demonstrated to be micronemal, (3) very few of these granules are present in SG sporozoites, which would be unexpected because microneme secretion is required for hepatocyte invasion.

      __Figure 2 seems to add little extra compared to the following figures and could in my view go to the supplement. __

      We agree that Figure 2b adds little and so have moved that to Supplementary Figure 2, but think that the relative ease with which one can distinguish whether sporozoites are in the secretory cavity or an SG epithelial cell is a key observation because of the difficulty in doing this by conventional IFA.

      __On page 8 the authors mention a second layer of CSP but do not further investigate it. It is likely hard to investigate this further but to just let it stand as it is seems unsatisfactory, considering that CSP is the malaria vaccine. What happens if you add anti-CSP antibodies? I would suggest to shorten the opening paragraphs of this paper and to focus on the rhoptries. This could be done by toning down the text on all aspects that are not rhoptries and point to the open question some of the observations such as the CSP layers raise for future studies. __

      When writing the manuscript, we were unsure whether to include this data at all as it is a purely incidental finding. We had no intention of investigating CSP specifically, but anti-CSP antibodies were included in most of the salivary gland imaging experiments so we could more easily find sporozoites. Given the tremendous importance of CSP to the field, we figured that these observations were potentially important enough that they should be reported in the literature even though they are not something we have the intention or resources to investigate subsequently. Additionally, after consultation with other microscopists we think there is a reasonable chance that this double-layer effect could be a product of chemical fixation. To account for this, we have qualified the paragraph on CSP with this sentence:

      "We cannot determine if there is any functional significance of this second CSP layer and considering that it has not been observed previously it may well represent an artefact of chemical (paraformaldehyde) fixation."

      __Maybe include more detail of the differences between species on rhoptry structure into Figure 4. I would encourage to move the Data on rhoptries in Figure S6 to the main text ie to Figure 4. __

      We have moved the images of developing rhoptries in *P. falciparum* (previously Figure S6a and b) into Figure 4.

      Figure S8 (previously S6c) now consists only of the MG spz rhoptry quantification.

      Manuscript structural changes (Reviewer #1)

      1. Abstract: don't focus on technique but on the questions you tried to answer (ie rewrite or delete the 3rd and 4th sentence)

      2. 'range of cell biology processes' - I understand the paper that the key discovery concerns rhoptry biogenesis and function, so focus on that, all other aspects appear rather peripheral.

      3. 'Much of this study focuses on the secretory organelles': I would suggest to rewrite the intro to focus solely on those, which yield interesting findings.

      4. Page 11: I am tempted to suggest the authors start their study with Figure 3 and add panel A from Figure 2 to it. This leads directly to their nice work on rhoptries. Other features reported in Figures 1 and 2 are comparatively less exciting and could be moved to the supplement or reported in a separate study. Page 23: I suggest to delete the first sentence and focus on the functional aspects and the discoveries.

      5. __Maybe add a conclusion section rather than a future application section, which reads as if you want to promoted the use of ultrastructure expansion microscopy. To my taste the technological advance is a bit overplayed considering the many applications of this techniques over the last years, especially in parasitology, where it seems widely used. In any case, please delete 'extraordinarily' __

      Response to Reviewer#1 manuscript structural changes 1-5: This reviewer considers the findings related to rhoptry biology as the most significant aspect of the study and suggests rewriting the manuscript to emphasize these findings specifically. Doing so might make the key findings easier to interpret. However, in our view, this approach could misrepresent how the study originated and what we see as the most important outcomes. We did not develop MoTissU-ExM specifically to investigate rhoptry biology. Instead, this technique was created independently of any particular biological question, and once established, we asked what questions it could answer, using rhoptry biology as a proof of concept. Given the authors' previous work and available resources, we chose to focus on rhoptry biology. Since this was driven by basic research rather than a specific hypothesis, it's important to acknowledge this in the manuscript. While we agree that the findings related to rhoptry biology are valuable, we believe that highlighting the technique's ability to observe organelles, structures, and phenotypes with unprecedented ease and detail is more important than emphasizing the rhoptry findings alone. For these reasons, we have decided not to restructure the manuscript as suggested.


      Reviewer #2

      Minor text changes (Reviewer #2)

      1. __The 'image Z-depth' value indicated in the figures is ambiguous. It is not clear whether this refers to the distance from the coverslip surface or the starting point of the z-stack image acquisition. A precise definition of this parameter would be beneficial. __

      In the legend of Figure 1, the image Z-depth has been clarified as "sum distance of Z-slices in max intensity projection".

      2. __Paragraph 3 of the introduction - line 7, "handful or proteins" should be handful of proteins __

      This has been corrected.

      3. __Paragraph 5 of the introduction - line 7, "also able to observed" should be observe __

      This has been changed.

      4. __In the final paragraph of the introduction - line 1, "leverage this new understand" should be understanding __

      This has been fixed.

      5. __The first paragraph of the discussion summary contains an incomplete sentence on line 7, "PbRON11ctrl-infected SGs." __

      This has been removed.

      6. __The second paragraph of the discussion - line 10, "until cytokinesis beings" should be begins __

      This mistake has been corrected.

      7. __One minor point that author suggest that oocyst diameter is not appropriate for the development of sporozoite develop. This is not so true as oocyst diameter tells between cell division and cell growth so it is important parameter especially where the proliferation with oocyst does not take place but the growth of oocyst takes place. __

      We agree that this was not highlighted enough in the text. The final sentence of the results section about this now reads:

      "While diameter is a useful readout for oocyst development in the early stages of its growth, this suggests that diameter is a poor readout for oocyst development once sporozoite formation has begun and highlights the usefulness of the segmentation score as an alternative.", and the final sentence of the discussion section about this now reads "Considering that oocyst size does not plateau until cytokinesis begins4, measuring oocyst diameter may represent a useful biological clock specifically when investigating the early stages of oocyst development."

      8. __How is the apical polarity different to merozoite as some conoid genes are present in ookinete and sporozoite but not in merozoite. __

      Our hypothesis is that apical polarity is established by the positioning and attachment of the centriolar plaque to the parasite plasma membrane in both forming merozoites and sporozoites. While the apical polar ring proteins are obviously present at the apical end, and have important functions, we think that they themselves are unlikely to regulate polarity establishment directly. Additionally, it seems that the apical polar rings are visible in forming sporozoites far before the comparable stages of merozoite formation. An important note here is that, at this point, these are largely inferences based on observational differences, and there is relatively little functional data on proteins that regulate polarity establishment at any stage of the Plasmodium lifecycle.

      9. __Therefore, I think that electron microscopy remains essential for the observation of such ultra-fine structures __

      We have added a paragraph in the discussion that provides a clearer comparison between MoTissU-ExM and other imaging modalities previously applied to mosquito-stage parasites (see response to Reviewer#1 (Minor text changes) comment #17).

      10. __The author have not mentioned that sometimes the stage oocyst development is also dependent on the age of mosquito and it vary between different mosquito gut even if the blood feed is done on same day. __

      In our opinion this can be inferred through the more general statement that "development of each oocyst within a midgut is asynchronous..."


      Figure changes (Reviewer #2)

      1. __Fig 3B: stage 2 and 6 does not show the DNA cyan, it would-be good show the sate of DNA at that particular stage, especially at stage 2 when APR is visible. And box the segment in the parent picture whose subset is enlarged below it. __

      We completely agree with the reviewer that the stage 2 image would benefit from the addition of a DNA stain. Many of the images in Figure 3b were acquired from samples that did not have a DNA stain, and so in these *P. yoelii* samples we did not find examples of all segmentation scores with the DNA stain. Examples of segmentation score 2 and 6 for *P. berghei*, and 6 for *P. falciparum*, can be found with DNA stains in Figure S8.

      2. __For clarity, it would be helpful to add indicators for the centriolar plaques in Figure 1b, as their locations are not immediately obvious. __

      The CPs in Figure 1a and 1b have been circled on the NHS ester only panel for clarity.

      3. __Regarding Figure 1c, the authors state that 'the rootlet fiber is visible'. However, such a structure cannot be confirmed from the provided NHS ester image. Can the authors present a clearer image where the rootlet fibre is more distinct? Furthermore, please provide the basis for identifying this structure as a rootlet fiber based on the NHS ester observation alone. __

      The image in Figure 1c has been replaced with one that more clearly shows the rootlet fibre.

      Based on electron microscopy studies, the rootlet fibre has been defined as a protein dense structure that connects the centriolar plaque to the apical polar rings (PMID: 17908361). Through NHS ester and tubulin staining, we could identify the apical polar rings and centriolar plaque as sites on the apical end of the parasite and nucleus that microtubules are nucleated from. There is a protein dense fibre that connects these two structures. Based on the fact that the protein density of this structure was previously considered sufficient for its identification by electron microscopy, we consider its visualisation by NHS ester staining sufficient for its identification by U-ExM.

      __Fig 1B - could the tubulin image in the hemispindle panel be made brighter? __

      The tubulin staining in this panel was not saturated, and so this change has been made.

      __Fig 4A - the green text in the first image panel is not visible. Also, the cyan text in the 3rd image in Fig 1A is also difficult to see. There's a few places where this is the case __

      We have made all microscopy labels legible at least when printed in A4/Letter size.

      __Fig 6A - how do the authors know ron11 expression is reduced by 99%? Did they test this themselves or rely on data from the lab that gifted them the construct? Also please provide mention the number of oocyst and sporozoites were observed. __

      The way Figure 6a was previously designed and described was an oversight that wrongly suggested we had quantified a >99% reduction in *ron11* expression. The 99% reduction has been removed from Figure 6a and the corresponding part of the figure legend has been rewritten to emphasise that this was previously established:

      "(a) Schematic showing previously established Ron11Ctrl and Ron11cKD parasite lines where ron11 expression was reduced by >99%9."

      As to the second part of the question, we did not independently test either protein or RNA level expression of RON11, but we were gifted the clonal parasite lines established by Prof. Ishino's lab in PMID: 31247198 not just the genetic constructs.

      __Fig 6E - are the data point colours the wrong way round on this graph? Just looking at the graph it looks as though the RON11cKD has more rhoptries than the control which does not match what is said in the text. __

      Thank you for pointing out this mistake, the colours have now been corrected.

      __Fig S8C, PbRON11 ctrl, pie chart shows 89.7 % spz are present in the secretory cavity while the text shows 100 %, 35/35 __

      The text saying 100% (35/35) only considered salivary glands that were infected (i.e. uninfected SGs were removed from the count). The two sentences that report this data have been clarified to reflect this better:

      *"Of PbRON11ctrl SGs that were infected (35/39), 100% (35/35) contained sporozoites in the secretory cavity (Figure S8c). Conversely of infected PbRON11cKD SGs (59/82), only 24% (14/59) contained sporozoites within the secretory cavity (Figure S9d)."*
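The pie-chart and text percentages differ only in their denominator. As a minimal sketch of that arithmetic in Python, using the counts quoted above (the `percent` helper and the variable names are ours, for illustration only):

```python
def percent(numerator: int, denominator: int) -> float:
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts quoted in the response: (all SGs dissected, infected SGs,
# SGs with sporozoites in the secretory cavity).
ctrl_total, ctrl_infected, ctrl_cavity = 39, 35, 35
ckd_total, ckd_infected, ckd_cavity = 82, 59, 14

# Denominator = all dissected SGs (the values shown in the pie charts).
ctrl_of_all = percent(ctrl_cavity, ctrl_total)
ckd_of_all = percent(ckd_cavity, ckd_total)

# Denominator = infected SGs only (the values reported in the text).
ctrl_of_infected = percent(ctrl_cavity, ctrl_infected)
ckd_of_infected = percent(ckd_cavity, ckd_infected)

print(ctrl_of_all, ctrl_of_infected)  # pie chart vs text, control line
print(ckd_of_all, ckd_of_infected)    # pie chart vs text, knockdown line
```

With all dissected glands as the denominator the values match the pie charts (89.7% and 17.1%); restricting to infected glands gives the 100% and roughly 24% reported in the text.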

      __Fig S9D shows that RON11 ckd contains 17.1% sporozoites in secretory cavity while the text says 24%. __

      Please see the response to Reviewer#2 Figure Changes Comment #8 where this was addressed.


      Experimental changes (Reviewer #2)

      1. __Why do the congruent rhoptries have similar lengths to each other, while the dimorphic rhoptries have different lengths? Is this morphological difference related to the function of these rhoptries? __

      We hypothesise that this morphological difference arises because the congruent rhoptries are 'used' during SG invasion, while the dimorphic rhoptries are utilized during hepatocyte invasion. It is not straightforward to test this functionally at this point, as no protein is known to have differential localization between the two. Additionally, RON11 is likely directly involved in both SG and hepatocyte invasion through a secreted portion of the protein (as seen in RBC invasion). Therefore, RON11cKD sporozoites may have combined defects, meaning we cannot assume any defect is solely due to the absence of two rhoptries. Determining this functionally is of high interest to our research groups and remains an area of ongoing study, but it is beyond the scope of this study.

      2. __Would it be possible to show whether RON11 localises to the dimorphic rhoptries, the congruent rhoptries, or both, by using expansion microscopy and a parasite line that expresses RON11 tagged with GFP or a peptide tag? __

      We do not have access to a parasite line that expresses a tagged copy of RON11, or anti-PbRON11 antibodies. Based on previously published localisation data, however, it seems likely that RON11 localises to both sets of rhoptries. In Figure 1c of PMID: 31247198, RON11 (in green) seems to have a more basally-extended localisation in midgut (MG) sporozoites than in salivary gland (SG) sporozoites. From this we infer that RON11 is seen in both pairs of rhoptries in the MG sporozoite, but only in the one remaining pair in the SG sporozoite.


      __The knockdown of RON11 disrupts the rhoptry structure, making the dimorphic and congruent rhoptries indistinguishable. Does this suggest that RON11 is important for the formation of both types of rhoptries? I believe that it would be crucial to confirm whether RON11 localises to all rhoptries or is restricted to specific rhoptries for a more precise discussion of RON11's function. __

      Based on our analysis, it does indeed seem that RON11 is important for both types of rhoptries, because when RON11 is not expressed sporozoites still have both apical and cytoplasmic rhoptries (i.e. not just one pair is lost; see Reviewer #1 Experimental changes comment #1).

      __The authors state that 64% of RON11cKD SG sporozoites contained no rhoptries at all. Does this mean RON11cKD SG sporozoites used up all rhoptries corresponding to the dimorphic and congruent pairs during SG invasion? If so, this contradicts your claims that sporozoites are 'leaving the dimorphic rhoptries for hepatocyte invasion' and that 'rhoptry pairs are specialized for different invasion events'. If that is not the case, does it mean that RON11cKD sporozoites failed to form the rhoptries corresponding to the dimorphic pair? A more detailed discussion would be needed on this point and, as I mentioned above, on the specific role of RON11 in the formation of each rhoptry pair. __

      We do not agree that this constitutes a contradiction; instead, more nuance is needed to fully explain the phenotype. As shown in the new graph added in response to Reviewer#1 Figure changes comment #1 in RON11cKD oocyst sporozoites, 64% of all rhoptries are located at the apical end. Our hypothesis is that these rhoptries are used for SG invasion and, therefore, would not be present in RON11cKD SG sporozoites. Consequently, the fact that 64% of RON11cKD sporozoites lack rhoptries is exactly what we would expect. Essentially, we predict three slightly different 'pathways' for RON11cKD sporozoites: If they had 2 apical rhoptries in the oocyst, we predict they would have zero rhoptries in the SG. If they had 2 cytoplasmic rhoptries in the oocyst, we predict they would have two rhoptries in the SG. If they had one apical and one cytoplasmic rhoptry in the oocyst, we predict they would have one rhoptry in the SG. In any case, we expect the apical rhoptries to be 'used up,' which appears to be supported by the data.
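The three predicted 'pathways' can be written as a single rule. The Python sketch below only restates the hypothesis above (apical rhoptries are discharged during SG invasion, cytoplasmic rhoptries are retained); it is not an analysis of the data, and the function name is hypothetical.

```python
def predicted_sg_rhoptries(apical: int, cytoplasmic: int) -> int:
    """Predicted rhoptry count in an SG sporozoite.

    Assumption (the hypothesis stated in the response): every apical
    rhoptry is 'used up' during SG invasion, so only the cytoplasmic
    rhoptries present in the oocyst sporozoite remain afterwards.
    """
    if apical < 0 or cytoplasmic < 0:
        raise ValueError("rhoptry counts must be non-negative")
    return cytoplasmic


# The three oocyst-sporozoite configurations described for RON11cKD.
for apical, cytoplasmic in [(2, 0), (0, 2), (1, 1)]:
    remaining = predicted_sg_rhoptries(apical, cytoplasmic)
    print(f"{apical} apical + {cytoplasmic} cytoplasmic -> {remaining} in SG")
```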

      __Out of pure curiosity, is it possible to measure the length and number of subpellicular microtubules in the sporozoites observed in this study using expansion microscopy? __

      We have performed an analysis of subpellicular microtubules which is now included as Supplementary Figure 2. We could not always distinguish every SPMT from each other and so have not quantified SPMT number. We have, however, quantified their absolute length on both the 'long side' and 'short side', their relative length (as % sporozoite length) and the degree to which they are polyglutamylated.

      A description of this analysis is now found in the results section as follows: *"We quantified the length and degree of polyglutamylation of SPMTs on the 'long side' and 'short side' of the sporozoite (Figure S2). 'Short side' SPMTs were on average 33% shorter (mean = 3.6 µm ± SD 1.0 µm) than 'long side' SPMTs (mean = 5.3 µm ± SD 1.5 µm) and extended 17.4% less of the total sporozoite length. While 'short side' SPMTs were significantly shorter, a greater proportion of their length (87.9% ± SD 11.2%) was polyglutamylated compared to 'long side' SPMTs (69.4% ± SD 13.8%)."*

      Supplementary Figure 2: Analysis of sporozoite subpellicular microtubules. Isolated P. yoelii salivary gland sporozoites were prepared by U-ExM and stained with anti-tubulin (microtubules) and anti-PolyE (polyglutamylated SPMTs) antibodies. SPMTs were defined as being on either the 'long side' (nucleus distant from plasma membrane) or 'short side' (nucleus close to plasma membrane) of the sporozoite as depicted in Figure 1f. (a) SPMT length along with (b) SPMT length as a proportion of sporozoite length were both measured. (c) Additionally, the proportion of the SPMT that was polyglutamylated was measured. Analysis comprises 25 SPMTs (11 long side, 14 short side) from 6 SG sporozoites. ** = p

      The following section has also been added to the methods to describe this analysis:

      *"Subpellicular microtubule measurement

      • To measure subpellicular microtubule length and polyglutamylation maximum intensity projections were made of sporozoites stained with NHS Ester, anti-tubulin and anti-PolyE antibodies, and SYTOX Deep Red. The side where the nucleus was closest to the parasite plasma membrane was defined as the 'short side', while the side where the nucleus was furthest from the parasite plasma membrane was defined as the 'long side'. Subpellicular microtubules were then measured using a spline contour from the apical end of the sporozoite to the basal-most end of the microtubule with fluorescence intensity across the contour plotted (Zeiss ZEN 3.8). Sporozoite length was defined as the distance from the sporozoite apical polar rings to the basal complex, measuring through the centre of the cytoplasm. The percentage of the subpellicular microtubule that was polyglutamylated was determined by assessing when along the subpellicular microtubule contour the anti-PolyE fluorescence intensity last dropped below a pre-defined threshold."

      *
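The thresholding step in this method can be illustrated with a short Python sketch. It assumes the intensity profile along the spline contour has been exported as paired position and anti-PolyE intensity values, and it reads "last dropped below a pre-defined threshold" as the most basal point at which intensity is still at or above the threshold. The function name, input format and threshold value are hypothetical; the measurements themselves were made in Zeiss ZEN, not with this code.

```python
from typing import Sequence


def polyglutamylated_percent(
    positions_um: Sequence[float],
    polye_intensity: Sequence[float],
    threshold: float,
) -> float:
    """Percentage of an SPMT contour that is polyglutamylated.

    positions_um: distance along the contour, apical end first.
    polye_intensity: anti-PolyE intensity sampled at each position.
    threshold: pre-defined intensity cut-off (hypothetical value).
    """
    if len(positions_um) != len(polye_intensity) or len(positions_um) < 2:
        raise ValueError("need two or more paired position/intensity samples")
    start = positions_um[0]
    total_length = positions_um[-1] - start
    if total_length <= 0:
        raise ValueError("positions must run from apical to basal end")

    # Most basal sample where the PolyE signal is still at or above threshold.
    last_above = None
    for position, intensity in zip(positions_um, polye_intensity):
        if intensity >= threshold:
            last_above = position
    if last_above is None:
        return 0.0
    return 100.0 * (last_above - start) / total_length


# Made-up profile: signal falls below the threshold halfway along the contour.
example_positions = [0.0, 1.0, 2.0, 3.0, 4.0]
example_intensity = [10.0, 9.0, 8.0, 2.0, 1.0]
example_percent = polyglutamylated_percent(
    example_positions, example_intensity, threshold=5.0
)
print(example_percent)
```

For the made-up profile, the signal stays above the threshold for the first 2 µm of a 4 µm contour, so the function returns 50.0.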

      __In addition to the previous point, in the text accompanying Figure 7a, the authors claim that "64% of PbRON11cKD SG sporozoites contained no rhoptries at all, while 9% contained 1 rhoptry and 27% contained 2 rhoptries". Could this data be used to infer which rhoptry pair are missing from the RON11cKD oocyst sporozoites? Can it be inferred that the 64% of salivary gland sporozoites that had no rhoptries in fact had 2 congruent rhoptries in the oocyst sporozoite stage and that these have been discharged already? __

      Please see the response to Reviewer #2 Experimental Changes Comment #4.

      __Is it possible that the dimorphic rhoptries are simply precursors to the congruent rhoptries? Could it be that after the congruent rhoptries are used for SG invasion, new congruent rhoptries are formed from the dimorphic ones and are then used for the next invasion?____ Would it be possible to investigate this by isolating sporozoites some time after they have invaded the SG and performing expansion microscopy? This would allow you to confirm whether the dimorphic rhoptries truly remain in the same form, or if new congruent rhoptries have been formed, or if there have been any other changes to the morphology of the dimorphic rhoptries. __

      In theory, it is possible that the dimorphic rhoptries are precursors to the congruent rhoptries; specifically, the larger of the two in the dimorphic pair might be a precursor. The smaller one might be as well, but we have no evidence to suggest that this rhoptry lengthens after SG invasion. We are interested in isolating sporozoites from SGs to add a temporal perspective, but currently this isn't feasible. When sporozoites are isolated from SGs, they are collected at all stages of invasion. Additionally, we don't know how long each step of SG invasion takes, so a time-based method might not be effective either. We are developing an assay to better determine the timing of events during SG invasion with MoTissU-ExM, but this is beyond the scope of this study.

      __In the section titled "Presence of PbRON11cKD sporozoites in the SG intercellular space", the authors state that "the majority of PbRON11cKD-infected mosquitoes contained some sporozoites in their SGs, but these sporozoites were rarely inside either the SG epithelial cell or secretory cavity". - this is suggestive of an invasion defect as the authors suggest. Could the authors collect these sporozoites and see if liver hepatocyte infection can be established by the mutant sporozoites? They previously speculate that the two different types of rhoptries (congruent and dimorphic) may be specific to the two invasion events (salivary gland epithelial cell and liver cell infection). __

      It has already been shown that RON11cKD sporozoites fail hepatocyte invasion (PMID: 31247198), even when isolated from the haemolymph and so it seems very unlikely that they would be invasive following SG isolation. As mentioned in the discussion, RON11 in merozoites has a 'dual-function' where it is partially secreted during merozoite invasion in addition to its rhoptry biogenesis functions. Assuming this is also the case in sporozoites, using the RON11cKD parasite line we cannot differentiate these two functions and therefore cannot ascribe invasion defects purely to issues with rhoptry biogenesis. In order to answer this question functionally, we would need to identify a protein that only has roles in rhoptry biogenesis and not invasion directly.

      Reviewer #3

      Minor text changes (Reviewer #3)

      1. __Page 3 last paragraph: ...the molecular mechanisms underlying SG (invasion?) are poorly understood. __

      This has been corrected.

      2. __The term "APR" does not refer to a tubulin structure per se, but rather to the proteinaceous structure to which tubulin anchors. Are there any specific APR markers that can be used in Figure 1C? If not, I recommend avoiding the use of "APR" in this context. __

      The text does not state that the APR is a tubulin structure. Given that it is a proteinaceous structure, we visualise the APRs through protein density (NHS Ester). It has been standard for decades to define APRs by protein density using electron microscopy, and it has previously been sufficient in Plasmodium using expansion microscopy (PMIDs: 41542479, 33705377) so it is unclear why it should not be done so in this study.

      3. __I politely disagree with the bold statements 'Little is known about cell biology of sporozoite formation.....from electron microscopy studies now decades old' (p.3, 2nd paragraph); 'To date, only a handful of (instead of 'or') proteins have been implicated in SG invasion' (p. 4, 1st paragraph). These claims may overlook existing studies; a more thorough review of the literature is recommended. __

      This study includes at least 50 references from papers broadly related to sporozoite biology, covering publications from every decade since the 1970s. The most recent review that discusses salivary gland invasion cites 11 proteins involved in SG invasion. We have replaced "handful" with a more precise term, as it is not the best adjective, but it is hardly an exaggeration.


      Figure changes (Reviewer #3)

      1. __The hypothesis that Plasmodium utilizes two distinct rhoptry pairs for invading the salivary gland and liver cells is intriguing but remains clearly speculative. Are the "cytoplasmic pair" and "docked pair" composed of the same secretory proteins? Are the paired rhoptries identical? How does the parasite determine which pair to use for salivary gland versus liver cell invasion? Is there any experimental evidence showing that the second pair is activated upon successful liver cell invasion? Without such data this hypothesis seems rather premature. __

      We are unaware of any direct protein localisation evidence suggesting that the rhoptry pairs may carry different cargo. However, only a few proteins have been localised in a way that would allow us to determine if they are associated with distinct rhoptry pairs, so this possibility cannot be ruled out either. It seems unlikely that the parasite 'selects' a specific pair, as rhoptries are typically always found at the apical end. What appears more plausible is that the "docked pair" forms first and immediately occupies the apical docking site, preventing the cytoplasmic pair from docking there. Regarding any evidence that the second pair is activated during liver cell invasion, it has been well documented over decades that rhoptries are involved in hepatocyte invasion. If the dimorphic rhoptries are the only ones present in the parasite during hepatocyte invasion, then they must be used for this process.

      2. __The quality of the "Roolet fibre" image is not good and resembles background noise from PolyE staining. Additional or alternative images should be provided to convincingly demonstrate that PolyE staining indeed visualizes the Roolet fibre. It is puzzling that the structure is visible with PolyE staining but not with tubulin staining. __

      This is a logical misinterpretation based on the image provided in Figure 1c. Our intention was not to imply that PolyE staining enables us to see the rootlet fibre but that PolyE and tubulin allow us to see the APR to which the rootlet fibre is connected. There is some PolyE staining that likely corresponds to the early SPMTs that in 1c appears to run along the rootlet fibre, but this is a product of the max-intensity projection. Please see Reviewer#2 Figure Changes Comment #3 for the updated Figure 1c.

      3. __More arrows should be added to Figures 6b and 6c to guide readers and improve clarity. __

      We have added arrows to Figure 6b and 6c which point out what we have defined as normal and aberrant rhoptries more clearly.

      4. __Figure 2a zoomed image of P. yoelii infected SG is different than the highligted square. __

      We agree that the highlighted square and the zoomed area appear different, but this is due to the differing amounts of light captured by the objectives used in these two panels. The entire SG panel was captured with a 5x objective, while the zoomed panel was captured with a 63x objective. Because of this difference, the plane of focus of the zoomed area is hard to distinguish in the whole SG image. The zoomed image is on the 'top' of the SG (closest to the coverslip), while most of the signal you see in the whole SG image comes from the 'middle' of the SG. To demonstrate this more clearly, we have provided the exact region of interest shown in the 63x image alongside a 5x image and an additional 20x image, all of which are clearly superimposable.

      5. __Figure 3 legend: "P. yoelii infected midguts harvested on day 15" should be corrected. More general, yes, "...development of each oocyst within a single midgut is asynchronous." but it is still required to provide the dissection days. __

      We are unsure what the suggested change here is. We do not know what is wrong with the statement about day 15 post infection; that is when these midguts were dissected.


      Experimental changes (Reviewer #3)

      1. __The proposed role of AOR in rhoptry biogenesis appears highly speculative. It is unclear how the authors conclude that "AORs carry rhoptry cargo" solely based on the presence of RON4 within the structure. Inclusion of additional markers to characterize the content of AOR and rhoptries will be essential to substantiate the hypothesis that this enigmatic structure supports rhoptry biogenesis. __

      It is important to note that the hypothesis that AORs, or rhoptry anlagen, carry rhoptry cargo and serve as vehicles of rhoptry biogenesis was proposed long before this study (PMID: 17908361). In that study, it was assumed that structures now called AORs or rhoptry anlagen were developing rhoptries. Although often visualised by EM and presumed to carry rhoptry cargo (PMID: 33600048, 26565797, 25438048), it was only more recently that AORs became the subject of dedicated investigation (PMID: 31805442), where the authors stated that "...AORs could be immature rhoptr[ies]...". We therefore consider our observation that AORs contain the rhoptry protein RON4, which is not known to localize to any other organelle, sufficient to conclude that AORs carry rhoptry cargo and are thus vehicles for rhoptry biogenesis.

      2. __The study of RON11 appears to be a continuation of previous work by a collaborator in the same group. However, neither this study nor the previous one adequately addresses the evolutionary context or structural characteristics of RON11. Notably, the presence of an EF-hand motif is an important feature, especially considering the critical role of calcium signaling in parasite stage conversion. Given the absence of a clear ortholog, it would be interesting to know whether other Apicomplexan parasites harbor rhoptry proteins with transmembrane domains and EF-hand motifs, and if these proteins might respond similarly to calcium stimulation. Investigating mutations within the EF-hand domain could provide valuable functional insights into RON11. __

      We are unsure what suggests that RON11 lacks a clear orthologue. RON11 is conserved across all apicomplexans and is also present in Vitrella brassicaformis (OrthoMCL orthogroup: OG7_0028843). A phylogenetic comparison of RON11 across apicomplexans has previously been performed (PMID: 31247198), and this study provides a structural prediction of PbRON11 with the dual EF-hand domains annotated (Supplementary Figure 9).

      3. __The study cannot directly confirm that membrane fusion occurs between rhoptries and AORs. __

      This is already stated verbatim in the results: "Our data cannot directly confirm that membrane fusion occurs between rhoptries and AORs..."

      4. __It is unclear what leads to the formation of the aberrant rhoptries observed in RON11cKD sporozoites. Since mosquitoes were not screened for infection prior to salivary gland dissection, The defect reports and revisited of RON11 knockdown does not aid in interpreting rhoptry pair specialization, as there was no consistent trend as to which rhoptry pair was missing in RON11cKD oocyst sporozoites. The notion that RON11cKD parasites likely have 'combinatorial defects that effect both rhoptry biogenesis and invasion' poses challenges to understand the molecular role(s) of RON11 on biogenesis versus invasion. Of note, RON11 also plays a role in merozoite invasion. __

      We are unclear about the comment or suggestion here, as the claims that RON11cKD does not help interpret rhoptry pair specialization, and that these parasites have combined defects, are both directly stated in the manuscript.

      5. __Do all SG PbRON11cKD sporozoites lose their reduced number of rhoptries during SG invasion as in Figure 7a (no rhoptries)? __

      Not all RON11cKD SG sporozoites 'use up' their rhoptries during SG invasion. This is quantified in both Figure 7a and the text, which states: "64% of PbRON11cKD SG sporozoites contained no rhoptries at all, while 9% contained 1 rhoptry and 27% contained 2 rhoptries."

      6. __Different mosquito species/strains are used for P. yoelii, P. berghei, and P. falciparum. Does it affect oocyst sizes/stages? Is it ok to compare? __

      We agree that a direct comparison between, for example, *P. yoelii* and *P. berghei* oocyst size would be inappropriate. However, Figure 3c and Supplementary Figure 4 are not direct comparisons between two species, but a summation of all oocysts measured in this study to indicate that the trends we observe transcend parasite/mosquito species differences. Our study was not set up with the experimental power to determine if mosquito host species alter oocyst size.

      7. __While I acknowledge that UExM has significantly advanced resolution capabilities in parasite studies, the value of standard microscopy technique should not be overlooked. Particularly, when discussing the function of RON11, relevant IFA and electron microscopy (EM) images should be included to support claims about RON11's role in rhoptry biogenesis. This would complement the UExM data and substantially strengthen the conclusions. Importantly, UExM can sometimes produce unexpected localization patterns due to the denaturation process, which warrants caution. __

      The purpose of this study is not to discredit, undermine, or supersede other imaging techniques. It is simply to use U-ExM to answer biological questions that cannot be, or have not been, answered using other techniques. Please refer to Reviewer #1, Minor text changes, comment #17 for the new paragraph "Comparison of MoTissU-ExM and other imaging modalities", which addresses this point.

      Both conventional IFA and immunoEM have already been performed on RON11 in sporozoites before (PMID: 31247198). When assessing defects caused by RON11 knockdown, conventional IFA isn't especially helpful because it doesn't allow visualization of individual rhoptries. Thin-section TEM also doesn't provide the whole-cell view needed to draw these kinds of conclusions. Volume EM could likely support these observations, but we don't have access to or expertise in this technique, and we believe it is beyond the scope of this study. It's also important to note that for the defect we observe (missing or abnormal rhoptries), the visualization with NHS ester isn't significantly different from what would be seen with EM-based techniques, where rhoptries are easily identified based on their protein density.

      The statement that "UExM can sometimes produce unexpected localisation patterns due to the denaturation process..." is partially correct but lacks important nuance in this context. Based on our extensive experience with U-ExM, there are two main reasons why the localisation of a single protein may look different when comparing U-ExM and traditional IFA images. First, denaturation: in conventional IFAs, antibodies need to recognize conformational epitopes to bind to their target, whereas in U-ExM, antibodies must recognize linear epitopes. This doesn't mean the target protein's localisation changes, only that the antibody's ability to recognize it does. Second, antibody complexes seem unable to freely diffuse out of the gel, which can result in highly fluorescent signals not related to the target protein appearing in the image, as we have previously reported (PMID: 36993603). Importantly, neither of these factors applies to our phenotypic analysis of RON11 knockdown. All phenotypes described are based solely on NHS Ester (total protein) staining, so the considerations about changes in the localisation of individual proteins are not relevant.

    1. We are experiencing civil strife at this moment due to breakdowns in human-centered discourse and dialogue. Technology is, in part, to blame because, despite its marvelous achievements, it disconnects us from direct human interaction, eroding trust and squandering meaning. We have lost sympathy and absorbed indifference through online echo-chambers or fervent social media chains.

      The passage points out that while technology helps us stay connected, it can also weaken our social ties and make real conversation harder. When so many of our interactions happen through screens, we lose important habits like listening closely, disagreeing respectfully, and seeing each other as an actual human being. Online platforms usually strengthen our existing views instead of encouraging real discussion or empathy, so we end up talking past each other instead of truly connecting. As a result, people can become emotionally distant and only engage with important topics in a shallow way, since complex debates often get reduced to quick comments, likes, or shares.

      The passage also suggests that civility means more than just being polite. It is about creating a shared space where people can disagree without showing contempt. When technology encourages quick reactions and outrage, it becomes harder to slow down, ask honest questions, or admit mistakes. This can lead to more mistrust, and small misunderstandings may quickly turn into bigger social conflicts or even civil strife. It is easy for people to say whatever they want to whomever they want when they don't have to see the other person's face or fully interact with them. Written words can also be misinterpreted based on the "tone of voice" a reader hears in them, even if that is not the tone intended. I think that makes people quicker and more inclined to make their point, regardless of how it may make others feel. I believe it is important to be aware of the impact of words, even written ones, and how they can make others feel, and I hope more people will start to take that into consideration when reacting and responding online.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1

      Reviewer 1 Point 1- The authors describe cortical neuronal counts across several mammalian species, which is quite impressive, but the information on the methods of counting is lacking: how representative are the data used / shown; how many individuals / brains / sections were used for each species considered? Much more detailed description of the quantifications should be provided to judge the validity of this first conclusion.

      Response: We sincerely thank the reviewer for this insightful and constructive suggestion. We agree that the methodological description of our comparative histological analysis, which is the fundamental basis of this study, was insufficient in the original manuscript. Following the reviewer’s advice, we have extensively revised the Materials and Methods section entitled “Nissl staining and neuronal cell number count” (Page 32, Line 15).

      Reviewer 1 Point 2- The authors use several markers of cortical neuron identity to confirm their neuron number measurements, but from the data shown in Figure 1D,E it seems that only some markers (Satb2) show species-differences while others do not (CTIP2 / Tbr1). How do the authors explain this discrepancy - does this mean that it is mainly Satb2 neurons that are increased in number? But if so how to explain the relative increase in subcortical projections shown in Figure S7?

      Response: We appreciate the reviewer’s insightful comments regarding the marker expression patterns. Upon re-evaluating our data in light of your feedback, we agree that the species differences in deep-layer (DL) markers such as Ctip2 and Tbr1 in the adult stage appear relatively modest compared to the robust differences observed in Satb2 and the projection data shown in Figure S8.

      To address this point, we have incorporated a comparison between the adult data (Figure 1) and our findings from P7 (Figure S2). As shown in the revised manuscript, the species differences for all markers are significantly more pronounced at P7 than in the adult. Notably, in the lower layers, rats exhibit a significantly higher number of marker-positive cells across all markers, including those newly added in this revision, compared to mice.

      We offer the following interpretation regarding these temporal differences:

      1. Developmental Relevance: The marker molecules analyzed are well-established regulators of neuronal subtype fate and projection identity during development. Their critical fate-determining functions are primarily exercised during the migration and maturation phases of nascent neurons.
      2. Postnatal Expression Shifts: Whether these molecules maintain functional roles in the fully matured adult brain remains less certain. It is plausible that marker expression may diminish in certain neuronal populations during late postnatal development, leading to the attenuated species differences observed in adults. Consequently, we believe the strong correlation between P7 quantitative data and projection fate provides a biologically sound validation of our hypothesis.

      While we have kept the discussion in the main text concise to maintain focus for the general reader, we have provided comprehensive data in Figure 1 and Figure S2. This ensures that the necessary evidence is readily available for specialists interested in these developmental dynamics.

      Reviewer 1 Point 3- The authors focus their study almost exclusively on somatosensory cortex, but can they comment on other areas (motor, visual for instance)? It would be nice to provide additional comparative data on other areas, at least for some of the parameters examined across mouse and rat. Alternatively the authors should be more explicit in the abstract and description of the study that it is limited to a single area.

      Response: We sincerely appreciate the reviewer’s insightful comment. As suggested, we have revised the Abstract to explicitly state that our current analysis is focused on the somatosensory cortex. Furthermore, based on the data shown in Figure 1B, we have added a discussion regarding the possibility that the species differences observed in the primary somatosensory cortex may be a general feature shared across the entire cerebral cortex, as follows: “This DL-biased thickening in rats was evident in the primary somatosensory area, but is consistently observed throughout the rostral-caudal cortical regions.” (Page 19, Lines 29-31)

      Reviewer 1 Point 4- The authors provide convincing evidence of increased Wnt signaling pathway in the rat. They should show more explicitly how other classical pathways of neurogenic balance / temporal patterning are expressed in their mouse and rat transcriptome data sets. These would include Notch, FGF, BMP, for which all the data should be available to provide meaningful species comparison.

      Response: We sincerely thank the reviewer for this insightful suggestion. Following your advice, we have newly included comparative data on key signaling pathways essential for cortical development (namely Wnt, FGF, Notch, mTOR, SHH, and BMP) across different species. These results are now presented in Figure S17. Rat progenitors show comparable patterns to other species for FGF, mTOR, and Notch signaling, but elevated Wnt and BMP expression, especially at early stages. A detailed heatmap of raw Wnt pathway gene expression across species is also included in the same supplementary figure. We believe these additions provide a more comprehensive evolutionary perspective and significantly strengthen our findings.

      Reviewer 1 Point 5- The alignment of mouse and rat trajectories is very nicely showing a delay at early-mid-corticogenesis. But there is also heterochronic transcriptome at latest stages (end of 5). How can this be interpreted? Does this mean potentially prolonged astrogliogenesis in the rat cortex?

      Response: We sincerely appreciate the reviewer’s insightful comment and the meticulous attention given to our data. Regarding the heterochronic shift observed at Day 5, we agree that this point was not sufficiently addressed in the original manuscript.

      We would like to clarify the two primary reasons for this omission, which are inherent to the current study’s design:

      1. Resolution of Stage Alignment at Temporal Extremes: In our developmental stage alignment analysis, corresponding stages are defined by pairs showing the highest transcriptomic similarity within the sampled range. By definition, the precision of this alignment tends to decrease at the earliest and latest time points of a dataset. Since the "true" biological equivalent might lie outside our sampling window, we must be cautious in interpreting shifts at these temporal boundaries.
      2. Difference in Validation Rigor: Our study prioritized the early stages of deep-layer (DL) neuron production. Consequently, we rigorously defined the onset of neurogenesis in rats (Day 1) using multiple independent methods, including clonal analysis, immunohistochemistry, and gene expression. In contrast, Day 5 was defined simply as five days post-initiation of neurogenesis, without equivalent multi-modal validation. Given that our primary focus is the early phase of neurogenesis, the precision of the transition from late neurogenesis to gliogenesis is relatively lower.

      For these reasons, we believe that an in-depth discussion of the heterochronic shift at Day 5 might lead to over-interpretation. To reflect this more accurately and avoid misleading the reader, we have revised Figure 6F to de-emphasize the Day 5 shift. In addition, we revised the manuscript as follows: “Importantly, while this analysis identified stage pairs with the highest similarity, the correspondence at the edges of the temporal sampling window is inherently less certain than at the center. Consequently, we focus on the notable reflection point at the center of our dataset.” (Page 13, Lines 37-39)
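The boundary caveat described above can be illustrated with a minimal sketch. It assumes each stage is summarized by a short expression vector and that "highest transcriptomic similarity" is approximated by Pearson correlation; the function names (`pearson`, `best_matches`) and the toy profiles are hypothetical and are not taken from the manuscript's pipeline. Each query stage is matched to its most similar reference stage, and matches that land on the first or last sampled reference stage are flagged, since the true counterpart could lie outside the sampling window.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_matches(query, reference):
    """For each query stage, return (best reference index, correlation, at_boundary).

    at_boundary is True when the best match is the first or last sampled
    reference stage, i.e. where the alignment is least trustworthy.
    """
    results = []
    for q in query:
        scores = [pearson(q, r) for r in reference]
        best = max(range(len(reference)), key=scores.__getitem__)
        at_boundary = best in (0, len(reference) - 1)
        results.append((best, scores[best], at_boundary))
    return results

# Toy profiles (hypothetical values): 3 mouse stages, 4 rat stages, 4 genes each.
mouse = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 1.0], [4.0, 3.0, 2.0, 1.0]]
rat = [
    [1.1, 2.1, 2.9, 4.2],
    [1.0, 2.5, 3.5, 3.0],
    [2.1, 1.9, 3.2, 0.9],
    [4.2, 2.8, 2.1, 1.0],
]
matches = best_matches(mouse, rat)
for stage, (idx, r, edge) in enumerate(matches):
    note = " (boundary match, interpret with caution)" if edge else ""
    print(f"mouse stage {stage} -> rat stage {idx}, r = {r:.3f}{note}")
```

In this toy case only the middle mouse stage maps to an interior rat stage, which mirrors the reasoning for focusing on the central correspondence rather than the edges.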

      We believe these changes more faithfully represent the biological scope of our data while maintaining the scientific integrity of our primary conclusions.

      Reviewer 1 Point 6- Figure 7: description implies that module 3 is a subset of module 4, but this is not obvious at all from the panels shown. Please clarify.

      Response: We sincerely appreciate the reviewer’s careful reading of our manuscript. As suggested, we have revised Figure 7 to clarify the hierarchical relationship between Module 3 and Module 4, ensuring that their inclusion is now explicitly presented.


      Reviewer #2

      Reviewer 2 Point 1. The introduction lacks sufficient background and fails to convey the significance of the study. Specifically, why the research was undertaken, what knowledge gap it addresses, and how the findings could be applied. Addressing these questions already in the introduction would enhance the impact of the work and broaden its readership.

      Response: We sincerely appreciate the reviewer’s insightful comment on this point. Our study reports evolutionary insights gained through an unconventional approach: a single-cell level comparison between mice and rats. We agree that clarifying the necessity of this specific approach is crucial for the manuscript. Accordingly, we have added the following two points to the Introduction:

      1. At the end of the first paragraph, we emphasized the current lack of research on the evolutionary adaptation of cortical circuits, despite the established functional importance of evolutionarily conserved circuits (Page 3, Lines 7-10): “Paradoxically, despite the importance of these variations, research has predominantly focused on the conserved aspects of cortical architecture. Consequently, the degree of evolutionary plasticity inherent in these circuits and the cell-intrinsic mechanisms driving their modification remain profoundly enigmatic.”
      2. At the end of the third paragraph, we revised and added text (Page 3, Lines 26-27): “This lack of comparative insight represents a significant gap in our understanding of how conserved developmental programs give rise to species-specific brain architectures.”

      Reviewer 2 Point 2. In figure 5 the authors conclude that "differences in cell cycle kinetics and indirect neurogenesis are unlikely to be the primary factors driving the species-specific variation in DL neuron production. Instead, the temporal regulation of progenitor neurogenic competence, which determines the duration of the DL production phase, provides a more plausible explanation for the greater number of DL subtypes observed in rats". It is not clear to this reviewer how the authors come to this conclusion. Authors observe a significant proportion of mitotic cells in rat VZ from day 1, and a higher constant proportion of mitotic progenitors in SVZ rats compared to mouse (Figure 5C). This points to an early difference in mitotic progenitors that may also lead to increased IP numbers, and potentially an increased number in DL cells, even before day 1. In addition, the higher abundance of IPs in the G2/S phase (statistically significant in 4 of the 7 time points) (Figure 5F), would suggest that this difference might play a role in the species-specific variation of DL neuron production. The authors should estimate cell cycle length instead of just measuring proportions to conclude something about cell cycle kinetics. They can then model growth curves to predict the effect caused if there were differences in cell cycle length between equivalent cell types across species.

      Response: We sincerely thank the reviewer for their careful reading of our manuscript and for pointing out the overstatements in our original descriptions. We agree that a more nuanced interpretation of the data was necessary. In response to these constructive suggestions, we have made the following revisions:

      1. Refinement of Descriptions: We have revised the text to more accurately reflect our findings, specifically noting that the increase in RG division on Day 1 and IP proliferation throughout the neurogenic period showed a significant trend. These features are now described more fairly and cautiously in the revised manuscript (Page 11, Lines 42-46): “Remarkably, while the temporal dynamics of mitotic density were strikingly conserved between the two species, subtle yet discernible species-specific signatures emerged. Specifically, rats exhibited a higher ratio of mitotic cells in the VZ at the onset of neurogenesis, the precise period when DL subtypes are generated in both species. Further assessment of G2/S-phase cells via pulse-EdU labeling (Figure 5D, E)”
      2. Inclusion of Time-lapse Imaging Data: The reviewer is correct that measuring the proportions of M and G2/S phases provides only a limited snapshot of cell cycle dynamics. To gain a more precise insight, we performed primary cultures of neural progenitor cells (NPCs) from Day 1 and conducted live-cell time-lapse imaging. This allowed us to directly quantify the cell cycle duration of mouse and rat NPCs (Figure S9A-C).
      3. Comparative Analysis and Mathematical Modeling: Our new data revealed that the cell cycle lengths of the two species are remarkably similar, with no significant differences observed under these culture conditions. Furthermore, to validate the impact of these findings on overall brain development, we developed a mathematical model based on our experimental data. This model predicts the total number of cells produced over the five-day neurogenic period, providing a more robust theoretical framework for our conclusions (Figure S9D).

      We believe these additions significantly strengthen the manuscript and address the reviewer's concerns regarding the physiological relevance of our observations.
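To make the growth-curve reasoning concrete, here is a minimal sketch of how cell cycle length alone would change neuron output over a fixed five-day window. It is not the model shown in Figure S9D: the function name, the synchronous-division assumption, the self-renewal probability, and the cycle lengths (12 h and 15 h) are all hypothetical values chosen for illustration.

```python
def simulate_neuron_output(cycle_hours, days=5, p_self_renew=0.5,
                           start_progenitors=100.0):
    """Expected neurons produced by a progenitor pool over a fixed window.

    Assumptions (illustrative only): all progenitors divide synchronously
    every `cycle_hours`; each daughter stays a progenitor with probability
    `p_self_renew` and otherwise becomes a neuron.
    """
    n_divisions = int(days * 24 // cycle_hours)  # complete cycles in the window
    progenitors = start_progenitors
    neurons = 0.0
    for _ in range(n_divisions):
        daughters = 2 * progenitors
        progenitors = daughters * p_self_renew
        neurons += daughters * (1 - p_self_renew)
    return neurons

# Hypothetical comparison: identical pools, different cycle lengths.
short_cycle = simulate_neuron_output(cycle_hours=12)
long_cycle = simulate_neuron_output(cycle_hours=15)
print(short_cycle, long_cycle)
```

Under these assumptions a shorter cycle yields more neurons in the same window, so finding similar cycle lengths in both species argues against cell cycle kinetics as the main driver of the difference in output.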

      Reviewer 2 Point 3. In Figure 6 the authors focus only on the mouse and rat datasets. Given the availability of datasets from primates that the author used already for Figure 7, it would give the reader a broader prospective if also these datasets would be integrated in the analysis done for Figure 6, particularly it would be interesting to integrate them in the pseudotime alignment of cortical progenitor. How do human and/or macaque early and late neurogenic phase would compare to mouse and rat in this model?

      Response: We sincerely appreciate the reviewer’s insightful suggestion. In accordance with this comment, we have now incorporated pseudotime alignments of cortical progenitors between primates (human, macaque) and rodents (mouse, rat), presented as pairwise gene expression distance matrices with dynamic time warping in Figure S13. These heatmaps illustrate temporal compression or stretching in progenitor gene expression progression across species. Notably, macaque progenitors show no definitive deviations from rodents, whereas human progenitors exhibit distinct protraction relative to rats and even more so to mice. These additions provide a more comprehensive cross-species perspective without altering the study's core conclusions.
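For readers unfamiliar with dynamic time warping, the following minimal sketch shows how a warping path through a pairwise distance matrix exposes temporal stretching. It is a textbook DTW implementation, not the authors' analysis code, and the two toy trajectories are hypothetical: `slow` follows the same progression as `fast` but lingers at the early states.

```python
def dtw_path(dist):
    """Classic dynamic time warping over a precomputed distance matrix.

    dist[i][j] is the distance between step i of series A and step j of
    series B. Returns (total cost, list of aligned (i, j) index pairs).
    """
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist[i - 1][j - 1] + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Backtrack from the end to recover the optimal alignment.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path

# Toy 1-D "expression trajectories" (hypothetical values).
fast = [0, 1, 2, 3]
slow = [0, 0, 1, 1, 2, 3]  # same progression, protracted early on
dist = [[abs(a - b) for b in slow] for a in fast]
total, path = dtw_path(dist)
print(total, path)
```

Runs of the path where one index stays fixed while the other advances correspond to the temporal stretching or compression that such heatmaps are meant to reveal.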

      Reviewer 2 Point 4. In Figures 6C and 6D, the authors distinguish between cycling and non-cycling NECs and RGCs. Could the authors clarify the rationale behind making this distinction? Could the authors comment on how they interpret the impact of cycling versus non-cycling states on species-specific non-uniform scaling? Do they consider the observed non-linear correspondences to be driven by differences in cell cycle activity?

      Response: We are grateful to the reviewer for their insightful observation. We agree that our initial classification of neural progenitor cell (NPC) populations based on proliferation marker expression levels followed a convention used in other studies but was, in the context of this work, unnecessary and potentially misleading. To avoid further confusion and focus on the core biological question, we have re-organized the data by pooling these populations into a single group. Regarding the concern about species differences in cell cycle kinetics, we believe there is no significant divergence between mice and rats that could explain the observed developmental patterns in temporal progression of neurogenesis. This is supported by two lines of evidence:

      1. Quantitative analysis of pH3-positive cells (Figure 5).
      2. New time-lapse imaging data of primary cultured NPCs, which shows no substantial difference in cell cycle length between the two species (Figure S9).

      These results indicate that the species-specific differences in deep-layer (DL) neuron production are not driven by cell division kinetics. Consequently, we conclude that the non-linear developmental progression of NPCs occurs independently of cell cycle regulation.

      Reviewer 2 Point 5. For the non-uniform scaling in Figure 6F, the authors identify critical inflection points and mention that "the largest delay in rat progenitors occurring where Day 1 and Day 3 progenitors overlapped". It would be good if the authors could discuss what they think all the inflection points represents. How much can it be explained by the heterogeneity within progenitors per time point? There is a clear higher spread of histograms at days 3 and 5, and the histogram at day 5 almost overlaps with day 1. I wonder if the same conclusion about non-uniform scaling would be detected if the distance matrix was built separately for specific cell types, for example only looking at NECs or RGCs.

      Response: We sincerely appreciate the reviewer’s insightful perspective on this point. In alignment with the suggestions from both this reviewer and Reviewer 1 (Point 5), we have updated the manuscript to discuss all identified inflection points. Specifically, we have clarified why our discussion focuses on the correspondence between Mouse Day 1 and Rat Day 3.

      A recognized limitation of our current analytical approach is that it identifies the closest matching expression profiles within the specific timeframes sampled for each species. For stages at the beginning or end of our sampling window, the "true" corresponding stage in the other species may lie outside our sampled range, which naturally limits the strength of any conclusions regarding those boundary points. Consequently, while we can confidently confirm the correspondence between Mouse Day 1 and Rat Day 3—both of which sit centrally within our sampled window—we have intentionally avoided over-interpreting data near the temporal boundaries.

      Regarding the cell types analyzed, this specific analysis was conducted exclusively on NECs and RGs (now shown in Figure 6F). Extensive prior research (Susan McConnell lab, Sally Temple lab, Fumio Matsuzaki lab, Dennis Jabaudon lab, and more) has established that the time-dependent mechanisms governing the fate determination of cortical excitatory neuron subtypes are encoded within RGs. Therefore, we focused our investigation on these lineages and did not include other cell types in this study. We believe this focused approach maintains the highest degree of biological relevance for our conclusions.

      Reviewer 2 Point 6. The authors conclude that the elevated and prolonged expression of Wnt-ligand genes in rat RGs extend the DL neurogenic window and contribute to rat-specific expansion of deep cortical layer. In order to validate this finding it would be good for the authors to perform a perturbation experiment and reduce Wnt signalling/ Axin 2 levels in rats or depleted the Lmx1a and Lhx2 double-positive population.

      Response: We thank the reviewer for this insightful suggestion. We agree that providing direct experimental evidence is crucial to demonstrating that elevated Wnt signaling in RG progenitors drives the production of DL subtype neurons in rats. To address this, we performed a functional intervention on Day 3, a stage when Wnt signaling (indicated by Axin2 expression) is significantly higher in rats than in mice (Figure 7C, D). By introducing a dominant-negative form of TCF7L2 (dnTCF7L2) to inhibit Wnt signaling specifically in RG progenitors, we tracked the fate of the resulting neurons (Figure 7I, J). Our results showed a clear reduction in the proportion of DL neurons, accompanied by a reciprocal increase in upper-layer (UL) neurons. These findings demonstrate that maintained high levels of Wnt signaling are essential for the prolonged neurogenic capacity for DL neurons in rats. These new data have been incorporated into Figure 7.

      Reviewer 2 Point 7. The authors conclude that Wnt signaling is a rat specific effect since they did not observe any clear temporal change in wnt receptors in gyrencephalic species, and only a subset of RG in rats co-express Lmx1a and Lhx2. However, specific Wnt ligands and receptors (Wnt5a, Fzd and Lrp6) seem to be upregulated in human as well (Fig 7G), non RG cells could act as wnt ligand inducers in other species, and it has not been demonstrated that Lmx1a and Lhx2 are the source for Wnt ligand production. I wonder if the authors can completely rule out a role for Wnt in the protracted neurogenesis of other species.

      Response: We sincerely appreciate the reviewer’s insightful and broad perspective regarding Wnt signaling dynamics across diverse species. In this study, our primary focus was to elucidate the specific mechanisms underlying the differences between mice and rats. Consequently, we did not explore Wnt dynamics in other species, or their roles in developmental timing, in great depth in the original manuscript. We fully acknowledge that lineage-specific adaptations occur at the individual gene level; for instance, Silver and colleagues have reported that human-specific upregulation of the Wnt receptor gene FZD8 modulates neural progenitor behavior (Boyd et al., Current Biology 2008, Liu et al., Nature 2025). However, our comparative analysis of five mammalian species, carefully aligned by developmental stage, reveals a distinct global trend. While individual gene variations exist, such as human FZD8, the expression levels of multiple Wnt-related genes, particularly ligands, are markedly higher in rats than in the other four species.

      Following the reviewer’s insightful suggestion, we examined the potential role of Lmx1a in activating Wnt ligand transcription in rat cortical progenitors by analyzing their expression correlation at the single-cell level. Our analysis revealed that several Wnt ligand genes are co-expressed with Lmx1a with a remarkably strong positive correlation. While we have not yet experimentally demonstrated the direct transcriptional activation of Wnt ligands by Lmx1a in these cells, this robust correlation at single-cell resolution strongly suggests that Lmx1a regulates Wnt ligand expression. These new findings are now included in Figure 7 and Figure S16, and the corresponding results section (Page 15, Lines 42-44) has been revised accordingly.

      Reviewer 2 Point 8. Minor comments: The RNAscope experiment is currently qualitative. Is it the mRNA copy number per cell equal in both species but more cells are positive in rat, or are there differences in number of mRNA molecules as well? It is not indicated if the RNAscope probes are the same for mouse and rat.

      Response: We sincerely thank the reviewer for this insightful suggestion. Following the comment, we performed RNAscope analysis for Axin2 in both mice and rats and quantified the results (now included in Figure 7D). The new data successfully validate the species differences initially observed in our scRNAseq analysis: specifically, the period of high-level Axin2 expression is significantly extended in rats compared to mice. These findings provide histological evidence that reinforces our conclusions regarding the distinct temporal dynamics between the two species.

      Regarding probe design, the Axin2 RNAscope probes target conserved and corresponding sequences between mouse and rat, with species-specific probes optimized for each organism to ensure maximal specificity and sensitivity. We have updated the Methods section ("Fluorescent in situ hybridization with RNAscope") to include these details.

      Reviewer #3

      Reviewer 3 Point 1. Satb2 is also widely recognized as a deep layer marker. The authors need to perform analysis and quantification in Figs 1 and 4 with other II/III and IV markers such as Cux1 and Rorb.

      Response: We thank the reviewer for their insightful comments regarding the marker specificity. We fully agree that while Satb2 is a robust marker for callosal projection identity, its broad distribution across both deep and upper layers limits its utility as a layer-specific marker. As the reviewer suggested, Cux1 (Layers 2/3) and Rorb (Layer 4) are indeed superior markers for defining laminar identity.

      To address this, we have incorporated new immunohistochemical data for these markers in both the quantification of somatosensory cortical neurons (Figure S2) and the birth-dating analysis (Figure 4).

      Our new findings are as follows:

      1. Layer Quantification (Figure S2): By utilizing Cux1 and Rorb as more specific upper-layer (UL) markers, we confirmed that there are no significant differences in the number of these neurons between mice and rats.
      2. Birth-dating Analysis (Figure 4): These markers allowed us to more precisely define the timing of Cux1/Rorb-positive cell generation, revealing subtle but important differences between the two species.

      While these additions do not alter the fundamental narrative of the original manuscript, they have significantly enhanced the precision and rigor of our analysis. We are grateful to the reviewer for guiding us toward this more robust validation.

      Reviewer 3 Point 2. Rats have larger cortices. Therefore, quantification of neurons should also be normalized to cortical thickness in Fig 1E and also represented with individual data points.

      Response: We sincerely appreciate the reviewer’s constructive suggestion. We agree that normalizing the number of cortical neurons by thickness provides a more rigorous comparison. Accordingly, we have calculated the neuronal density (cell count per unit thickness) for Tbr1- and Ctip2-positive cells and included these data in Figure S2C. Our analysis confirms that these populations are distributed at a significantly higher density in mice compared to rats.

      Furthermore, we have updated the visualization in Figure 1E to display individual data points, ensuring full transparency of the underlying distribution. We believe these revisions, prompted by the reviewer’s insight, have substantially strengthened the clarity and persuasiveness of our manuscript.

      Reviewer 3 Point 3. The clonal analysis in Figs 2 and 3 quantifies GFP and RFP and reports these as neurons. However, without using cell-specific markers, it seems the authors cannot exclude that some progeny are also glia derived from a radial glial progeny. I don't expect all experiments to have this but they must have some measures of both populations to address this possibility. This needs to be addressed to build confidence in the conclusion that there is clonal production of neurons.

      Related to this, the relationship between position and fate is not always 1 to 1. The data summarized in Fig 2G are based on position and not using subtype markers. They should include assessment of markers as they do in Fig 4.

      Response: We sincerely thank the reviewer for this insightful comment. We agree that a clear definition of cell types is essential for the accuracy of clonal analysis.

      In this study, we primarily identified neurons based on their distinct morphological characteristics and performed measurements specifically on these cells. To validate this approach, we confirmed that the vast majority of cells identified as neurons were positive for NeuN and cortical excitatory neuron markers, while remaining negative for glial markers such as Olig2 and SOX9. (Notably, at postnatal day 7, most cells in the glial lineage exist as undifferentiated Olig2-positive progenitors.) These observations support our conclusion that the cells analyzed based on morphology are indeed cortical excitatory neurons.

      As the reviewer rightly pointed out, evaluating cell composition using fate-specific marker expression is the ideal approach. However, our current experimental setup required multiple fluorescence channels for DAPI staining (to assess tissue architecture) and immunostaining for GFP and RFP (to identify labeled clones). Due to these technical constraints regarding available detection channels and host species compatibility, we relied on morphological criteria for the primary analysis.

      To address this concern and ensure the reliability of our findings, we performed additional analyses using a subset of samples. By co-staining retrovirally labeled neurons with cell-fate markers, we obtained results consistent with our other data (Figures 1 and 4) regarding laminar position and marker expression. Based on this consistency, we are confident that our classification based on morphology and laminar position does not alter the fundamental conclusions of this study.

      Reviewer 3 Point 4. In Fig 5, the authors use PH3 as well as EdU to measure differences in indirect neurogenesis. Using EdU and Tbr2 they report more dividing IPs. However they need to measure this over the total number of Tbr2 cells as it is not normalized to differences in Tbr2 cells between species. Are there total differences in Tbr2+ cells when normalized to DAPI as well? Moreover, little analyses is performed to measure any impact on radial glia. As no striking differences were observed in IPs this leaves the cellular mechanism a bit unclear and begs the impact on radial glia. Measuring PH3+ cells in VZ and SVZ is not cell specific nor does it yield information to support the prolonged neurogenesis.

      Response: We sincerely thank the reviewer for this insightful suggestion. We agree that quantifying Tbr2+/EdU+ double-positive cells alone was insufficient to fully capture the IP dynamics. Following the reviewer’s advice, we have now quantified the total population of Tbr2+ cells, normalized to the number of DAPI-stained nuclei. This new analysis reveals that mice and rats exhibit nearly indistinguishable temporal dynamics (Figure S10). When integrated with the original Tbr2+/EdU+ data in Figure 5, these findings suggest that rats maintain a slightly higher IP pool throughout the neurogenic period. This implies that the increased neuronal production in rats is not restricted to a specific phase, but rather occurs consistently across all developmental stages. We believe these additional data significantly strengthen our conclusions.


      Reviewer 3 Point 5. The sc-seq is done in rat and compared to published mouse data from corresponding stages. They conclude species specific differences in progenitor gene expression. I am unsure how appropriate this is. Are similar sequencing platforms used? Can they find similar results if using multiple dataset? There are other datasets that may be used to validate these findings beyond DiBella et al.

      Response: We sincerely thank the reviewer for this insightful comment. We agree that establishing the validity of our analytical approach is crucial for the reader’s confidence in our findings. To address this, we have explicitly stated in the revised manuscript that both our rat scRNAseq data and the publicly available datasets were generated using consistent experimental platforms. This ensures that the integration process is technically sound.

      Revised text (Page 13, Lines 16-18): “After quality control, we integrated these profiles with previously published mouse cortical cell data from corresponding neurogenic stages, which is prepared using the consistent platform with ours (35) (Figure S11).”

      Furthermore, to ensure the robustness of our comparative analysis, we have incorporated an independent dataset (Ruan et al., PNAS 2021) alongside the Di Bella et al. (Nature 2021) data used in the original manuscript. We confirmed that the results obtained using this second dataset are highly consistent with our initial findings, further validating our conclusions across different studies (Figure S13A).

      Reviewer 3 Point 6. Wnt ligand analysis requires validation in situ across developmental stages, to support their conclusions. Ideally they might consider doing some manipulations to provide context to this observation.

      Response: We sincerely thank the reviewer for these insightful suggestions. We agree that validating the spatial expression patterns of Wnt ligands and confirming their expression in rat-specific RG, as suggested by our scRNAseq data, is crucial for strengthening our conclusions.

      Regarding the expression of Wnt3a, a key ligand in cortical development: although immunohistochemical analysis clearly identified Wnt3a expression in the cortical hem, the expression levels in RG within the cortical area were substantially lower than those in the hem, making definitive visualization challenging. To complement these findings and provide more robust evidence, we performed the following additional experiments:

      1. Validation of Wnt signaling levels: Using RNAscope-based in situ hybridization for Axin2, we successfully confirmed the elevated Wnt signaling levels in rat-specific RG (Figure 7C, D), consistent with our scRNAseq findings.
      2. Co-expression analysis: In our scRNAseq dataset, we found a strikingly high correlation between the expression of Lmx1a and Wnt ligand genes in rat cortical progenitors (Figure S16B).
      3. Functional analysis: To test the functional significance of this signaling, we inhibited Wnt signaling by electroporating dominant-negative TCF7L2 into rat RG at E15.5. This manipulation resulted in a subtype shift of the generated neurons toward an upper-layer identity (Figure 7I, J).

      These new results demonstrate that the rat-specific extension of high Wnt signaling levels serves as a fundamental mechanism for the prolonged production of deep-layer (DL) neurons. We are grateful to the reviewer for these suggestions; these additional data have significantly strengthened our core argument that the heterochronic regulation of Wnt signaling states drives the evolution of cortical neuronal composition.

      Reviewer 3 Point 7. Minor concerns-1

      Please separate images in Fig 1D it is very strange to have them all on top of each other.

      Response: We sincerely thank the reviewer for this suggestion. As requested, we have provided individual channel images alongside the merged multicolor panels. We agree that this modification significantly enhances the clarity of our data and makes the results much easier to interpret.

      Reviewer 3 Point 8. Minor concerns-2

      Are data in Fig 4E Edu+Tbr1+EdU+? This should be clarified and would be most accurate.

      Response: We appreciate the reviewer’s suggestion. We have added Y-axis labels to the plots in Figure 4E-K. The cell-counting procedure for these analyses is documented in the caption of Figure 4E-K: “Normalized counts of neurons colabeled for EdU and projection-specific markers, relative to the peak of EdU+ and marker+ cells.”

      Reviewer 3 Point 9. Minor concerns-3

      Fig 4 graphs only have titles without Y axis. Please adjust location of title or repeat for clarity.

      Response: We thank the reviewer for this helpful suggestion. To clarify the definition of the Y-axis, we have now added a descriptive label to the axis in the revised figure.

      Reviewer 3 Point 10. Minor concerns-4

      Fig 4A implies cumulative incorporation which I don't think is being performed here. They should clarify this in the figure.

      Response: We appreciate the reviewer’s insightful comment. To avoid any potential misunderstanding regarding the additivity of the effect, we have revised the illustration in Figure 4A for greater clarity.

      Reviewer 3 Point 11. Minor concerns-5

      Fig 5 needs labels for the actual stages assayed, as illustrated in Fig 4A.

      Response: We thank the reviewer for this helpful suggestion. Following your comment, we have added the developmental stage information (expressed as embryonic days) for both mice and rats in the revised manuscript.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #2

      Evidence, reproducibility and clarity

      Summary:

      Yamauchi et al. performed a comparative anatomical analysis of the layer architecture in the primary somatosensory cortex across 8 mammalian species. Unlike primates, which show an expansion of upper layers (UL), rodents, especially rats, display a pronounced thickening of deep layers (DL). In this study they focus on comparing rats and mice, given the higher abundance of DL neuron subtypes in rats. Using histological analysis, they showed that rats possess significantly more DL neurons per cortical column than mice, while UL neuron counts remain similar. Clonal lineage tracing showed that rat radial glial (RG) progenitors generate more DL neurons, indicating species-specific differences in progenitor neurogenic activity. Birth dating assays confirmed an extended DL neurogenesis phase in rats, followed by a conserved UL generation phase. Single-cell RNA sequencing further revealed that rats maintain an early progenitor state longer than mice, marked by sustained expression of DL-associated genes. Specifically, rat RG progenitors exhibit prolonged and elevated expression of Wnt signaling genes, particularly Wnt ligands. Comparative analysis of published single-cell RNA-Seq across species highlighted that this extended Wnt-high period in rats is exceptional, suggesting a species-specific extension of a conserved neurogenic program.

      Major comments:

      This reviewer thinks the topic is exciting, and the experiments elegant, insightful and well described. The paper is well written and follows a very logical flow, the conclusion for each experiment is supported by the data and they are carefully stated. This reviewer really appreciated the summary illustration included as a panel in each figure, they think that this greatly enhanced the clarity and accessibility of the data presented, especially because species comparison can be difficult to follow.

      In this reviewer's opinion, there are some aspects of the findings that the authors would need to clarify/address to explain the phenotype observed and to enhance the overall significance of this very well-made paper:

      1. The introduction lacks sufficient background and fails to convey the significance of the study: specifically, why the research was undertaken, what knowledge gap it addresses, and how the findings could be applied. Addressing these questions already in the introduction would enhance the impact of the work and broaden its readership.

      2. In Figure 5 the authors conclude that "differences in cell cycle kinetics and indirect neurogenesis are unlikely to be the primary factors driving the species-specific variation in DL neuron production. Instead, the temporal regulation of progenitor neurogenic competence, which determines the duration of the DL production phase, provides a more plausible explanation for the greater number of DL subtypes observed in rats". It is not clear to this reviewer how the authors come to this conclusion. The authors observe a significant proportion of mitotic cells in rat VZ from day 1, and a higher constant proportion of mitotic progenitors in SVZ in rats compared to mice (Figure 5C). This points to an early difference in mitotic progenitors that may also lead to increased IP numbers, and potentially an increased number of DL cells, even before day 1. In addition, the higher abundance of IPs in the G2/S phase (statistically significant in 4 of the 7 time points) (Figure 5F) would suggest that this difference might play a role in the species-specific variation of DL neuron production. The authors should estimate cell cycle length instead of just measuring proportions to conclude something about cell cycle kinetics. They can then model growth curves to predict the effect caused if there were differences in cell cycle length between equivalent cell types across species.

      3. In Figure 6 the authors focus only on the mouse and rat datasets. Given the availability of datasets from primates that the authors already used for Figure 7, it would give the reader a broader perspective if these datasets were also integrated in the analysis done for Figure 6; in particular, it would be interesting to integrate them in the pseudotime alignment of cortical progenitors. How would the human and/or macaque early and late neurogenic phases compare to mouse and rat in this model?

      4. In Figures 6C and 6D, the authors distinguish between cycling and non-cycling NECs and RGCs. Could the authors clarify the rationale behind making this distinction? Could the authors comment on how they interpret the impact of cycling versus non-cycling states on species-specific non-uniform scaling? Do they consider the observed non-linear correspondences to be driven by differences in cell cycle activity?

      5. For the non-uniform scaling in Figure 6F, the authors identify critical inflection points and mention that "the largest delay in rat progenitors occurring where Day 1 and Day 3 progenitors overlapped". It would be good if the authors could discuss what they think all the inflection points represent. How much can it be explained by the heterogeneity within progenitors per time point? There is a clearly higher spread of histograms at days 3 and 5, and the histogram at day 5 almost overlaps with day 1. I wonder if the same conclusion about non-uniform scaling would be reached if the distance matrix was built separately for specific cell types, for example only looking at NECs or RGCs.

      6. The authors conclude that the elevated and prolonged expression of Wnt-ligand genes in rat RGs extends the DL neurogenic window and contributes to the rat-specific expansion of deep cortical layers. In order to validate this finding it would be good for the authors to perform a perturbation experiment and reduce Wnt signalling/Axin2 levels in rats or deplete the Lmx1a and Lhx2 double-positive population.

      7. The authors conclude that Wnt signaling is a rat-specific effect since they did not observe any clear temporal change in Wnt receptors in gyrencephalic species, and only a subset of RG in rats co-express Lmx1a and Lhx2. However, specific Wnt ligands and receptors (Wnt5a, Fzd and Lrp6) seem to be upregulated in human as well (Fig 7G), non-RG cells could act as Wnt ligand inducers in other species, and it has not been demonstrated that Lmx1a and Lhx2 are the source for Wnt ligand production. I wonder if the authors can completely rule out a role for Wnt in the protracted neurogenesis of other species.

      Minor comments:

      The RNAscope experiment is currently qualitative. Is the mRNA copy number per cell equal in both species, with more cells being positive in rat, or are there differences in the number of mRNA molecules as well? It is not indicated whether the RNAscope probes are the same for mouse and rat.

      Significance

      How different species achieve such remarkable differences in brain shape and size remains poorly understood. A critical aspect of this process is the duration of the neurogenic phase: the period during which neural progenitors generate neurons. This phase tends to be extended in species with larger brains and contains multiple neuronal stem cell types in varying proportions. It is thought that this accounts for their increased neuronal numbers. In their search for mechanisms that prolong neurogenesis across species, the authors propose a rat-specific role for Wnt ligands in expanding the neurogenic period in the rat brain. Importantly, they rule out that this mechanism operates in other species, such as primates or ferrets, to achieve similar extensions.

      The study is of high quality, incorporating rigorous lineage-tracing experiments in two species and single-cell RNA sequencing. Previous work established a role for Wnt signaling in regulating early neurogenesis in mice. Here, the authors characterize a novel population of radial glial cells (Lmx1a and Lhx2 double-positive) that may explain increased Wnt ligand secretion in rats. However, functional validation of this mechanism is still lacking. To strengthen its evolutionary relevance, it would be important to determine whether similar effects occur during earlier neural stages in other species (such as neuroepithelium thickening), or whether other cell types have co-opted the proposed Lmx1a-Lhx2 regulatory module in other species.

      From the perspective of a researcher with a stem cell and developmental background focused on neural evo-devo, this manuscript represents a solid and novel contribution. The proposed model of a rat-specific mechanism for extending the neurogenic phase contrasts with the prevailing concept of convergence in mechanisms underlying species-specific cortical development. This raises intriguing questions about how multiple molecular pathways have been co-opted to achieve similar developmental outcomes. Furthermore, we know very little about what determines the duration of specific developmental processes. This work suggests that extended Wnt signaling may account for prolonged neurogenesis in rats compared to mice. Future studies should aim to validate the proposed rat-specific co-option of an Lmx1a-Wnt ligand cascade in cortical radial glia, potentially through relief of Lhx2-mediated repression of Lmx1a.

    1. Calendar Planners and To-Do Lists

      Calendar planners and to-do lists are effective ways to organize your time. Many types of academic planners are commercially available (check your college bookstore), or you can make your own. Some people like a page for each day, and some like a week at a time. Some use computer calendars and planners. Almost any system will work well if you use it consistently.

      Some college students think they don’t need to actually write down their schedule and daily to-do lists. They’ve always kept it in their head before, so why write it down in a planner now? Some first-year students were talking about this one day in a study group, and one bragged that she had never had to write down her calendar because she never forgot dates. Another student reminded her how she’d forgotten a preregistration date and missed taking a course she really wanted because the class was full by the time she went online to register. “Well,” she said, “except for that time, I never forget anything!” Of course, none of us ever forgets anything—until we do.

      Calendars and planners help you look ahead and write in important dates and deadlines so you don’t forget. But it’s just as important to use the planner to schedule your own time, not just deadlines. For example, you’ll learn later that the most effective way to study for an exam is to study in several short periods over several days. You can easily do this by choosing time slots in your weekly planner over several days that you will commit to studying for this test. You don’t need to fill every time slot, or to schedule every single thing that you do, but the more carefully and consistently you use your planner, the more successfully will you manage your time.

      But a planner cannot contain every single thing that may occur in a day. We’d go crazy if we tried to schedule every telephone call, every e-mail, every bill to pay, every trip to the grocery store. For these items, we use a to-do list, which may be kept on a separate page in the planner.

      Check the example of a weekly planner form in Figure 2.5 “Weekly Planner”. (You can copy this page and use it to begin your schedule planning. By using this first, you will find out whether these time slots are big enough for you or whether you’d prefer a separate planner page for each day.) Fill in this planner form for next week. First write in all your class meeting times; your work or volunteer schedule; and your usual hours for sleep, family activities, and any other activities at fixed times. Don’t forget time needed for transportation, meals, and so on. Your first goal is to find all the blocks of “free time” that are left over.

      Remember that this is an academic planner. Don’t try to schedule in everything in your life—this is to plan ahead to use your study time most effectively. Next, check the syllabus for each of your courses and write important dates in the planner. If your planner has pages for the whole term, write in all exams and deadlines. Use red ink or a highlighter for these key dates. Write them in the hour slot for the class when the test occurs or when the paper is due, for example. (If you don’t yet have a planner large enough for the whole term, use Figure 2.5 “Weekly Planner” and write any deadlines for your second week in the margin to the right. You need to know what’s coming next week to help schedule how you’re studying this week.)

      Calendar planners and to-do lists help students organize their time and avoid forgetting important dates. Writing schedules down is more reliable than keeping everything in your head, because everyone forgets things sometimes. Planners are not only for deadlines but also for scheduling study time in advance so work is spread out and less stressful. To-do lists are useful for smaller daily tasks that don’t fit into a planner, helping you stay organized without feeling overwhelmed.

    2. Time Management Strategies for Success

      Following are some strategies you can begin using immediately to make the most of your time:

      Prepare to be successful. When planning ahead for studying, think yourself into the right mood. Focus on the positive. “When I get these chapters read tonight, I’ll be ahead in studying for the next test, and I’ll also have plenty of time tomorrow to do X.” Visualize yourself studying well!

      Use your best—and most appropriate—time of day. Different tasks require different mental skills. Some kinds of studying you may be able to start first thing in the morning as you wake, while others need your most alert moments at another time.

      Break up large projects into small pieces. Whether it’s writing a paper for class, studying for a final exam, or reading a long assignment or full book, students often feel daunted at the beginning of a large project. It’s easier to get going if you break it up into stages that you schedule at separate times—and then begin with the first section that requires only an hour or two.

      Do the most important studying first. When two or more things require your attention, do the more crucial one first. If something happens and you can’t complete everything, you’ll suffer less if the most crucial work is done.

      If you have trouble getting started, do an easier task first. Like large tasks, complex or difficult ones can be daunting. If you can’t get going, switch to an easier task you can accomplish quickly. That will give you momentum, and often you feel more confident tackling the difficult task after being successful in the first one.

      If you’re feeling overwhelmed and stressed because you have too much to do, revisit your time planner. Sometimes it’s hard to get started if you keep thinking about other things you need to get done. Review your schedule for the next few days and make sure everything important is scheduled, then relax and concentrate on the task at hand.

      If you’re really floundering, talk to someone. Maybe you just don’t understand what you should be doing. Talk with your instructor or another student in the class to get back on track.

      Take a break. We all need breaks to help us concentrate without becoming fatigued and burned out. As a general rule, a short break every hour or so is effective in helping recharge your study energy. Get up and move around to get your blood flowing, clear your thoughts, and work off stress.

      Use unscheduled times to work ahead. You’ve scheduled that hundred pages of reading for later today, but you have the textbook with you as you’re waiting for the bus. Start reading now, or flip through the chapter to get a sense of what you’ll be reading later. Either way, you’ll save time later. You may be amazed how much studying you can get done during downtimes throughout the day.

      Keep your momentum. Prevent distractions, such as multitasking, that will only slow you down. Check for messages, for example, only at scheduled break times.

      Reward yourself. It’s not easy to sit still for hours of studying. When you successfully complete the task, you should feel good and deserve a small reward. A healthy snack, a quick video game session, or social activity can help you feel even better about your successful use of time.

      Just say no. Always tell others nearby when you’re studying, to reduce the chances of being interrupted. Still, interruptions happen, and if you are in a situation where you are frequently interrupted by a family member, spouse, roommate, or friend, it helps to have your “no” prepared in advance: “No, I really have to be ready for this test” or “That’s a great idea, but let’s do it tomorrow—I just can’t today.” You shouldn’t feel bad about saying no—especially if you told that person in advance that you needed to study.

      Have a life. Never schedule your day or week so full of work and study that you have no time at all for yourself, your family and friends, and your larger life.

      Use a calendar planner and daily to-do list. We’ll look at these time management tools in the next section.

      The main idea of “Time Management Strategies for Success” is that managing your time well is about working smarter, not just harder. This section gives practical, realistic strategies students can use right away to stay productive, reduce stress, and avoid procrastination—while still having a life.

      In simple terms, it teaches you how to:

      Plan ahead with a positive mindset, so studying feels less stressful and more motivating.

      Use your energy wisely by doing tasks at the time of day when you focus best.

      Break big tasks into smaller, manageable pieces to avoid feeling overwhelmed.

      Set priorities, so the most important work gets done first.

      Build momentum by starting with easier tasks when motivation is low.

      Stay flexible by reviewing your schedule when things feel out of control.

      Ask for help when needed, instead of staying stuck and confused.

      Take regular breaks to avoid burnout and stay mentally fresh.

      Use small pockets of free time during the day to get work done early.

      Avoid distractions, especially multitasking, to keep your focus strong.

      Reward yourself after completing tasks to stay motivated.

      Learn to say no to interruptions without feeling guilty.

      Balance work and life, making time for rest, friends, and personal well-being.

      Use planners and to-do lists to stay organized and on track.

    1. Thinking helps in many situations, as we’ve discussed throughout this chapter. When we work out a problem or situation systematically, breaking the whole into its component parts for separate analysis, to come to a solution or a variety of possible solutions, we call that analytical thinking. Characteristics of analytical thinking include setting up the parts, using information literacy, and verifying the validity of any sources you reference. While the phrase analytical thinking may sound daunting, we actually do this sort of thinking in our everyday lives when we brainstorm, budget, detect patterns, plan, compare, work puzzles, and make decisions based on multiple sources of information. Think of all the thinking that goes into the logistics of a dinner-and-a-movie date—where to eat, what to watch, who to invite, what to wear, popcorn or candy—when choices and decisions are rapid-fire, but we do it relatively successfully all the time.

      I like that the reading shows analytical thinking isn’t just for school or work, it’s something we practice all the time in normal life.

    1. Author response:

      We thank all reviewers for their comments. We appreciate the acknowledgement that the paper is important and that results support the major conclusions. We are planning to address the specific concerns as noted by the reviewers in the following way:

      Public Reviews:

      Reviewer #2 (Public review):

      (1) The authors generate a new tool, a Gal4 knock-in of the jam2b locus, to track EGFP-expressing cells over time and follow the developmental trajectory of jam2b-expressing cells. Figure 1 characterizes the line. However, it lacks quantification, e.g., how many etv2-expressing cells also show EGFP expression or the contribution of EGFP-expressing cells to different types of blood vessels. This type of quantification would be useful, as it would also allow for comparison of their findings to their previous data examining the contribution of SVF cells to different types of blood vessels. All the authors state is that at 30 hpf, EGFP-expressing cells can be seen in the vasculature (apparently the PCV).

      It is not clear why the authors do not use a nuclear marker for both ECs (as they did in their previous publication) and for jam2b-expressing cells. UAS:nEGFP and UAS:NLS-mcherry (e.g. pt424tg) transgenic lines are available. This would circumvent the problem the authors encounter with the strong fluorescence visible in the yolk extension. It would also facilitate quantifying the contribution of jam2b cells to different types of blood vessels.

      We agree with the importance of quantification. We had performed quantification of jam2b<sup>Gt(2A-Gal4)</sup>;UAS:GFP contribution to different vascular beds, which was shown in Suppl. Fig. S3. We will clarify this in the revision. We also agree that nuclear GFP or mCherry would help to visualize and quantify cells. Unfortunately, we do not have a nuclear UAS:GFP or UAS:mCherry line in our possession, and importing one would take too long for the standard revision timeline. We are working on the construct and will attempt to establish the line; we therefore hope to clarify these results with the nuclear line in the revised manuscript.

      (2) The time-lapse movie in Figure 2 is not very informative, as it just provides a single example of a dividing cell contributing to the PCV. Also, quantifications are needed. As SVF cells appear to expand significantly after their initial specification, it would be informative to know how many cell divisions and which types of blood vessels jam2b-expressing cells contribute to. Can the authors observe cells that give rise to different types of blood vessels? Jam2b expression in LPM cells apparently precedes expression of etv2. Is etv2 needed for maintenance, or do Jam2b-expressing cells contribute to different types of tissues in etv2 mutant embryos? Comparing time-lapse analysis in wildtype and etv2 mutant embryos would address this question.

      The time-lapse was meant to serve as an illustration and confirmation of jam2b cell contribution to vasculature. As noted above, Suppl. Fig. S3 provides quantification of jam2b cell contribution to different vascular beds. We had previously performed detailed time-lapse analysis and quantification of SVF cell migration to PCV, SIA and SIV using etv2-2A-Venus line (Metikala et al 2022, Dev Cell), which has some of the same (or similar) information. It is very challenging to obtain this data using jam2b reporter line due to extensive and bright GFP expression in the mesothelial layer over the yolk and yolk extension; for that reason we can only trace some GFP cells but not all of them. Regarding etv2 requirement for jam2b maintenance, we intend to address this question by analyzing jam2b cell contribution in etv2 MO injected embryos, which recapitulates the phenotype in jam2b mutants.

      (3) In Figure 3, the authors generate UAS:Cre and UAS:Cre-ERT2 transgenic lines to lineage trace the jam2b-expressing cells. It is again not clear why the authors do not use a responder line containing nuclear-localized fluorescent proteins to circumvent the strong expression of fluorescent proteins in the yolk extension. It is also unclear why the two transgenic lines give very different results regarding the number of cells being labelled. The ERT2 fusions label around 3 cells in the SIA, while the Cre line labels only about 1.5 cells per embryo, with very little contribution of labelled cells to other blood vessels. One would expect the Cre line requiring tamoxifen induction to label fewer cells when compared to the constitutive Cre line. What is the reason for this discrepancy? Are the lines single integration? Is there silencing? This needs to be better characterized, also regarding the reproducibility of the experiments. If the Cre lines were to be multiple copy integrations, outcrossing the line might lead to lower expression levels in future generations. 

      It is also not clear how the authors conclude from these findings that "SVF cells show major contribution to the SIA and SIV" when only 1.5 or 3 cells of the SIA are labelled, with even fewer cells labelled in other blood vessels. They speculate that this might be due to low recombination efficiency, a question they then set out to answer using photoconversion of etv2:KAEDE expressing cells, an experiment that they also performed in their 2014 and 2022 publications. To check for low recombination efficiency, the authors could examine the expression of Cre mRNA in their transgenic embryos. Do many more jam2b expressing cells express Cre mRNA than they observe in their switch lines? They could also compare their experiments using Cre recombinase with those using EGFP expression in jam2b cells. EGFP is relatively stable, and the time frames the authors analyze are short. As no quantification of EGFP-expressing cells is provided in Figure 1, this comparison is currently not possible. Do these two different approaches answer different questions here? 

      The reviewer brings up important points, which we appreciate. Unfortunately, we do not have a nuclear switch line in our possession, and it is not possible to obtain one within the normal manuscript revision timeline. Regarding the UAS:Cre and UAS:CreERT2 lines, both show rather similar labeling, with most labeled cells present in the SIA. The difference in cell number (1.5 versus 3) is likely due to different levels of Cre expression, which may vary depending on the integration site. The lines are most likely multi-copy integrations, which can be helpful, as this would result in higher Cre expression. We will address the silencing question by performing in situ hybridization or HCR analysis for Cre or CreERT2 and comparing it with endogenous jam2b expression, as the reviewer suggested. We have noticed that the switch line used, actb2:loxP-BFP-loxP-dsRed, exhibits lower recombination frequency than other switch lines (we used it because it was compatible with the endothelial fli1:GFP line). We will attempt to answer this question by crossing to other switch lines, which may exhibit higher recombination frequency. In principle, UAS:GFP and switch lines should produce a similar result, except that GFP decays over time; our initial expectation was therefore that switch lines may produce a more accurate result. However, this may not be the case due to low recombination efficiency, which we will attempt to address in the revision.

      (4) Concerning the etv2:KAEDE photoconversion experiments: The percentages the authors report for SVF cells' contribution to the SIV and SIA differ from their previous study (Dev Cell, 2022). In that publication, SVF cells contributed 28% to the SIA and 48% to the SIV. In the present study, the numbers are close to 80% for both vessels. The difference is that the previous study analyzed 2dpf old embryos and the new one 4dpf old embryos. Do SVF-derived cells proliferate more than PCV-derived cells, or is there another explanation for this change in percentage contribution? 

      These numbers refer to different experiments; we apologize for the confusion. As reported earlier in Metikala et al. 2022, 28% of SVF cells contributed to the SIA and 48% to the SIV by 3 dpf (not 2 dpf; only PCV analysis was done at 2 dpf); SIA and SIV analysis was based on time-lapse image analysis of the etv2-2A-Venus line at 3 dpf, shown in Fig. 3C in Metikala et al. However, this only refers to SVF cell contribution. It does not mean that 28% or 48% of cells in the SIA or SIV are derived from SVF. The total fraction of SIA and SIV cells that are derived from SVF was not quantified in the previous study, because that would require accurate tracking of all SVF cells, which is experimentally challenging. The etv2:Kaede experiment is slightly different, because it reports newly formed cells after 24 hpf. It cannot tell whether new cells are all derived from SVF cells, although we are not aware of any other source of new endothelial cells at these stages. In the previous study by Metikala et al. 2022, we reported ~22 newly formed cells in the SIA and ~50 newly formed cells in the SIV by 3 dpf (Fig. 1 in Metikala et al. 2022), although the total number of cells was not quantified, and therefore the percentage was not known. In the current study, we attempted to estimate the overall percentage of green-only Kaede cells, which was close to 80% in both the SIA and SIV at 4 dpf. Please note that this estimate was performed in the posterior portion of the SIA and SIV that overlies the yolk extension and where SVF cells are observed. We did not quantify cells in the anterior SIV portion, which forms the basket over the yolk.
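
      The distinction drawn in this response, between the share of tracked cells that end up in a vessel and the share of that vessel made up of such cells, can be illustrated with a short calculation. This is only a sketch: the counts below are hypothetical and were chosen to make the two percentages differ; they are not data from either study, and the helper name `share` is an assumption introduced here.

```python
def share(part, whole):
    """Return part/whole as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole

# Hypothetical counts, chosen only to illustrate the distinction.
svf_cells_tracked = 50  # source cells followed by time-lapse
svf_cells_in_sia = 14   # of those, cells that end up in the SIA
sia_cells_total = 18    # all cells counted in the scored SIA segment

# Denominator = source population: "X% of SVF cells contribute to the SIA".
pct_of_svf_going_to_sia = share(svf_cells_in_sia, svf_cells_tracked)

# Denominator = vessel population: "Y% of SIA cells come from the source".
pct_of_sia_from_svf = share(svf_cells_in_sia, sia_cells_total)

print(f"{pct_of_svf_going_to_sia:.0f}% of tracked cells reach the SIA")
print(f"{pct_of_sia_from_svf:.0f}% of SIA cells come from tracked cells")
```

      The same numerator gives very different percentages depending on the denominator, which is why the two figures need not agree.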

      (5) Single-cell sequencing data: Why do the authors not show jam2b expression in their single-cell sequencing data? They sorted for (presumably) jam2b-expressing cells and hypothesize that jam2b expression in ECs at this time point is important for the generation of intestinal vasculature. Do ECs in cluster 15 express jam2b? Why are no other top marker genes (tal1, etv2, egfl7, npas4l) included in the dot plot in Figure 5b?

      We appreciate the suggestion and will include additional marker genes as well as jam2b in the revised version of the manuscript.

      (6) Concerns about cell autonomy of mutant phenotypes: The authors need to perform in situ hybridization to characterize jam2a expression. Can it be seen in SVF cells? The double mutants show a clear phenotype in intestinal vessel development; however, it is unclear whether this is due to a cell-autonomous function of jam2a/b within SVF cells. The authors need to address this issue, as jam2b and potentially also jam2a are expressed within the tissue surrounding the forming SVF. For instance, do transplanted mutant cells contribute to the intestinal vasculature to the same extent as wild-type cells do?

      jam2a expression has been characterized in previous studies and is shown in Suppl. Fig. S4E. It is primarily enriched in the skeletal muscle. However, our single-cell RNA-seq analysis shows that SVF cells also express jam2a. We will include additional data on jam2a expression in the revised manuscript. We agree that transplantation to address cell autonomy is an important experiment, yet there are some practical challenges to it. The jam2a,jam2b mutant phenotype is only partially penetrant: an approximately 50% reduction in SVF cell number, as well as partial SIA and SIV phenotypes, are observed. Only a small number of transplanted cells may contribute to intestinal vasculature, so it may be challenging to see the differences, given the partial penetrance. To address the cell-autonomy question, we will try a different approach. We will overexpress jam2b labeled with 2A-mCherry and test whether it can rescue the mutant phenotype in a cell-autonomous manner. Overexpression will be done in a mosaic manner, with a higher number of cells labeled than in a typical transplantation experiment.

      (7) Finally, the authors analyze the phenotypes of hand2 mutants and their impact on the expression of jam2b and etv2. They observe a reduction in jam2b and etv2 expression in SVF cells. However, they do not show the vascular phenotypes of hand2 mutants. Is the formation of the SIA and SIV disturbed? Is hand2 cell autonomously needed in ECs? The authors suggest that hand2 controls SVF development through the regulation of jam2b. However, they also show that jam2b mutants do not have a phenotype on their own. Clearly, hand2, if it were to be required in ECs, regulates other genes important for SVF development. These might then regulate jam2b expression. The clear linear relationship, as the title suggests, is not convincingly shown by the data.

      As suggested, we will add the analysis of SIA and SIV in hand2 mutants during the revision process. We could not assess that easily because the line was not maintained in vascular fli1:GFP background. We do not know if hand2 is required cell-autonomously. This is an important question, but it may be answered better in a separate study. Regarding hand2-jam2b axis, it is very clear that jam2b expression in the posterior lateral plate mesoderm is completely lost in hand2 mutants, except for its more anterior domain over the yolk. This does support the idea that hand2 functions upstream of jam2b. However, the relationship may not be necessarily direct. We agree that hand2 may regulate additional genes involved in SVF cell development. We will attempt to clarify this relationship and test if jam2b overexpression may rescue hand2 mutant phenotype.

      Reviewer #3 (Public review):

      (1) Overall molecular mechanisms of Jam2 function are not fully uncovered in the study. How do the adhesion molecules Jam2a and Jam2b regulate SVF cell formation? Are they responsible for migration, adhesion or fate determination of these structures? The authors should provide a more in-depth study of the jam2a, jam2b mutations and assess the processes affected in these mutants. Combining these mutants with etv2:Kaede can also provide a stronger causative link between their functions and defects in SVF formation.

      Our data argue that the initial SVF cell specification (based on etv2 expression) is reduced in jam2a;jam2b mutants. We do not know if the migration or fate determination of the remaining SVF cells is also affected, although this may be more challenging to answer, as there are only few SVF cells remaining. We agree that further mechanistic studies of jam2a,jam2b function are needed. However, we think that this would be better addressed in a separate study. We are currently raising mutants crossed into fli1:Kaede line, which should confirm that there are fewer new cells that emerge after Kaede photoconversion in jam2a,jam2b mutants.

      (2) Have the authors tested the specificity of the jam2b knock-in reporter line? This is an important experiment, as many of the conclusions derive from lineage tracing and fluorescence reporting from this knock-in line. One suggestion is to cross the jam2b:GFP or jam2b:Gal4, UAS:GFP line to the generated jam2b mutants, and examine the expression pattern of these lines. Considering that the ISH experiment showed lack of jam2b expression, the reporter line should not be expressed in the jam2b mutants.

      We show in Suppl. Fig. 2 that the jam2b<sup>Gt(2A-Gal4)</sup>;UAS:GFP knock-in line has a similar expression pattern to jam2b mRNA by in situ hybridization, which argues for its specificity. In the revision, we plan to use HCR analysis to confirm that jam2b mRNA is expressed in the same cells as jam2b<sup>Gt(2A-Gal4)</sup>;UAS:GFP, as additional evidence for its specificity. Unfortunately, it is not feasible to cross the jam2b knock-in line into jam2b mutants, as suggested by the reviewer. Because the jam2b knock-in line targets the endogenous jam2b genomic locus, which is very close in the genome to the jam2b promoter deletion in jam2b mutants, the recombination frequency would be very low, and we would not get double jam2b knock-in and knock-out events on the same chromosome.

      (3) The rationale behind the regeneration study is not clear, and the mechanisms underlying the phenotype are not well described. How do the authors explain the phenotype with the impaired regeneration, and what is the significance of this finding as it relates to SVF formation and function? 

      We apologize for this omission. This experiment was more thoroughly described in our previous study by Metikala et al 2022. In that study we showed that when endothelial cells are ablated by treating with MTZ from 6 to 45 hpf, this results in ablation of all vascular endothelial cells except for SVF cells, because they originate later than other cells. We subsequently showed that these SVF cells can partially form PCV and intestinal vasculature, helping them regenerate, which was confirmed by time-lapse imaging. In the current study, we tested if jam2a; jam2b double mutants show defects in such vascular regeneration. Indeed, regeneration after cell ablation was reduced, which correlated with reduction in SVF cell number. This argues that jam2a/b function is required for SVF cell emergence and vascular recovery after endothelial cell ablation. We will provide better description of this experiment and discuss interpretations in the revised manuscript.

      (4) The authors need to include representative images of jam2b>CreERT2 with 4-OH activation at different timepoints in Figure 3.

      Yes, thanks for noting this; these images will be included in the revised manuscript.

      (5) The etv2:Kaede photoconversion experiment to show that the majority of intestinal vasculature derives after 24 hours needs to be supplemented with additional data on photoconverted post-24-hour-old endothelial cells, with the expectation that the majority of intestinal endothelial cells at 4 days will then be labeled with red Kaede. In addition, there have been data that show the red Kaede protein is not stable past several days in vivo, and 3 days might be sufficient for the removal or degradation of this photoconverted protein. Thus, the statement that intestinal vasculature forms largely by new vasculogenesis might be too strong based on existing data.

      It is apparent from Fig. 4B that many other vessels, such as the dorsal aorta and many intersegmental vessels show robust red Kaede expression at 4 dpf, arguing that there is sufficient photoconverted Kaede present at this stage, and its degradation is unlikely to be the reason. However, we are planning to include additional control experiments, as suggested by the reviewer, to make this argument stronger.

      (6) To strengthen the claim that hand2 acts upstream of jam2b, the authors can perform combinatorial genetic epistatic analysis and examine whether jam2b mutations worsen hand2 homozygous or heterozygous effects on the SVF. Similarly, overexpressing jam2b might rescue the loss of SVF/etv2 expression in hand2 mutants. 

      We appreciate this suggestion. Double epistatic analysis, while informative, can be tricky. In this case, we are dealing with jam2a; jam2b redundancy and also the maternal effect. It may take considerable effort to generate different combinations of triple mutant lines (jam2a,jam2b,hand2), and it is unclear whether double or triple heterozygous embryos will show any defects that would clarify their epistatic relationship. Instead, as suggested, we are planning to overexpress jam2b in wild-type and hand2 mutants to address this point.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Wu and colleagues aimed to explain previous findings that adolescents, compared to adults, show reduced cooperation following cooperative behaviour from a partner in several social scenarios. The authors analysed behavioural data from adolescents and adults performing a zero-sum Prisoner's Dilemma task and compared a range of social and non-social reinforcement learning models to identify potential algorithmic differences. Their findings suggest that adolescents' lower cooperation is best explained by a reduced learning rate for cooperative outcomes, rather than differences in prior expectations about the cooperativeness of a partner. The authors situate their results within the broader literature, proposing that adolescents' behaviour reflects a stronger preference for self-interest rather than a deficit in mentalising.

      Strengths:

      The work as a whole suggests that, in line with past work, adolescents prioritise value accumulation, and this can be, in part, explained by algorithmic differences in weighted value learning. The authors situate their work very clearly in past literature, and make it obvious the gap they are testing and trying to explain. The work also includes social contexts that move the field beyond non-social value accumulation in adolescents. The authors compare a series of formal approaches that might explain the results and establish generative and model-comparison procedures to demonstrate the validity of their winning model and individual parameters. The writing was clear, and the presentation of the results was logical and well-structured.

      We thank the reviewer for recognizing the strengths of our work.

      Weaknesses:

      (Q1) I also have some concerns about the methods used to fit and approximate parameters of interest. Namely, the use of maximum likelihood versus hierarchical methods to fit models on an individual level, which may reduce some of the outliers noted in the supplement, and also may improve model identifiability.

      We thank the reviewer for this suggestion. Following the comment, we added a hierarchical Bayesian estimation. We built a hierarchical model with both group-level (adolescent group and adult group) and individual-level structures for the best-fitting model. Four Markov chains with 4,000 samples each were run, and the model converged well (see Figure supplement 7).

      We then analyzed the posterior parameters for adolescents and adults separately. The results were consistent with those from the MLE analysis (see Figure 2—figure supplement 5). These additional results have been included in the Appendix Analysis section (also see Figure supplement 5 and 7). In addition, we have updated the code and provided the link for reference. We appreciate the reviewer’s suggestion, which improved our analysis.

      (Q2) There was also little discussion given the structure of the Prisoner's Dilemma, and the strategy of the game (that defection is always dominant), meaning that the preferences of the adolescents cannot necessarily be distinguished from the incentives of the game, i.e. they may seem less cooperative simply because they want to play the dominant strategy, rather than a lower preference for cooperation if all else was the same.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma.

      However, our computational modeling explicitly addressed this possibility. Model 4 (inequality aversion) captures decisions that are driven purely by self-interest or aversion to unequal outcomes, including a parameter reflecting disutility from advantageous inequality, which represents self-oriented motives. If participants’ behavior were solely guided by the payoff-dominant strategy, this model should have provided the best fit. However, our model comparison showed that Model 5 (social reward) performed better in both adolescents and adults, suggesting that cooperative behavior is better explained by valuing social outcomes beyond payoff structures.

      In addition, if adolescents’ lower cooperation reflected a strategic response to the payoff structure, with defection adopted as the more rewarding option, then adolescents should show reduced cooperation across all rounds. Instead, adolescents and adults behaved similarly when partners defected, but adolescents cooperated less when partners cooperated and showed little increase in cooperation even after consecutive cooperative responses. This pattern suggests that adolescents’ lower cooperation cannot be explained solely by strategic responses to payoff structures but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded our Discussion to acknowledge this important point and to clarify how the behavioral and modeling results address the reviewer’s concern.

      “Overall, these findings indicate that adolescents’ lower cooperation is unlikely to be driven solely by strategic considerations, but may instead reflect differences in the valuation of others’ cooperation or reduced motivation to reciprocate. Although defection is the payoff-dominant strategy in the Prisoner’s Dilemma, the selective pattern of adolescents’ cooperation and the model comparison results indicate that their reduced cooperation cannot be fully explained by strategic incentives, but rather reflects weaker valuation of social reciprocity.”
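
      The dominance argument raised in Q2 can be made concrete with a small check. This is a sketch under stated assumptions: the payoff values are the conventional textbook ordering (temptation > reward > punishment > sucker) and are hypothetical, not the payoffs used in the study; `is_strictly_dominant` is a helper name introduced here for illustration.

```python
# Hypothetical payoffs to "me", indexed by (my move, partner's move).
# C = cooperate, D = defect. These are NOT the study's actual payoffs.
PAYOFF = {
    ("C", "C"): 3,  # reward for mutual cooperation
    ("C", "D"): 0,  # sucker's payoff
    ("D", "C"): 5,  # temptation to defect
    ("D", "D"): 1,  # punishment for mutual defection
}

def is_strictly_dominant(action, payoff, actions=("C", "D")):
    """True if `action` pays strictly more than every alternative,
    whatever the partner does."""
    alternatives = [a for a in actions if a != action]
    return all(
        payoff[(action, partner)] > payoff[(alt, partner)]
        for alt in alternatives
        for partner in actions
    )

print("Defect dominant:", is_strictly_dominant("D", PAYOFF))
print("Cooperate dominant:", is_strictly_dominant("C", PAYOFF))
print("Mutual C beats mutual D:", PAYOFF[("C", "C")] > PAYOFF[("D", "D")])
```

      Under these assumed payoffs, defection is the better reply to either partner move, yet mutual cooperation still pays more than mutual defection. A purely payoff-driven player would therefore defect in every round, which is the prediction the response contrasts with the selective pattern seen in adolescents.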

      Appraisal & Discussion:

      (Q3) The authors have partially achieved their aims, but I believe the manuscript would benefit from additional methodological clarification, specifically regarding the use of hierarchical model fitting and the inclusion of Bayes Factors, to more robustly support their conclusions. It would also be important to investigate the source of the model confusion observed in two of their models.

      We thank the reviewer for this comment. In the revised manuscript, we have clarified the hierarchical Bayesian modeling procedure for the best-fitting model, including the group- and individual-level structure and convergence diagnostics. The hierarchical approach produced results that fully replicated those obtained from the original maximum-likelihood estimation, confirming the robustness of our findings. Please also see the response to Q1.

      Regarding the model confusion between the inequality aversion (Model 4) and social reward (Model 5) models in the model recovery analysis, both models’ simulated behaviors were best captured by the baseline model. This pattern arises because neither model includes learning or updating processes. Given that our task involves dynamic, multi-round interactions, models lacking a learning mechanism cannot adequately capture participants’ trial-by-trial adjustments, resulting in similar behavioral patterns that are better explained by the baseline model during model recovery. We have added a clarification of this point to the Results:

      “The overlap between Models 4 and 5 likely arises because neither model incorporates a learning mechanism, making them less able to account for trial-by-trial adjustments in this dynamic task.”

      (Q4) I am unconvinced by the claim that failures in mentalising have been empirically ruled out, even though I am theoretically inclined to believe that adolescents can mentalise using the same procedures as adults. While reinforcement learning models are useful for identifying biases in learning weights, they do not directly capture formal representations of others' mental states. Greater clarity on this point is needed in the discussion, or a toning down of this language.

      We sincerely thank the reviewer for this professional comment. We agree that our prior wording regarding adolescents’ capacity to mentalise was somewhat overgeneralized. Accordingly, we have toned down the language in both the Abstract and the Discussion to better align our statements with what the present study directly tests. Specifically, our revisions focus on adolescents’ and adults’ ability to predict others’ cooperation in social learning. This is consistent with the evidence from our analyses examining adolescents’ and adults’ model-based expectations and self-reported scores on partner cooperativeness (see Figure 4). In the revised Discussion, we state:

      “Our results suggest that the lower levels of cooperation observed in adolescents stem from a stronger motive to prioritize self-interest rather than a deficiency in predicting others’ cooperation in social learning”.

      (Q5) Additionally, a more detailed discussion of the incentives embedded in the Prisoner's Dilemma task would be valuable. In particular, the authors' interpretation of reduced adolescent cooperativeness might be reconsidered in light of the zero-sum nature of the game, which differs from broader conceptualisations of cooperation in contexts where defection is not structurally incentivised.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma. However, our behavioral and computational evidence suggests that this pattern cannot be explained solely by strategic responses to payoff structures, but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded the Discussion to acknowledge this point and to clarify how both behavioral and modeling results address the reviewer’s concern (see also our response to Q2).

      (Q6) Overall, I believe this work has the potential to make a meaningful contribution to the field. Its impact would be strengthened by more rigorous modelling checks and fitting procedures, as well as by framing the findings in terms of the specific game-theoretic context, rather than general cooperation.

      We thank the reviewer for the professional comments, which have helped us improve our work.

      Reviewer #2 (Public review):

      Summary:

      This manuscript investigates age-related differences in cooperative behavior by comparing adolescents and adults in a repeated Prisoner's Dilemma Game (rPDG). The authors find that adolescents exhibit lower levels of cooperation than adults. Specifically, adolescents reciprocate partners' cooperation to a lesser degree than adults do. Through computational modeling, they show that this relatively low cooperation rate is not due to impaired expectations or mentalizing deficits, but rather a diminished intrinsic reward for reciprocity. A social reinforcement learning model with asymmetric learning rate best captured these dynamics, revealing age-related differences in how positive and negative outcomes drive behavioral updates. These findings contribute to understanding the developmental trajectory of cooperation and highlight adolescence as a period marked by heightened sensitivity to immediate rewards at the expense of long-term prosocial gains.
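
      The asymmetric learning rate mentioned in this summary can be sketched as a belief update with separate rates for positive and negative prediction errors. This is a minimal illustration, not the authors' model: the function names, the coding of outcomes as 1 (partner cooperated) or 0 (partner defected), and the parameter values are all assumptions made here.

```python
def asymmetric_update(p, partner_cooperated, alpha_pos, alpha_neg):
    """One belief update about the partner's cooperation probability.

    alpha_pos applies when the outcome is better than expected
    (partner cooperated), alpha_neg when it is worse (partner defected).
    """
    outcome = 1.0 if partner_cooperated else 0.0
    delta = outcome - p  # prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return p + alpha * delta

def run(choices, alpha_pos, alpha_neg, p0=0.5):
    """Apply the update over a sequence of partner choices."""
    p = p0
    for cooperated in choices:
        p = asymmetric_update(p, cooperated, alpha_pos, alpha_neg)
    return p

# Hypothetical learners: same alpha_neg, different alpha_pos.
low_pos = dict(alpha_pos=0.1, alpha_neg=0.4)
high_pos = dict(alpha_pos=0.4, alpha_neg=0.4)

print("after 5 cooperations:", run([True] * 5, **low_pos), run([True] * 5, **high_pos))
print("after 5 defections:  ", run([False] * 5, **low_pos), run([False] * 5, **high_pos))
```

      In this sketch the two learners end with identical beliefs after a run of defections but diverge after a run of cooperations, which is one way a reduced learning rate for cooperative outcomes could produce a group difference that appears only when partners cooperate.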

      Strengths:

      (1) Rigid model comparison and parameter recovery procedure.

      (2) Conceptually comprehensive model space.

      (3) Well-powered samples.

      We thank the reviewer for highlighting the strengths of our work.

      Weaknesses:

      (Q1) A key conceptual distinction between learning from non-human agents (e.g., bandit machines) and human partners is that the latter are typically assumed to possess stable behavioral dispositions or moral traits. When a non-human source abruptly shifts behavior (e.g., from 80% to 20% reward), learners may simply update their expectations. In contrast, a sudden behavioral shift by a previously cooperative human partner can prompt higher-order inferences about the partner's trustworthiness or the integrity of the experimental setup (e.g., whether the partner is truly interactive or human). The authors may consider whether their modeling framework captures such higher-order social inferences. Specifically, trait-based models, such as those explored in Hackel et al. (2015, Nature Neuroscience), suggest that learners form enduring beliefs about others' moral dispositions, which then modulate trial-by-trial learning. A learner who believes their partner is inherently cooperative may update less in response to a surprising defection, effectively showing a trait-based dampening of learning rate.

      We thank the reviewer for this thoughtful comment. We agree that social learning from human partners may involve higher-order inferences beyond simple reinforcement learning from non-human sources. To address this, we had previously included such mechanisms in our behavioral modeling. In Model 7 (Social Reward Model with Influence), we tested a higher-order belief-updating process in which participants’ expectations about their partner’s cooperation were shaped not only by the partner’s previous choices but also by the inferred influence of their own past actions on the partner’s subsequent behavior. In other words, participants could adjust their belief about the partner’s cooperation by considering how their partner’s belief about them might change. Model comparison showed that Model 7 did not outperform the best-fitting model, suggesting that incorporating higher-order influence updates added limited explanatory value in this context. As suggested by the reviewer, we have further clarified this point in the revised manuscript.

      Regarding trait-based frameworks, we appreciate the reviewer’s reference to Hackel et al. (2015). That study elegantly demonstrated that learners form relatively stable beliefs about others’ social dispositions, such as generosity, especially when the task structure provides explicit cues for trait inference (e.g., resource allocations and giving proportions). By contrast, our study was not designed to isolate trait learning, but rather to capture how participants update their expectations about a partner’s cooperation over repeated interactions. In this sense, cooperativeness in our framework can be viewed as a trait-like latent belief that evolves as evidence accumulates. Thus, while our model does not include a dedicated trait module that directly modulates learning rates, the belief-updating component of our best-fitting model effectively tracks a dynamic, partner-specific cooperativeness, potentially reflecting a prosocial tendency.

      (Q2) This asymmetry in belief updating has been observed in prior work (e.g., Siegel et al., 2018, Nature Human Behaviour) and could be captured using a dynamic or belief-weighted learning rate. Models incorporating such mechanisms (e.g., dynamic learning rate models as in Jian Li et al., 2011, Nature Neuroscience) could better account for flexible adjustments in response to surprising behavior, particularly in the social domain.

      We thank the reviewer for the suggestion. Following the comment, we implemented an additional model incorporating a dynamic learning rate based on the magnitude of prediction errors. Specifically, we developed Model 9 (Social reward model with Pearce–Hall learning algorithm; dynamic learning rate), in which participants’ beliefs about their partner’s cooperation probability are updated using a Rescorla–Wagner rule with a learning rate dynamically modulated by the Pearce–Hall (PH) error-learning mechanism. In this framework, the learning rate increases following surprising outcomes (larger prediction errors) and decreases as expectations become more stable (see Appendix Analysis section for details).
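
      As an illustration only, the following is a minimal sketch of a hybrid Rescorla–Wagner / Pearce–Hall update in its commonly used form. The function names, the parameters eta and kappa, and the starting values are assumptions made for this example; the exact parameterization of Model 9 is given in the Appendix Analysis section and may differ.

```python
def pearce_hall_update(p, alpha, outcome, eta, kappa=1.0):
    """One hybrid Rescorla-Wagner / Pearce-Hall step (illustrative form).

    p       : current belief that the partner cooperates (0..1)
    alpha   : current associability, i.e. the dynamic learning rate (0..1)
    outcome : 1 if the partner cooperated, 0 if the partner defected
    eta     : how strongly associability tracks recent surprise (assumed name)
    kappa   : fixed scaling of the learning rate (assumed name)
    """
    delta = outcome - p                                # prediction error
    new_p = p + kappa * alpha * delta                  # RW update with the current rate
    new_alpha = eta * abs(delta) + (1 - eta) * alpha   # PH associability update
    return new_p, new_alpha


def run(outcomes, p0=0.5, alpha0=0.5, eta=0.3):
    """Apply the update over a sequence of partner choices (hypothetical values)."""
    p, alpha = p0, alpha0
    history = []
    for outcome in outcomes:
        p, alpha = pearce_hall_update(p, alpha, outcome, eta)
        history.append((p, alpha))
    return history


# Five cooperations followed by one surprising defection.
for belief, rate in run([1, 1, 1, 1, 1, 0]):
    print(round(belief, 3), round(rate, 3))
```

      In this sketch the learning rate shrinks while the partner keeps cooperating and rises again after the unexpected defection, which is the qualitative behavior described above.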

      The results showed that this dynamic learning rate model did not outperform our best-fitting model in either adolescents or adults (see Figure supplement 6). We greatly appreciate the reviewer’s suggestion, which has broadened the scope of our analysis. We have now added these analyses to the Appendix Analysis section (also Figure supplement 6) and expanded the Discussion to acknowledge this modeling extension and further discuss its implications.

      (Q3) Second, the developmental interpretation of the observed effects would be strengthened by considering possible non-linear relationships between age and model parameters. For instance, certain cognitive or affective traits relevant to social learning, such as sensitivity to reciprocity or reward updating, may follow non-monotonic trajectories, peaking in late adolescence or early adulthood. Fitting age as a continuous variable, possibly with quadratic or spline terms, may yield more nuanced developmental insights.

      We thank the reviewer for this helpful comment. In addition to the linear analyses, we conducted exploratory analyses to examine potential non-linear relationships between age and the model parameters. Specifically, we fit LMMs with each of the four parameters as outcomes (α<sup>+</sup>, α<sup>−</sup>, β, and ω). The fixed effects included age, a quadratic age term, and gender, and the random effects included subject-specific random intercepts and random slopes for age and gender. Model comparison using BIC did not indicate improvement for the quadratic models over the linear models for α<sup>+</sup> (ΔBIC<sub>quadratic-linear</sub> = 5.09), α<sup>−</sup> (ΔBIC<sub>quadratic-linear</sub> = 3.04), β (ΔBIC<sub>quadratic-linear</sub> = 3.9), or ω (ΔBIC<sub>quadratic-linear</sub> = 0). Moreover, the quadratic age term was not significant for α<sup>+</sup>, α<sup>−</sup>, or β (all ps > 0.10). For ω, we observed a significant linear age effect (b = 1.41, t = 2.65, p = 0.009) and a significant quadratic age effect (b = −0.03, t = −2.39, p = 0.018; see Author response image 1). This pattern is broadly consistent with the group effect reported in the main text. The shaded area in the figure represents the 95% confidence interval. As shown, the interval widens at older ages (≥ 26 years) because there are fewer participants in that range, which limits the robustness of the inferred quadratic effect. Given the limited precision at older ages and the lack of BIC improvement, we did not emphasize the quadratic effect in the revised manuscript and present these results here as exploratory.

      Author response image 1.

      Linear and quadratic model fits showing the relationship between age and the ω parameter, with 95% confidence intervals.

      (Q4) Finally, the two age groups compared - adolescents (high school students) and adults (university students) - differ not only in age but also in sociocultural and economic backgrounds. High school students are likely more homogenous in regional background (e.g., Beijing locals), while university students may be drawn from a broader geographic and socioeconomic pool. Additionally, differences in financial independence, family structure (e.g., single-child status), and social network complexity may systematically affect cooperative behavior and valuation of rewards. Although these factors are difficult to control fully, the authors should more explicitly address the extent to which their findings reflect biological development versus social and contextual influences.

      We appreciate this comment. Indeed, adolescents (high school students) and adults (university students) differ not only in age but also in sociocultural and socioeconomic backgrounds. In our study, all participants were recruited from Beijing and surrounding regions, which helps minimize large regional and cultural variability. Moreover, we accounted for individual-level random effects and included participants’ social value orientation (SVO) as an individual difference measure.

      Nonetheless, we acknowledge that other contextual factors, such as differences in financial independence, socioeconomic status, and social experience, may also contribute to group differences in cooperative behavior and reward valuation. Although our results are broadly consistent with developmental theories of reward sensitivity and social decision-making, sociocultural influences cannot be entirely ruled out. Future work with more demographically matched samples or with socioeconomic and regional variables explicitly controlled will help clarify the relative contributions of biological and contextual factors. Accordingly, we have revised the Discussion to include the following statement:

      “Third, although both age groups were recruited from Beijing and nearby regions, minimizing major regional and cultural variation, adolescents and adults may still differ in socioeconomic status, financial independence, and social experience. Such contextual differences could interact with developmental processes in shaping cooperative behavior and reward valuation. Future research with demographically matched samples or explicit measures of socioeconomic background will help disentangle biological from sociocultural influences.”

      Reviewer #3 (Public review):

      Summary:

      Wu and colleagues find that in a repeated Prisoner's Dilemma, adolescents, compared to adults, are less likely to increase their cooperation behavior in response to repeated cooperation from a simulated partner. In contrast, after repeated defection by the partner, both age groups show comparable behavior.

      To uncover the mechanisms underlying these patterns, the authors compare eight different models. They report that a social reward learning model, which includes separate learning rates for positive and negative prediction errors, best fits the behavior of both groups. Key parameters in this winning model vary with age: notably, the intrinsic value of cooperating is lower in adolescents. Adults and adolescents also differ in learning rates for positive and negative prediction errors, as well as in the inverse temperature parameter.

      Strengths:

      The modeling results are compelling in their ability to distinguish between learned expectations and the intrinsic value of cooperation. The authors skillfully compare relevant models to demonstrate which mechanisms drive cooperation behavior in the two age groups.

      We thank the reviewer for recognizing the strengths of our work.

      Weaknesses:

      (Q1) Some of the claims made are not fully supported by the data:

      The central parameter reflecting preference for cooperation is positive in both groups. Thus, framing the results as self-interest versus other-interest may be misleading.

      We thank the reviewer for this insightful comment. In the social reward model, the cooperation preference parameter is positive by definition, as defection in the rPDG always yields a +2 monetary advantage regardless of the partner’s action. This positive value represents the additional subjective reward assigned to mutual cooperation (e.g., reciprocity value) that counterbalances the monetary gain from defection. Although the estimated social reward parameter ω was positive, the effective advantage of cooperation is Δ = p × ω − 2. Given participants’ inferred beliefs p, Δ was negative for most trials (p × ω < 2), indicating that the social reward was insufficient to offset the +2 advantage of defection. Thus, both adolescents and adults valued cooperation positively, but adolescents’ smaller ω and weaker responsiveness to sustained partner cooperation suggest a stronger weighting on immediate monetary payoffs.
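
      The arithmetic behind the effective advantage of cooperation can be illustrated with a short sketch. The logistic (softmax) choice rule, the function names, and the example values of p, ω, and β are assumptions made for illustration; they are not the fitted estimates.

```python
import math


def cooperation_advantage(p, omega, defect_bonus=2.0):
    """Net value of cooperating: expected social reward minus the +2 gain from defecting."""
    return p * omega - defect_bonus


def p_cooperate(p, omega, beta, defect_bonus=2.0):
    """Probability of cooperating under an assumed logistic choice rule.

    beta is the inverse temperature; larger values make choices follow the
    value difference more closely.
    """
    return 1.0 / (1.0 + math.exp(-beta * cooperation_advantage(p, omega, defect_bonus)))


# Hypothetical values: a fairly confident belief (p = 0.8) and two levels of omega.
for omega in (2.0, 3.0):
    delta = cooperation_advantage(0.8, omega)
    print(omega, round(delta, 2), round(p_cooperate(0.8, omega, beta=3.0), 3))
```

      With p = 0.8, cooperation has a positive net value only when ω exceeds 2 / 0.8 = 2.5, so a positive ω is compatible with mostly defecting whenever p × ω stays below 2.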

      In this light, our framing of adolescents as more self-interested derives from their behavioral pattern: even when they recognized sustained partner cooperation and held high expectations of partner cooperation, adolescents showed lower cooperative behavior and reciprocity rewards compared with adults. Whereas adults increased cooperation after two or three consecutive partner cooperations, this pattern was absent among adolescents. We therefore interpret their behavior as relatively more self-interested, reflecting reduced sensitivity to the social reward from mutual cooperation rather than a categorical shift from self-interest to other-interest, as elaborated in the Discussion.

      (Q2) It is unclear why the authors assume adolescents and adults have the same expectations about the partner's cooperation, yet simultaneously demonstrate age-related differences in learning about the partner. To support their claim mechanistically, simulations showing that differences in cooperation preference (i.e., the w parameter), rather than differences in learning, drive behavioral differences would be helpful.

      We thank the reviewer for raising this important point. In our model, both adolescents and adults updated their beliefs about partner cooperation using an asymmetric reinforcement learning (RL) rule. Although adolescents exhibited a higher positive and a lower negative learning rate than adults, the two groups did not differ significantly in their overall updating of partner cooperation probability (Fig. 4a-b). We then examined the social reward parameter ω, which was significantly smaller in adolescents and determined the intrinsic value of mutual cooperation (i.e., p×ω). This variable differed significantly between groups and closely matched the behavioral pattern.

      Following the reviewer’s suggestion, we conducted additional simulations varying one model parameter at a time while holding the others constant. The difference in mean cooperation probability between adults and adolescents served as the index (positive = higher cooperation in adults). As shown in Author response image 2, decreases in ω most effectively reproduced the observed group difference (shaded area), indicating that age-related differences in cooperation are primarily driven by variation in the social reward parameter ω rather than by the other parameters.

      Author response image 2.

      Simulation results showing how variations in each model parameter affect the group difference in mean cooperation probability (Adults – Adolescents). Based on the best-fitting Model 8 and parameters estimated from all participants, each line represents one parameter (i.e., α<sup>+</sup>, α<sup>−</sup>, ω, β) systematically varied within the tested range (α<sup>±</sup>: 0.1–0.9; ω, β: 1–9) while other parameters were held constant. Positive values indicate higher cooperation in adults. Smaller ω values most strongly reproduced the observed group difference, suggesting that reduced social reward weighting primarily drives adolescents’ lower cooperation.
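
      A one-parameter-at-a-time sweep of this kind can be sketched as follows. This is an illustration under stated assumptions, not the simulation code used for the figure: the partner schedule, the baseline parameter values, the initial belief p0, and the logistic choice rule on p × ω − 2 are all hypothetical.

```python
import math


def mean_coop_prob(partner, alpha_pos, alpha_neg, omega, beta, p0=0.5):
    """Mean cooperation probability of an asymmetric-learning agent.

    partner   : sequence of partner choices (1 = cooperate, 0 = defect)
    alpha_pos : learning rate for positive prediction errors
    alpha_neg : learning rate for negative prediction errors
    omega     : social reward weight; beta : inverse temperature
    """
    p = p0
    total = 0.0
    for action in partner:
        # Choice probability from the net value of cooperating (p * omega - 2).
        total += 1.0 / (1.0 + math.exp(-beta * (p * omega - 2.0)))
        # Asymmetric belief update after observing the partner's choice.
        delta = action - p
        p += (alpha_pos if delta > 0 else alpha_neg) * delta
    return total / len(partner)


# Hypothetical partner schedule and baseline parameters.
partner = [1] * 20 + [0] * 10 + [1] * 20
base = dict(alpha_pos=0.4, alpha_neg=0.4, omega=3.0, beta=2.0)

# Vary omega alone while holding the other parameters at baseline.
for omega in (1.0, 3.0, 5.0, 7.0, 9.0):
    params = {**base, "omega": omega}
    print(omega, round(mean_coop_prob(partner, **params), 3))
```

      The same loop can be repeated for each of the other parameters in turn; comparing the resulting curves shows which parameter moves mean cooperation the most within its tested range.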

      (Q3) Two different schedules of 120 trials were used: one with stable partner behavior and one with behavior changing after 20 trials. While results for order effects are reported, the results for the stable vs. changing phases within each schedule are not. Since learning is influenced by reward structure, it is important to test whether key findings hold across both phases.

      We thank the reviewer for this thoughtful and professional comment. In our GLMM and LMM analyses, we focused on trial order rather than explicitly including the stable vs. changing phase factor, due to concerns about multicollinearity. In our design, phases occur in specific temporal segments, which introduces strong collinearity with trial order. In multi-round interactions, order effects also capture variance related to phase transitions.

      Nonetheless, to directly address this concern, we conducted additional robustness analyses by adding a phase variable (stable vs. changing) to GLMM1, LMM1, and LMM3 alongside the original covariates. Across these specifications, the key findings were replicated (see GLMM<sub>sup</sub>2 and LMM<sub>sup</sub>4–5; Tables 9-11), and the direction and significance of main effects remained unchanged, indicating that our conclusions are robust to phase differences.

      (Q4) The division of participants at the legal threshold of 18 years should be more explicitly justified. The age distribution appears continuous rather than clearly split. Providing rationale and including continuous analyses would clarify how groupings were determined.

      We thank the reviewer for this thoughtful comment. We divided participants at the legal threshold of 18 years for both conceptual and practical reasons grounded in prior literature and policy. In many countries and regions, 18 marks the age of legal majority and is widely used as the boundary between adolescence and adulthood in behavioral and clinical research. Empirically, prior studies indicate that psychosocial maturity and executive functions approach adult levels around this age, with key cognitive capacities stabilizing in late adolescence (Icenogle et al., 2019; Tervo-Clemmens et al., 2023). We have clarified this rationale in the Introduction section of the revised manuscript.

      “Based on legal criteria for majority and prior empirical work, we adopt 18 years as the boundary between adolescence and adulthood (Icenogle et al., 2019; Tervo-Clemmens et al., 2023).”

      We fully agree that the underlying age distribution is continuous rather than sharply divided. To address this, we conducted additional analyses treating age as a continuous predictor (see GLMM<sub>sup</sub>1 and LMM<sub>sup</sub>1–3; Tables S1-S4), which generally replicated the patterns observed with the categorical grouping. Nevertheless, given the limited age range of our sample, the generalizability of these findings to fine-grained developmental differences remains constrained. Therefore, our primary analyses continue to focus on the contrast between adolescents and adults, rather than attempting to model a full developmental trajectory.

      (Q5) Claims of null effects (e.g., in the abstract: "adults increased their intrinsic reward for reciprocating... a pattern absent in adolescents") should be supported with appropriate statistics, such as Bayesian regression.

      We thank the reviewer for highlighting the importance of rigor when interpreting potential null effects. To address this concern, we conducted Bayes factor analyses of the intrinsic reward for reciprocity and reported the corresponding BF10 for all relevant post hoc comparisons. This approach quantifies the relative evidence for the alternative versus the null hypothesis, thereby providing a more direct assessment of null effects. The analysis procedure is now described in the Methods and Materials section:

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      (Q6) Once claims are more closely aligned with the data, the study will offer a valuable contribution to the field, given its use of relevant models and a well-established paradigm.

      We are grateful for the reviewer’s generous appraisal and insightful comments.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) I commend the authors on a well-structured, clear, and interesting piece of work. I have several questions and recommendations that, if addressed, I believe will strengthen the manuscript.

      We thank the reviewer for commending the organization of our paper.

      (2) Introduction: - Why use a zero-sum (Prisoner's Dilemma; PD) versus a mixed-motive game (e.g. Trust Task) to study cooperation? In a finite set of rounds, the dominant strategy can be to defect in a PD.

      We thank the reviewer for this helpful comment. We agree that both the rationale for using the repeated Prisoner’s Dilemma (rPDG) and the limitations of this framework should be clarified. We chose the rPDG to isolate the core motivational conflict between self-interest and joint welfare, as its symmetric and simultaneous structure avoids the sequential trust and reputation dynamics inherent to asymmetric tasks such as the Trust Game (King-Casas et al., 2005; Rilling et al., 2002).

      Although a finitely repeated Prisoner’s Dilemma theoretically favors defection, extensive prior research shows that cooperation can still emerge in long repeated interactions when players rely on learning and reciprocity rather than backward induction (Rilling et al., 2002; Fareri et al., 2015). Our design employed 120 consecutive rounds, allowing participants to update expectations about partner behavior and to establish stable reciprocity patterns over time. We have added the following clarification to the Introduction:

      “The rPDG provides a symmetric and simultaneous framework that isolates the motivational conflict between self-interest and joint welfare, avoiding the sequential trust and reputation dynamics characteristic of asymmetric tasks such as the Trust Game (Rilling et al., 2002; King-Casas et al., 2005).”

      (3) Methods:

      Did the participants know how long the PD would go on for?

      Were the participants informed that the partner was real/simulated?

      Were the participants informed that the partner was going to be the same for all rounds?

      We thank the reviewer for the meticulous review, which helped us present the experimental design and reporting details more clearly. We offer the following clarifications: I. Participants were not informed of the total number of rounds in the rPDG. This prevented endgame expectations and avoided distraction from counting rounds, which could introduce additional effects. II. Participants were told that their partner was another human participant in the laboratory. However, the partner’s behavior was predetermined by a computer program. This design enabled tighter experimental control and ensured consistent conditions across age groups, supporting valid comparisons. III. Participants were informed that they would interact with the same partner across all rounds, in keeping with a multi-round interaction paradigm and stabilizing partner-related expectations. For transparency, we have clarified these points in the Methods and Materials section:

      “Participants were told that their partner was another human participant in the laboratory and that they would interact with the same partner across all rounds. However, in reality, the actions of the partner were predetermined by a computer program. This setup allowed for a clear comparison of the behavioral responses between adolescents and adults. Participants were not informed of the total number of rounds in the rPDG.”

      (4) The authors mention that an SVO was also recorded to indicate participant prosociality. Where are the results of this? Did this track game play at all? Could cooperativeness be explained broadly as an SVO preference that penetrated into game-play behaviour?

      We thank the reviewer for pointing this out. We agree that individual differences in prosociality may shape cooperative behavior, so we conducted additional analyses incorporating SVO. Specifically, we extended GLMM1 and LMM3 by adding the measured SVO as a fixed effect with random slopes, yielding GLMM<sub>sup</sub>3 and LMM<sub>sup</sub>6 (Tables 12–13). The results showed that higher SVO was associated with greater cooperation, whereas its effect on the reward for reciprocity was not significant. Importantly, the primary findings remained unchanged after controlling for SVO. These results indicate that cooperativeness in our task cannot be explained solely by a broad SVO preference, although a more prosocial orientation was associated with greater cooperation. We have reported these analyses and results in the Appendix Analysis section.

      (5) Why was AIC chosen rather than BIC to compare model dominance?

      We apologize for the lack of clarity. Both the Akaike Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are information-theoretic criteria for model comparison, and neither depends on whether the models being compared are nested within one another (Burnham et al., 2002). We have added the following clarification to the Methods.

      “We chose to use the AICc as the metric of goodness-of-fit for model comparison for the following statistical reasons. First, BIC is derived based on the assumption that the “true model” must be one of the models in the limited model set one compares (Burnham et al., 2002; Gelman & Shalizi, 2013), which is unrealistic in our case. In contrast, AIC does not rely on this unrealistic “true model” assumption and instead selects the model that has the highest predictive power in the model set (Gelman et al., 2014). Second, AIC is also more robust than BIC for finite sample sizes (Vrieze, 2012).”
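
      For reference, the standard definitions of the three criteria can be sketched as below, where log_lik is the maximized log-likelihood, k the number of free parameters, and n the number of observations. The function names and the example values (k = 4, n = 120, log_lik = −60) are hypothetical and serve only to show the calculation.

```python
import math


def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def aicc(log_lik, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical fit: 4 free parameters, 120 trials, log-likelihood of -60.
print(aic(-60.0, 4), round(aicc(-60.0, 4, 120), 3), round(bic(-60.0, 4, 120), 3))
```

      Lower values indicate a better trade-off between fit and complexity for all three criteria. The correction term in AICc vanishes as n grows, and BIC penalizes each parameter more heavily than AIC once n exceeds about 7 (ln n > 2).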

      (6) I believe the model fitting procedure might benefit from hierarchical estimation, rather than maximum likelihood methods. Adolescents in particular seem to show multiple outliers in a^+ and w^+ at the lower end of the distributions in Figure S2. There are several packages to allow hierarchical estimation and model comparison in MATLAB (which I believe is the language used for this analysis;

      see https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007043).

      We thank the reviewer for this helpful comment and for referring us to relevant methodological work (Piray et al., 2019). We have addressed this point by incorporating hierarchical Bayesian estimation, which effectively mitigates outlier effects and improves model identifiability. The results replicated those obtained with MLE fitting and further revealed group-level differences in key parameters. Please see our detailed response to Reviewer #1 Q1 for the full description of this analysis and results.

      (7) Results: Model confusion seems to show that the inequality aversion and social reward models were consistently confused with the baseline model. Is this explained or investigated? I could not find an explanation for this.

      The apparent overlap between the inequality aversion (Model 4) and social reward (Model 5) models in the recovery analysis likely arises because neither model includes a learning mechanism, making them unable to capture trial-by-trial adjustments in this dynamic task. Consequently, both were best fit by the baseline model. Please see Response to Reviewer #1 Q3 for related discussion.

      (8) Figures 3e and 3f show the correlation between asymmetric learning rates and age. It seems that both a^+ and a^- are around 0.35-0.40 for young adolescents, and this becomes more polarised with age. Could it be that with age comes an increasing discernment of positive and negative outcomes on beliefs, and younger ages compress both positive and negative values together? Given the higher stochasticity in younger ages (\beta), it may also be that these values simply represent higher uncertainty over how to act in any given situation within a social context (assuming the differences in groups are true).

      We appreciate this insightful interpretation. Indeed, both α+ and α- cluster around 0.35–0.40 in younger adolescents and become increasingly polarized with age, suggesting that sensitivity to positive versus negative feedback is less differentiated early in development and becomes more distinct over time. This interpretation remains tentative and warrants further validation. Based on this comment, we have revised the Discussion to include this developmental interpretation.

      We also clarify that in our model β denotes the inverse temperature parameter; higher β reflects greater choice precision and value sensitivity, not higher stochasticity. Accordingly, adolescents showed higher β values, indicating more value-based and less exploratory choices, whereas adults displayed relatively greater exploratory cooperation. These group differences were also replicated using hierarchical Bayesian estimation (see Response to Reviewer #1 Q1). In response to this comment, we have added a statement in the Discussion highlighting this developmental interpretation.

      “Together, these findings suggest that the differentiation between positive and negative learning rates changes with age, reflecting more selective feedback sensitivity in development, while higher β values in adolescents indicate greater value sensitivity. This interpretation remains tentative and requires further validation in future research.”

      (9) A parameter partial correlation matrix (off-diagonal) would be helpful to understand the relationship between parameters in both adolescents and adults separately. This may provide a good overview of how the model properties may change with age (e.g. a^+'s relation to \beta).

      We thank the reviewer for this helpful comment. We fully agree that a parameter partial correlation matrix can further elucidate the relationships among parameters. Accordingly, we conducted a partial correlation analysis and added the visually presented results to the revised manuscript as Figure 2-figure supplement 4.

      (10) It would be helpful to have Bayes Factors reported with each statistical test, given that several p-values fall between 0.01 and 0.10.

      We thank the reviewer for this important recommendation. We have conducted Bayes factor analyses and reported BF10 for all relevant post hoc comparisons. We also clarified our analysis in the Methods and Materials section:

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      (11) Discussion: I believe the language around ruling out failures in mentalising needs to be toned down. RL models do not enable formal representational differences required to assess mentalising, but they can distinguish biases in value learning, which in itself is interesting. If the authors were to show that more complex 'ToM-like' Bayesian models were beaten by RL models across the board, and this did not differ across adults and adolescents, there would be a stronger case to make this claim. I think the authors either need to include Bayesian models in their comparison, or tone down their language on this point, and/or suggest ways in which this point might be more thoroughly investigated (e.g., using structured models on the same task and running comparisons: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0087619).

      We thank the reviewer for the comments. Please see our response to Reviewer 1 (Appraisal & Discussion section) for details.

      Reviewer #2 (Recommendations for the authors):

      (1) The authors may want to show the winning model earlier (perhaps near the beginning of the Results section, when model parameters are first mentioned).

      We thank the reviewer for this suggestion. We agree that highlighting the winning model early improves clarity. Currently, we have mentioned the winning model before the beginning of the Results section. Specifically, in the penultimate paragraph of the Introduction we state:

      “We identified the asymmetric RL learning model as the winning model that best explained the cooperative decisions of both adolescents and adults.”

      Reviewer #3 (Recommendations for the authors):

      (1) In addition to the points mentioned above, I suggest the following:

      Clarify plots by clearly explaining each variable. In particular, the indices 1 vs. 1,2 vs 1,2,3 were not immediately understandable.

      We thank the reviewer for this suggestion. We agree that the indices were not immediately clear. We have revised the figure captions (Figures 1 and 4) to define these terms explicitly:

      “The x-axis represents the consistency of the partner’s actions in previous trials (t<sub>−1</sub>: last trial; t<sub>−1,2</sub>: last two trials; t<sub>−1,2,3</sub>: last three trials).”

      (2) It's unclear why the index stops at 3. If this isn't the maximum possible number of consecutive cooperation trials, please consider including all relevant data, as adolescents might show a trend similar to adults over more trials.

      We thank the reviewer for raising this point. In our exploratory analyses, we also examined longer streaks of consecutive partner cooperation or defection (up to four or five trials). Two empirical considerations led us to set the cutoff at three in the final analyses. First, the influence of partner behavior diminished sharply with temporal distance. In both GLMMs and LMMs, coefficients for earlier partner choices were small and unstable, and their inclusion substantially increased model complexity and multicollinearity. This recency pattern is consistent with learning and decision models emphasizing stronger weighting of recent evidence (Fudenberg & Levine, 2014; Fudenberg & Peysakhovich, 2016). Second, streaks longer than three were rare, especially among some participants, leading to data sparsity and inflated uncertainty. Including these sparse conditions risked biasing group estimates rather than clarifying them. Balancing informativeness and stability, we therefore restricted the index to three consecutive partner choices in the main analyses, which we believe sufficiently capture individuals’ general tendencies in reciprocal cooperation.
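
      One way such a capped streak index could be computed is sketched below. The coding (1 = partner cooperated, 0 = partner defected), the function name, and the choice to count runs of at least a given length are assumptions made for illustration, not a description of the analysis code.

```python
def streak_index(partner, max_len=3):
    """For each trial, return (last partner action, length of the current run).

    The run length counts how many of the immediately preceding partner
    choices were identical, capped at max_len. The first trial has no
    history and is coded as (None, 0).
    """
    out = []
    for t in range(len(partner)):
        if t == 0:
            out.append((None, 0))
            continue
        last = partner[t - 1]
        n = 1
        # Extend the run backwards while earlier choices match, up to the cap.
        while n < max_len and t - 1 - n >= 0 and partner[t - 1 - n] == last:
            n += 1
        out.append((last, n))
    return out


# Hypothetical partner sequence: four cooperations, one defection, one cooperation.
print(streak_index([1, 1, 1, 1, 0, 1]))
```

      Capping the run at three groups all longer streaks with the three-trial condition, which avoids the sparse cells described above.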

      (3) The term "reciprocity" may not be necessary. Since it appears to reflect a general preference for cooperation, it may be clearer to refer to the specific behavior or parameter being measured. This would also avoid confusion, especially since adolescents do show negative reciprocity in response to repeated defection.

      We thank the reviewer for this comment. In our work, we compute the intrinsic reward for reciprocity as p × ω, where p is the expected probability of partner cooperation and ω is the cooperation preference. In the rPDG, this value framework manifests as a reciprocity-derived reward: sustained mutual cooperation maximizes joint benefits, and the resulting choice pattern reflects a value for reciprocity, contingent on the expected cooperation of the partner. This quantity enters the trade-off between U<sub>cooperation</sub> and U<sub>defection</sub> and captures the participant’s intrinsic reward for reciprocity versus the additional monetary payoff of defection. Therefore, we consider “reciprocity” an appropriate term for this construct.

      (4) Interpretation of parameters should closely reflect what they specifically measure.

      We thank the reviewer for pointing this out. We have refined the relevant interpretations of parameters in the current Results and Discussion sections.

      (5) Prior research has shown links between Theory of Mind (ToM) and cooperation (e.g., Martínez-Velázquez et al., 2024). It would be valuable to test whether this also holds in your dataset.

      We thank the reviewer for this thoughtful comment. Although we did not directly measure participants’ ToM, our design allowed us to estimate participants’ trial-by-trial inferences (i.e., expectations) about their partner’s cooperation probability. We therefore treat these cooperation expectations as an indirect proxy for belief inference, which is related to ToM processes. To test whether this belief-inference component relates to cooperation in our dataset, we further conducted an exploratory analysis (GLMM<sub>sup</sub>4) in which participants’ choices were regressed on their cooperation expectations, group, and the group × cooperation-expectation interaction, controlling for trial number and gender, with random effects. Consistent with the ToM–cooperation link in prior research (Martínez-Velázquez et al., 2024), participants’ expectations about their partner’s cooperation significantly predicted their cooperative behavior (Table 14), suggesting that decisions were shaped by social learning about others’ inferred actions. Moreover, the interaction between group and cooperation expectation was not significant, indicating that this inference-driven social learning process likely operates similarly in adolescents and adults. This aligns with our primary modeling results showing that both age groups update beliefs via an asymmetric learning process. We have reported these analyses in the Appendix Analysis section.

      (6) More informative table captions would help the reader. Please clarify how variables are coded (e.g., is female = 0 or 1? Is adolescent = 0 or 1?), to avoid the need to search across the manuscript for this information.

      We thank the reviewer for raising this point. We have added clear and standardized variable coding in the table notes of all tables to make them more informative and avoid the need to search the paper. We have ensured consistent wording and formatting across all tables.

      (7) I hope these comments are helpful and support the authors in further strengthening their manuscript.

      We thank the three reviewers for their comments, which have been helpful in strengthening this work.

      References

      (1) Fudenberg, D., & Levine, D. K. (2014). Recency, consistent learning, and Nash equilibrium. Proceedings of the National Academy of Sciences of the United States of America, 111(Suppl. 3), 10826–10829. https://doi.org/10.1073/pnas.1400987111.

      (2) Fudenberg, D., & Peysakhovich, A. (2016). Recency, records, and recaps: Learning and nonequilibrium behavior in a simple decision problem. ACM Transactions on Economics and Computation, 4(4), Article 23, 1–18. https://doi.org/10.1145/2956581

      (3) Hackel, L., Doll, B., & Amodio, D. (2015). Instrumental learning of traits versus rewards: Dissociable neural correlates and effects on choice. Nature Neuroscience, 18, 1233–1235. https://doi.org/10.1038/nn.4080

      (4) Icenogle, G., Steinberg, L., Duell, N., Chein, J., Chang, L., Chaudhary, N., Di Giunta, L., Dodge, K. A., Fanti, K. A., Lansford, J. E., Oburu, P., Pastorelli, C., Skinner, A. T., Sorbring, E., Tapanya, S., Uribe Tirado, L. M., Alampay, L. P., Al-Hassan, S. M., Takash, H. M. S., & Bacchini, D. (2019). Adolescents’ cognitive capacity reaches adult levels prior to their psychosocial maturity: Evidence for a “maturity gap” in a multinational, cross-sectional sample. Law and Human Behavior, 43(1), 69–85. https://doi.org/10.1037/lhb0000315

      (5) Krekelberg, B. (2024). Matlab Toolbox for Bayes Factor Analysis (v3.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.13744717

      (6) Martínez-Velázquez, E. S., Ponce-Juárez, S. P., Díaz Furlong, A., & Sequeira, H. (2024). Cooperative behavior in adolescents: A contribution of empathy and emotional regulation? Frontiers in Psychology, 15, 1342458. https://doi.org/10.3389/fpsyg.2024.1342458

      (7) Tervo-Clemmens, B., Calabro, F. J., Parr, A. C., et al. (2023). A canonical trajectory of executive function maturation from adolescence to adulthood. Nature Communications, 14, 6922. https://doi.org/10.1038/s41467-023-42540-8

      (8) King-Casas, B., Tomlin, D., Anen, C., Camerer, C. F., Quartz, S. R., & Montague, P. R. (2005). Getting to know you: reputation and trust in a two-person economic exchange. Science, 308(5718), 78-83. https://doi.org/10.1126/science.1108062

      (9) Rilling, J. K., Gutman, D. A., Zeh, T. R., Pagnoni, G., Berns, G. S., & Kilts, C. D. (2002). A neural basis for social cooperation. Neuron, 35(2), 395–405. https://doi.org/10.1016/s0896-6273(02)00755-9

      (10) Fareri, D. S., Chang, L. J., & Delgado, M. R. (2015). Computational substrates of social value in interpersonal collaboration. Journal of Neuroscience, 35(21), 8170-8180. https://doi.org/10.1523/JNEUROSCI.4775-14.2015

      (11) Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705

      (12) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136

      (13) Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer. https://doi.org/10.1007/b97636

      (14) Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x

      (15) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b16018

      (16) Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Psychological Methods, 17(2), 228–243. https://doi.org/10.1037/a0027127

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      This work by Reitz, Z. L. et al. developed an automated tool for high-throughput identification of microbial metallophore biosynthetic gene clusters (BGCs) by integrating knowledge of chelating moiety diversity and transporter gene families. The study aimed to create a comprehensive detection system combining chelator-based and transporter-based identification strategies, validate the tool through large-scale genomic mining, and investigate the evolutionary history of metallophore biosynthesis across bacteria.

      Major strengths include providing the first automated, high-throughput tool for metallophore BGC identification, representing a significant advancement over manual curation approaches. The ensemble strategy effectively combines complementary detection methods, and experimental validation using HPLC-HRMS strengthens confidence in computational predictions. The work pioneers a global analysis of metallophore diversity across the bacterial kingdom and provides a valuable dataset for future computational modeling.

      Some limitations merit consideration. First, ground truth datasets derived from manual curation may introduce selection bias toward well-characterized systems, potentially affecting performance assessment accuracy. Second, the model's dependence on known chelating moieties and transporter families constrains its ability to detect novel metallophore architectures, limiting discovery potential in metagenomic datasets. Third, while the proposed evolutionary hypothesis is internally consistent, it lacks direct validation and remains speculative without additional phylogenetic studies.

      The authors successfully achieved their stated objectives. The tool demonstrates robust performance metrics and practical utility through large-scale application to representative genomes. Results strongly support their conclusions through rigorous validation, including experimental confirmation of predicted metallophores via HPLC-HRMS analysis.

      The work provides a significant and immediate impact by enabling the transition from labor-intensive manual approaches to automated screening. The comprehensive phylogenetic framework advances understanding of bacterial metal acquisition evolution, informing future studies on microbial metal homeostasis. Community utility is substantial, since the tool and accompanying dataset create essential resources for comparative genomics, algorithm development, and targeted experimental validation of novel metallophores.

      We thank the reviewer for their valuable feedback. We appreciate the positive words, and agree with their listed limitations. Regarding the following comment:

      “Third, while the proposed evolutionary hypothesis is internally consistent, it lacks direct validation and remains speculative without additional phylogenetic studies.”

      We agree that additional phylogenetic analyses are needed in future studies. For the revised manuscript, we have validated our evolutionary hypotheses by additionally analyzing two gene families using the likelihood-based tool AleRax, which implements a probabilistic DTL model. The results were consistent with the eMPRess parsimony-based reconstructions, showing comparable patterns of rare duplication, moderate gene loss, and extensive horizontal transfer. Both methods identified similar lineages as the most probable origin and major recipients of transfer events. This agreement between independent reconciliation frameworks supports the reliability of our evolutionary conclusions. We have added a statement referencing this cross-method validation in the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      This study presents a systematic and well-executed effort to identify and classify bacterial NRP metallophores. The authors curate key chelator biosynthetic genes from previously characterized NRP-metallophore biosynthetic gene clusters (BGCs) and translate these features into an HMM-based detection module integrated within the antiSMASH platform.

      The new algorithm is compared with a transporter-based siderophore prediction approach, demonstrating improved precision and recall. The authors further apply the algorithm to large-scale bacterial genome mining and, through reconciliation of chelator biosynthetic gene trees with the GTDB species tree using eMPRess, infer that several chelating groups may have originated prior to the Great Oxidation Event.

      Overall, this work provides a valuable computational framework that will greatly assist future in silico screening and preliminary identification of metallophore-related BGCs across bacterial taxa.

      Strengths:

      (1) The study provides a comprehensive curation of chelator biosynthetic genes involved in NRP-metallophore biosynthesis and translates this knowledge into an HMM-based detection algorithm, which will be highly useful for the initial screening and annotation of metallophore-related BGCs within antiSMASH.

      (2) The genome-wide survey across a large bacterial dataset offers an informative and quantitative overview of the taxonomic distribution of NRP-metallophore biosynthetic chelator groups, thereby expanding our understanding of their phylogenetic prevalence.

      (3) The comparative evolutionary analysis, linking chelator biosynthetic genes to bacterial phylogeny, provides an interesting and valuable perspective on the potential origin and diversification of NRP-metallophore chelating groups.

      We greatly appreciate these comments.

      Weaknesses:

      (1) Although the rule-based HMM detection performs well in identifying major categories of NRP-metallophore biosynthetic modules, it currently lacks the resolution to discriminate between fine-scale structural or biochemical variations among different metallophore types.

      We agree that this is a current limitation to the methodology. More specific metallophore structural prediction is among our future goals for antiSMASH. We have added a statement to this effect in the conclusion.

      (2) While the comparison with the transporter-based siderophore prediction approach is convincing overall, more information about the dataset balance and composition would be appreciated. In particular, specifying the BGC identities, source organisms, and Gram-positive versus Gram-negative classification would improve transparency. In the supplementary tables, the "Just TonB" section seems to include only BGCs from Gram-negative bacteria - if so, this should be clearly stated, as Gram type strongly influences siderophore transport systems.

      The reviewer raises good points here. An additional ZIP file containing all BGCs used for the manual curation was inadvertently left out of the supplemental dataset for the first version of the manuscript. We have added columns with source organisms and Gram stain (retrieved from BacDive) to Table S2. F1 scores were similar for the Gram-positive and Gram-negative subsets, as seen in the new Table S2.

      We thank the reviewer for suggesting this additional analysis, and have added a brief statement in the revised manuscript.

      The “Just TonB” section (in which we tested the performance of requiring TonB without another transporter) was not used for the manuscript. We will preserve it in the revised Table S2 for transparency.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In line 43:

      "excreted" should be replace by "secreted".

      Done.

      (2) In lines 158-159:

      "we manually predicted metallophore production among a large set of BGCs."

      If they are first "annotated with default antiSMASH v6.1", then it is not entirely manual, right? I would suggest making this sentence clearer.

      We have revised the language.

      (3) In lines 165-169:

      It would be good to show the confusion matrix of these results.

      The confusion matrices are found in Table S2, columns AL-AR.
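
      For readers who want to recompute the reported metrics from confusion-matrix counts, a minimal sketch follows. The function name and the example counts are hypothetical; the actual values live in Table S2, whose column layout is not reproduced here.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```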

      (4) In Table 1:

      Method names (AntiSMASH rules/Transporter genes) could be misleading, since they are all AntiSMASH-based, right?

      We have adjusted the methods to clarify that while the transporter genes were detected using a modified version of antiSMASH, they are not related to our chelator-based detection rule (which is now correctly singular throughout the text).

      (5) Line 198:

      There are accidental spaces and characters inserted here.

      We could not find any accidental spaces and characters here.

      (6) Line 209:

      "In total, 3,264 NRP metallophore BGC regions were detected"

      Is this number correct? I don't see a correspondence in Table 1.

      We have added the following sentence to the Table 1 legend: “An additional 54 BGC regions were detected as NRP metallophores without meeting the requirements for the antiSMASH NRPS rule.”

      (7) Line 294:

      "From B. brennerae, we identified four catecholic compounds"

      From the bacterial cells or the culture supernatant? I think it is important to state this in a more precise way. If it is from the supernatant, it could be from EVs.

      We state in line 292 that “organic compounds were extracted from the culture supernatants”. As our goal was only to confirm the ability of the strains to produce the predicted metallophores, the precise localization (including cell pellet or EVs) was not explored.

      (8) Lines 349-357:

      These results would benefit greatly from a visualization strategy.

      Thank you, we have added a reference to the existing visualization in Fig. 5, Ring C.

      (9) Lines 452-454:

      How could clusters be de-replicated? Is there an identity equivalence scheme or similarity metric?

      The BGC regions were de-replicated with BiG-SCAPE, which uses multiple similarity metrics as described in Navarro-Muñoz et al., 2020. Clusters could be dereplicated further using a stricter cutoff.

      (10) Line 457:

      "relatively low number of published genomes."

      Could metagenome-assembled genomes help in that matter?

      This is a good question, but we find that MAGs are usually too fragmented to yield complete NRPS BGC regions. We’ve added additional sentences earlier in the discussion: “Detection rates were also lower for fragmented genomes; unfortunately, this limitation (inherent to antiSMASH itself) may hinder the identification of metallophore biosynthesis in metagenomes. As long-read sequencing of metagenomes becomes more common, we expect that detection will improve.”

      (11) Lines 514-515:

      "Adequately-performing pHMMs for Asp and His β-hydroxylase subtypes could not be constructed using the above method."

      What is the overall impact of this discrepancy in the methodology for these specific groups?

      The phylogeny-based methodology was used to reduce false positives. We expect this method will have improved precision at the possible expense of recall.

      (12) Lines 543-545:

      "RefSeq representative bacterial genomes were dereplicated at the genus level using R, randomly selecting one genome for each of the 330 genera determined by GTDB"

      Isn't it more of a random sampling than a dereplication? Dereplication would involve methods such as ANI computation.

      You are correct; we have adjusted the language to clarify.
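
      The distinction the reviewer draws can be made concrete: picking one genome per genus is stratified random sampling on a taxonomic label, with no sequence-similarity computation involved. The sketch below illustrates this in Python (the manuscript states that R was used); the function name, the input format, the accessions, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sample_one_per_genus(
    genomes: Iterable[Tuple[str, str]], seed: int = 0
) -> Dict[str, str]:
    """Randomly pick one genome accession per genus.

    genomes: (accession, genus) pairs, e.g. genus labels taken from a
    taxonomy table. A seeded generator keeps the draw reproducible.
    """
    by_genus = defaultdict(list)
    for accession, genus in genomes:
        by_genus[genus].append(accession)
    rng = random.Random(seed)
    # Sort so that the result does not depend on input order.
    return {
        genus: rng.choice(sorted(accessions))
        for genus, accessions in sorted(by_genus.items())
    }


if __name__ == "__main__":
    # Hypothetical accessions, for illustration only.
    demo = [("G1", "Bacillus"), ("G2", "Bacillus"), ("G3", "Vibrio")]
    print(sample_one_per_genus(demo))
```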

      (13) Lines 559-560: "were filtered to remove clusters on contig edges."

      This sentence is confusing because networks will be mentioned soon, and they also have edges (not the edges mentioned here), and they could also be clustered (not the clusters mentioned here). Is there a way to make the terminology clearer?

      Thank you, we have adjusted the text to read “BGC regions on contig boundaries”

      (14) Line 560:

      "The resulting 2,523 BGC regions, as well as 78 previously reported BGCs "

      How many were there before filtering?

      We have added the number: 3,264

      (15) Lines 579-580:

      Confusing terminology, as mentioned in Lines 559-560.

      Adjusted as above.

      General comments and questions:

      An objective suggestion to enrich the discussion is to address the role of bacterial extracellular vesicles (EVs) as metallophore carriers. Studies show that EVs, such as outer membrane vesicles, can transport siderophores or other metallophores for iron acquisition in various bacteria, functioning as "public goods" for community-wide nutrient sharing. Highlighting this mechanism would add ecological and functional context to the manuscript. In the future, EV-associated metallophore transport could also be considered for integration into computational detection tools.

      We thank the reviewer for the suggestion; however, we do not think that such a discussion is needed. We briefly discuss the ecological function of metallophores as public goods (and public bads) in the first paragraph of the introduction. We did not find any reports that EV-associated genes co-localize with metallophore BGCs, which would be required for their presence to be a useful marker of metallophore production.

      Is there a feasible path to more generalizable detection of chelating motifs using chemistry-aware features? For example, a machine learning classifier trained on submolecular descriptors (e.g., functional groups, coordination motifs, SMARTS patterns, graph fingerprints, metal-binding propensity scores) could complement the current genome-based approach and broaden coverage beyond known metallophore families. While the discussion mentions future extensions centered on genomic features, integrating chemical information from predicted or known products (or biosynthetic logic inferred from BGC composition) could be explored. A hybrid framework, linking BGC-derived features with chemistry-derived features, may improve both recall for novel metallophore classes and precision in distinguishing true chelators from confounders, thereby increasing overall accuracy.

      We can envision a classifier that uses submolecular descriptors to predict the ability of a molecule to bind metal ions. However, starting with a BGC and accurately predicting the structure of a hitherto unknown chelating moiety will likely prove difficult. We have added a sentence to the discussion stating that a future tool could use accessory genes to more completely predict chemical structure.

      Although the initial analysis was conducted using RefSeq genomes, what are the anticipated challenges and limitations when scaling this method for BGC prospecting in metagenome-assembled genomes (MAGs), particularly considering the inherent quality differences, assembly fragmentation, and taxonomic uncertainties that characterize MAG datasets compared to curated reference genomes?

      Please see our response to comment 10, line 457. Our pHMM-based approach is designed to be robust to organism taxonomy; however, fragmentation is a significant barrier to accurate antiSMASH-based BGC detection (including in contig-level single-isolate genomes, see Table 1).

      Reviewer #2 (Recommendations for the authors):

      (1) In the "Chemical identification of genome-predicted siderophores across taxa" section, it would be helpful to annotate the cross-species similarities between predicted metallophore BGCs and their reference clusters (Ref BGCs). As currently described, the main text seems to highlight the cross-species resolving power of BiG-SCAPE itself rather than demonstrating the taxonomic generalizability of the chelator HMM-based detection module.

      Thank you for this comment. We intended to display that the new rule is useful for detecting BGCs in unexplored taxa, but we acknowledge that there is not a great diversity in the strains we selected. We have removed “across taxa” to avoid misleading the reader and clarify our intent.

      (2) In addition to using eMPRess for gene-species reconciliation, it may be beneficial to explore or at least reference alternative reconciliation tools to validate the inferred duplication, transfer, and loss (DTL) scenarios. Incorporating such cross-method comparisons would enhance the robustness and credibility of the evolutionary conclusions.

      We appreciate this valuable suggestion. To validate the robustness of our reconciliation-based inferences, we additionally analyzed two gene families using the likelihood-based tool AleRax, which implements a probabilistic DTL model. The results were consistent with the eMPRess parsimony-based reconstructions, showing comparable patterns of rare duplication, moderate gene loss, and extensive horizontal transfer. Both methods identified similar lineages as the most probable origin and major recipients of transfer events. This agreement between independent reconciliation frameworks supports the reliability of our evolutionary conclusions. We have added a brief statement referencing this cross-method validation in the revised manuscript.

    1. As archivists we like these questions because they tell us that people are eager for access to archival records. They also show that people realize that not everything is digitized. Indeed only a tiny fraction of the world’s primary resources are available digitally.

      Sure, some individuals may be more eager for physical records, but there is no question that digital archives are significantly easier to access. So I think that is a big factor to consider.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We thank the reviewers and editors for their careful evaluation of our manuscript and their positive comments on the importance and rigor of the work. Below you will find our point-by-point response to each reviewer's suggestions. We believe that we have addressed (in the response and the revised manuscript) all of the concerns. Please note that in some cases we have numbered a reviewer's comments for clarity; beyond this, we have not altered any of the reviewers' text.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Lo et al., report a high-throughput functional profiling study on the gene encoding for argininosuccinate synthase (ASS1), done in a yeast experimental system. The study design is robust (see lines 141-143, main text, Methods), whereby "approximately three to four independent transformants of each variant would be isolated and assayed." (lines 140 - 141, main text, Methods). Such a manner of analysis will allow for uncertainty of the functional readout for the tested variants to be accounted for.

      This is an outstanding study providing insights on the functional landscape of ASS1. Functionally impaired ASS1 may cause citrullinemia type I, and disease severity varies according to the degree of enzyme impairment (line 30, main text; Abstract). Data from this study forms a valuable resource in allowing for functional interpretation of protein-altering ASS1 variants that could be newly identified from large-scale whole-genome sequencing efforts done in biobanks or national precision medicine programs. I have some suggestions for the Authors to consider:

      1. The specific function of ASS1 is to condense L-citrulline and L-aspartate to form argininosuccinate. Instead of measuring either depletion of substrate or formation of product, the Authors elected to study 'growth' of the yeast cells. This is a broader phenotype which could be determined by other factors outside of ASS1. Whereas I agree that the experiments were beautifully done, the selection of an indirect phenotype such as ability of the yeast cells to grow could be more vigorously discussed.

      We appreciate the reviewer's point regarding the indirect nature of growth as a functional readout. In our system, yeast growth is tightly and specifically coupled to ASS enzymatic activity. The strains used are isogenic and lack the native yeast argininosuccinate synthetase, such that arginine biosynthesis, and therefore yeast replication on minimal medium lacking arginine, depends exclusively on the activity of human ASS1. Under these defined and limiting conditions, growth provides a quantitative proxy for ASS1 function. However, we acknowledge that this assay does not resolve specific molecular mechanisms underlying reduced function, such as altered catalytic activity versus effects on protein stability. We have updated the text to clarify these points.

      "While growth is an indirect phenotype relative to direct measurement of substrate turnover or product formation, it is tightly coupled to ASS enzymatic activity in this system and is expected to be impaired by amino acid substitutions that reduce catalytic activity or protein stability. Therefore, growth on minimal medium lacking arginine is a quantitative measure of ASS enzyme function, allowing the impact of ASS1 missense variants to be assessed at scale through a high-throughput growth assay, in a single isogenic strain background, under controlled, defined conditions that limit confounding factors unrelated to ASS1 activity. We expect that the assay will detect reductions in both catalytic activity and protein stability but will not distinguish between these mechanisms."

      2. One of the key reasons why studies such as this one are valuable is due to the limitations of current variant classification methods that rely on 'conservation' status of amino acid residues to predict which variants might be 'pathogenic' and which variants might be 'likely benign'. However, there are serious limitations, and Figures 2 and 6 in the main text show this clearly. Specifically, there is an appreciable number of variants that, despite being classified as "ClinVar Pathogenic", were shown by the assay to unlikely be functionally impaired. This should be discussed vigorously. Could these inconsistencies be potentially due to the read out (growth instead of a more direct evaluation of ASS1 function)?

      We interpret this discrepancy as reflecting a sensitivity limitation of the growth-based readout rather than a fundamental disagreement between functional effect and clinical annotation. Specifically, we believe that our assay is unable to resolve the very mildest hypomorphic variants from true wild type, i.e., the residual activity of these variants is sufficient to fully support yeast growth under the conditions used. On this basis, we have chosen not to treat wild-type-like growth in our assay as informative for benignity; conversely, reduced growth provides evidence supporting pathogenicity (all clinically validated variants examined in this range are pathogenic).

      We have revised the manuscript to clarify this point explicitly and to frame these variants as lying outside the effective resolution limit of the assay rather than representing true false positives. Additional discussion of this limitation and its implications is provided in our responses to Reviewer 2 (points 1 and 4) along with specific changes made to the text.

      3. Figure 3 is very interesting, showing a continuum of functional readout ranging from 'wild-type' to 'null'. It is very interesting that the Authors used a threshold of less than 0.85 as functionally hypomorphic. What does this mean? It would be very nice if they have data from patients carrying two hypomorphic ASS1 alleles, and correlate their functional readout with severity of clinical presentation. The reader might be curious as to the clinical presentation of individuals carrying, for example, two ASS1 alleles with normalized growth of 0.7 to 0.8.

      I hope you will find these suggestions helpful.

      We thank the reviewer for this thoughtful comment. Figure 3 indeed illustrates a continuum of functional effects, and we agree that careful interpretation of the thresholds used is important. To clarify the rationale for the hypomorphic threshold, the interpretation of intermediate growth values, and to emphasize that these labels reflect only behavior in the functional assay, we have rewritten the relevant section of the Results:

      "The normalized growth scores of the 2,193 variants tested in our functional assay form a clear bimodal distribution (Figure 3), with two distinct peaks corresponding to functional extremes, as is commonly reported in large-scale functional assays of protein function [9, 10]. The smaller peak, centered around the null control (normalized growth = 0), represents variants that fail to support growth in the assay (growth 0.85). Variants with growth values falling between these two peak-based thresholds display partial functional impairment and are classified as functionally hypomorphic (n = 323). Crucially, these classifications are entirely derived from the observed peaks in the distribution of growth values and reflect differences in functional activity under the assay conditions. They do not provide direct evidence for clinical pathogenicity or benignity and should not be used for clinical variant interpretation without proper benchmarking against clinical reference datasets, as implemented below within an OddsPath framework."

      We agree with the reviewer that correlating functional measurements with clinical severity in individuals carrying two hypomorphic ASS1 alleles would be highly informative, particularly given that ASS1 deficiency is an autosomal recessive disorder. While mild hypomorphic variants (for example, variants with normalized growth values of 0.7-0.8 in our assay) could plausibly contribute to disease when paired with a complete loss-of-function allele, systematic analysis of combinatorial genotype effects and genotype-phenotype correlations is beyond the scope of the present study, which focuses on the functional effects of individual variants. We view this as an important direction for future work.
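
      The peak-based classification described in the rewritten Results can be sketched as a simple two-threshold rule on normalized growth (null control = 0, wild type = 1). Only the upper cutoff of 0.85 appears in the text; the lower cutoff is not recoverable from this response, so it is left as a required, hypothetical parameter. The category labels and the handling of values exactly at a cutoff are likewise assumptions.

```python
def classify_variant(growth: float, null_cutoff: float, wt_cutoff: float = 0.85) -> str:
    """Assign a functional class from a normalized growth score.

    growth: normalized growth (0 = null control, 1 = wild type).
    null_cutoff: hypothetical lower threshold; must be supplied by the caller.
    wt_cutoff: upper threshold (0.85 in the response above).
    Boundary handling (>= and <) is an assumption of this sketch.
    """
    if not 0 <= null_cutoff < wt_cutoff:
        raise ValueError("expected 0 <= null_cutoff < wt_cutoff")
    if growth >= wt_cutoff:
        return "wild-type-like"
    if growth < null_cutoff:
        return "null-like"
    return "hypomorphic"


if __name__ == "__main__":
    # The lower cutoff of 0.2 is a placeholder, not a value from the study.
    for score in (0.05, 0.75, 0.95):
        print(score, classify_variant(score, null_cutoff=0.2))
```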

      Reviewer #1 (Significance (Required)):

      This is an outstanding study providing insights on the functional landscape of ASS1. Functionally impaired ASS1 may cause citrullinemia type I, and disease severity varies according to the degree of enzyme impairment (line 30, main text; Abstract). Data from this study forms a valuable resource in allowing for functional interpretation of protein-altering ASS1 variants that could be newly identified from large-scale whole-genome sequencing efforts done in biobanks or national precision medicine programs.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, Lo et al characterize the phenotypic effect of ~90% of all possible ASS1 missense mutations using an elegant yeast-based system, and use this dataset to aid the interpretation of clinical ASS1 variants. Overall, the manuscript is well-written and the experimental data are interpreted rigorously. Of particular interest is the identification of pairs of deleterious alleles that rescue ASS1 activity in trans. My comments mainly pertain to the relevance of using a yeast screening methodology to infer functional effects of human ASS1 mutations.

      1. Since human ASS1 is heterologously expressed in yeast for this mutational screen, direct comparison of native expression levels between human cells and yeast is not possible. Could the expression level of human ASS1 (driven by the pARG1 promoter) in yeast alter the measured fitness defect of each variant? For instance, if ASS1 expression in yeast is sufficiently high to mask modest reductions in catalytic activity, such variants may be misclassified as hypomorphic rather than amorphic. Conversely, if expression is intrinsically low, even mild catalytic impairments could appear deleterious. While it is helpful that the authors used non-human primate SNV data to calibrate their assay, experiments could be performed to directly address this possibility.

      The nature of the relationship between yeast growth and availability of functional ASS1 could also influence the interpretation of results from the yeast-based screen. Does yeast growth scale proportionately with ASS1 enzymatic activity?

      We completely agree that the expression level of human ASS1 in yeast could influence the measured fitness effects of individual variants. We expect the rank ordering of variants in our growth assay to reflect their relative enzymatic activity (i.e. a monotonic relationship) but acknowledge that the precise mapping between activity and growth is unknown and may include ceiling and floor effects that limit the assay's dynamic range. As the reviewer notes, under high expression conditions moderate loss-of-function variants could appear indistinguishable from wild type (ceiling effect), whereas under lower expression the same variants could behave closer to the null control (floor effect).
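
      The ceiling and floor effects described above can be pictured with a toy model. The sketch below is purely illustrative: the piecewise-linear form, the function name `toy_growth`, and the `floor` and `ceiling` values are assumptions made for this example, not the measured activity-to-growth relationship of the assay.

```python
def toy_growth(activity, floor=0.1, ceiling=0.6):
    """Map relative enzyme activity (0 to 1) to normalized growth (0 to 1).

    Hypothetical piecewise-linear model: no growth at or below `floor`,
    wild-type-like growth at or above `ceiling`, linear in between.
    """
    if activity <= floor:
        return 0.0
    if activity >= ceiling:
        return 1.0
    return (activity - floor) / (ceiling - floor)


# Illustrative activities only; these are not measured values.
activities = [0.0, 0.05, 0.2, 0.35, 0.7, 1.0]
growths = [toy_growth(a) for a in activities]

# Rank order is preserved: growth never decreases as activity increases.
assert growths == sorted(growths)
# A variant retaining 70% activity is indistinguishable from wild type here.
assert toy_growth(0.7) == toy_growth(1.0) == 1.0

for a, g in zip(activities, growths):
    print(f"activity={a:.2f} -> growth={g:.2f}")
```

      In this picture, a change in expression level would shift where `floor` and `ceiling` fall on the activity axis, which is why the same variant could look wild-type-like under high expression and null-like under low expression.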

      In our system, ASS1 is expressed from the pARG1 promoter, chosen under the assumption that the native expression level of ARG1 (the yeast ASS1 ortholog) is appropriately tuned for yeast growth. Crucially, rather than assuming a fixed mapping from assay growth to clinical pathogenicity (given potential nonlinearities in the relationship between ASS function and growth), we benchmark the assay against external data, including known pathogenic and benign variants and non-human primate SNVs, to calibrate thresholds and guide interpretation within an OddsPath framework. This benchmarking indicates that ceiling effects are likely present, with some mild loss-of-function pathogenic variants appearing indistinguishable from wild type in the growth assay. We explicitly account for this by not using high-growth scores as evidence toward benignity. We have made the following changes to the manuscript:

      "A subset of clinically pathogenic ASS1 variants exhibit near-wild-type growth in our yeast assay. In general, we expect a monotonic relationship between ASS function and yeast growth, but with the potential for floor and ceiling effects that constrain the assay's dynamic range. In this context, we interpret high-growth pathogenic variants as likely causing mild loss of function that cannot be distinguished from wild type in our assay."

      "Based on these findings and given that 22/56 pathogenic variants show >85% growth, we conclude that growth above this threshold should not be used as evidence toward benignity."

      2. It would be helpful to add an additional diagram to Figure 1A explaining how the screen was performed, in particular: when genotype and phenotype were measured, relative to plating on selective vs non-selective media? This is described in "Variant library sequence confirmation" and "Measuring the growth of individual isolates" of the Methods section but could also be distilled into a diagram.

      We thank the reviewer for this helpful suggestion. We have updated Figure 1 by adding a new schematic panel (Figure 1C) that distills the experimental workflow into a visual overview. This diagram is intended to complement the detailed descriptions in the Methods and improve clarity for the reader.

      3. The authors rationalize the biochemical consequences of ASS1 mutations in the context of ASS1 per se - for example, mutations in the active site pocket impair substrate binding and therefore catalytic activity, which is expected. Does ASS1 physically interact with other proteins in human cells, and could these interactions be altered in the presence of specific ASS1 mutations? Such effects may not be captured by performing mutational scanning in yeast.

      We are not aware of any specific protein-protein interactions involving ASS that are required for its enzymatic function. However, we agree that ASS could engage in non-essential interactions with other human proteins that might be altered by specific missense variants and that such interactions would not necessarily be captured in a yeast-based assay.

      Importantly, our complementation system depends on human ASS providing the essential enzymatic activity required for arginine biosynthesis in yeast. If ASS1 required obligate human-specific protein interactions to function, even the wild-type enzyme would fail to support yeast growth, which is clearly not the case. We therefore conclude that the assay robustly reports on the intrinsic enzymatic activity of ASS, while acknowledging that non-essential human-specific interactions may not be assessed. We have updated the manuscript to reflect this point.

      "Importantly, successful functional complementation indicates that ASS enzymatic activity does not depend on any obligate human-specific protein interactions."

      4. The authors note that only a small number (2/11) of mutations at the ASS1 monomer-monomer interface lead to growth defects in yeast. It would be helpful for the authors to discuss this further.

      As discussed in response to the reviewer's comments on the relationship between ASS activity and yeast growth (point 1 above), we expect growth to be a monotonic but nonlinear function of enzymatic activity, with potential ceiling effects at high activity. Under this model, variants causing weak or moderate loss of function may remain indistinguishable from wild type when residual activity is sufficient to support normal growth. We favor this explanation for the observation that only 2/11 interface variants show reduced growth, as many pathogenic interface substitutions are associated with milder disease presentations, consistent with higher residual enzyme function. Consistent with this interpretation, variants affecting the active site, where substitutions are expected to cause large reductions in catalytic activity, are readily detected by the assay.

      Although we cannot exclude partial buffering of dimerization defects in yeast, we interpret the reduced sensitivity to interface variants primarily as a general limitation of growth-based assays. Accordingly, our decision not to use growth >85% as evidence toward benignity is conservative relative to approaches that would classify high-growth variants as benign except at the monomer-monomer interface, avoiding reliance on structural subclassification and minimizing the risk of false benign interpretation. Reduced growth, by contrast, provides strong evidence of loss of ASS1 function and pathogenicity, validated under the OddsPath framework.

      We have updated the Results and Discussion sections to clarify these points (also see response to the reviewer's point 1).

      "A subset of clinically pathogenic ASS1 variants exhibit near-wild-type growth in our yeast assay. In general, we expect a monotonic relationship between ASS function and yeast growth, but with the potential for floor and ceiling effects that constrain the assay's dynamic range. In this context, we interpret high-growth pathogenic variants as likely causing mild loss of function that cannot be distinguished from wild type in our assay. Consistent with this view, many pathogenic variants with high assay growth are located at the monomer-monomer interface rather than the active site, and are associated with milder or later-onset clinical presentations, suggesting partial enzymatic impairment that is clinically relevant in humans but not resolved by the yeast assay."

      "Based on these findings and given that 22/56 pathogenic variants show >85% growth, we conclude that growth above this threshold should not be used as evidence toward benignity. Notably, this approach is conservative relative to treating high-growth variants as benign except at the monomer-monomer interface, avoiding reliance on structural subclassification and minimizing the risk of false benign interpretation arising from assay ceiling effects. Conversely, the variants with

      Reviewer #2 (Significance (Required)):

      This study presents the first comprehensive mutational profiling of human ASS1 and would be of broad interest to clinical geneticists as well as those seeking biochemical insights into the enzymology of ASS1. The authors' use of a yeast system to profile human mutations would be particularly useful for researchers performing deep mutational scans, given that it provides functional insights in a rapid and inexpensive manner.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Section 1 - Evidence, reproducibility, and clarity Summary This manuscript presents a comprehensive functional profiling of 2,193 ASS1 missense variants using a yeast complementation assay, providing valuable data for variant interpretation in the rare disease citrullinemia type I. The dataset is extensive, technically sound, and clinically relevant. The demonstration of intragenic complementation in ASS1 is novel and conceptually important. Overall, the study represents a substantial contribution to functional genomics and rare disease variant interpretation.

      Major comments 1. This is an exciting paper as it can provide support to clinicians to make actionable decisions when diagnosing infants. I have a few major comments, but I want to emphasize the label of "functionally unimpaired" variants to be misleading. The authors explain that there are several pathogenic ClinVar variants that fall into this category (above the >.85 growth threshold) but I think this category needs a more specific name and I would ask the authors to reiterate the shortcomings of the assay again in the Discussion section.

      We thank the reviewer for raising this important point. We agree that the label "functionally unimpaired" could be misleading if interpreted as implying clinical benignity rather than assay behavior. We have therefore clarified that this designation refers strictly to variant behavior in the yeast growth assay and does not imply absence of pathogenicity.

      In addition, we have expanded the Discussion to explicitly address the existence of clinically pathogenic variants with high growth scores (>0.85), emphasizing that these likely reflect a ceiling effect of the assay and represent a key limitation for interpretation. This clarification reiterates that high-growth scores should not be used as evidence toward benignity, while reduced growth provides strong functional evidence of pathogenicity. Relevant revisions are described in our responses to Reviewers 1 and 2.

      2. I think there's an important discussion to be had here: is the assay detecting variants that alter the function of ASS or is it detecting a complete ablation of enzymatic activity? The results might be strengthened with a follow-up experiment that identifies stably expressed ASS1 variants.

      We agree with the reviewer that distinguishing between stability and enzyme activity would be valuable information. Unfortunately, we do not currently have the resources to perform this type of large-scale study. We have acknowledged in the text that our assay does not distinguish between enzyme activity and protein stability:

      "We expect that the assay will detect reductions in both catalytic activity and protein stability, but will not distinguish between these mechanisms."

      At the very least, it would be great to see the authors replicate some of their interesting results from the high-throughput screen by down-selecting to ~12 variants of uncertain significance that could be newly considered pathogenic.

      We have included new analysis of all 25 VUS variants falling in the pathogenic range of our assay (Supplemental Table S7). Reclassification under current guidelines (in the absence of our data) shifts six variants to Pathogenic/Likely Pathogenic and 11 more are reclassified to Likely Pathogenic with the application of our functional data as PS3_Supporting. The remaining eight VUS are all reclassified to Likely Pathogenic when inclusion of homozygous PrimateAI-benign variants allows the assay to satisfy full PS3 criteria.

      3. I would ask the authors to provide more citations of the literature in the introduction of the manuscript. I would be especially interested in knowing more about human ASS being identified as a homolog of yeast ARG1, as they share little sequence similarity (27.5%) at the protein level. That said, I find the yeast complementation assay exciting.

      We thank the reviewer for this suggestion. Human ASS and yeast Arg1 catalyze the same biochemical reaction and share approximately 49% amino acid sequence identity. We have revised the Introduction to clarify this relationship and to note explicitly that the Saccharomyces Genome Database (SGD) identifies the human gene encoding argininosuccinate synthase (ASS1) as the ortholog of yeast ARG1. An appropriate citation has been added to support this statement. The protein alignments have been provided as File S2.

      "This assay is based on the ability of human ASS to functionally replace (complement) its yeast ortholog (Arg1) in S. cerevisiae (Saccharomyces Genome Database, 2026). Importantly, successful functional complementation indicates that ASS enzymatic activity does not depend on any obligate human-specific protein interactions. At the protein level, human ASS and yeast Arg1 display 49% sequence identity (File S2) and share identical enzymatic roles in converting citrulline and aspartate into argininosuccinate."

      4. I appreciate the efforts made by the authors to share their work and make this study more reproducible, such as sharing the hASS1 and yASS1 plasmids on NCBI GenBank (Line 121) and publishing the ONT reads on SRA (Line 154). I have made requests for additional data to be shared, such as the custom method/code for codon optimization and a table of Twist variant cassettes that were ordered. I would also love to see these results shared on MaveDB.org.

      We thank the reviewer for these suggestions regarding data sharing and reproducibility. As requested, we have provided the custom codon optimization script as File S1 and the amino acid alignment used to perform codon harmonization as File S2. The sequence of the underlying variant cassette is included in the corresponding GenBank entry, and we have clarified this point in the legend of Figure 1. For each amino acid substitution, Twist Bioscience used a yeast-specific codon scheme with a single consistent codon per amino acid; accordingly, the sequence of each variant cassette can be inferred from the base construct and the specified amino acid change. A complete list of variant amino acid substitutions used in this study is provided in Table S3.
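
      To illustrate how a variant cassette sequence can be inferred from the base construct and a specified amino acid change, a minimal sketch follows. The codon table `EXAMPLE_CODONS`, the function name `apply_substitution`, and the toy coding sequence are all hypothetical placeholders; the actual yeast-specific codon scheme and the base construct are those described in the Methods and the GenBank entry, not the values shown here.

```python
# Hypothetical one-codon-per-amino-acid table, for illustration only.
# The scheme actually used for the variant library is not reproduced here.
EXAMPLE_CODONS = {
    "A": "GCT", "V": "GTT", "L": "TTG",
    "D": "GAT", "E": "GAA", "T": "ACT",
}


def apply_substitution(base_cds, position, new_aa, codon_table=EXAMPLE_CODONS):
    """Return base_cds with the codon at 1-based residue `position` replaced."""
    if len(base_cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    n_codons = len(base_cds) // 3
    if not 1 <= position <= n_codons:
        raise ValueError("position is outside the coding sequence")
    start = 3 * (position - 1)
    return base_cds[:start] + codon_table[new_aa] + base_cds[start + 3:]


# Toy coding sequence (Met-Ala-Leu-Glu); this is not the ASS1 sequence.
toy_cds = "ATGGCTTTGGAA"
variant_cds = apply_substitution(toy_cds, 3, "V")  # a Leu3Val substitution
print(variant_cds)
```

      Because each amino acid maps to a single codon in such a scheme, the list of substitutions plus the base construct is sufficient to regenerate every cassette sequence.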

      5. I find this manuscript very exciting as the authors have a compelling assay that identifies pathogenic variants, but I was generally disappointed by the quality and organization of the figures. For example, Figure 4 provides very little insight, but could be dramatically improved with an overlay of the normalized growth score data or highlighting variants surrounding the substrate or ATP interfaces. There are some very interesting aspects of this manuscript that could shine through with some polished figures.

      We thank the reviewer for this feedback and agree that clear and well-organized figures are essential for conveying the key results of the study. In response, we have substantially revised Figure 4 by adding colored overlays showing residue conservation and median normalized growth scores (new panels Figure 4C and 4D), which more directly link structural context to functional outcomes and highlight patterns surrounding the active site and substrate interfaces.

      I would also encourage the authors to generate a heatmap of the data represented in Figure 2 (see Fowler and Fields 2014 PMID 25075907, Figure 2); this would be a more helpful reference for readers.

      The reviewer also suggested that a heatmap representation, similar to that used in Fowler and Fields (2014), might aid interpretation of the data shown in Figure 2. Because our dataset consists of sparse single-amino acid substitutions rather than a complete mutational scan, such heatmaps are inherently less dense and less effective at conveying patterns than in saturation mutagenesis studies. Nevertheless, to aid readers who may find this visualization useful, we have generated and included a single-nucleotide variant heatmap as Supplemental Figure S1.

      My major comments are as follows: 6. Citations needed - especially in the introduction and for establishing that hASS is a homolog of yARG1

      We have added the requested citations and clarified the ASS1-ARG1 orthology in the Introduction, as described in our response to point 3 above.

      7. Generally, the authors do a nice job distinguishing the ASS1 gene from the ASS enzyme, though I found some ambiguities (Line 685). Please double-check the use of each throughout the manuscript.

      We have edited the manuscript to ensure consistent and unambiguous use of gene and enzyme nomenclature throughout.

      8. Generally, I'm confused about what strain was used for integrating all these variants, was it the arg1 knock-out strain from the yeast knockout collection or was it FY4? I think FY4 was used for the preliminary experiments, then the KO collection strain was used for making the variant library but I think this could be made more clear in the text and figures. Lines 226-229 describe introducing the hASS1 and yASS1 sequences into the native ARG1 locus in strain FY4, but the Fig1A image depicts the ASS1 variants going into arg1 KO locus. Fig1A should be moved to Fig2.

      We agree that the strain construction steps were not described as clearly as they could have been. We have therefore clarified the strain construction workflow in the Materials & Methods and Results sections, as well as in the Figure 1 legend, to explicitly distinguish preliminary experiments performed in strain FY4 from construction of the variant library in the arg1 knockout background.

      As we have also added an additional panel to Figure 1 that schematically explains how the screen was performed (per Reviewer #2's request), we believe that Figure 1A is appropriately placed and should remain in Figure 1.

      9. Line 303 - "We classify these variants as 'functionally unimpaired'", this is not an accurate description of these variants as Figure 2 highlights 24 pathogenic ClinVar variants that would fall into this category of "functionally unimpaired". The yeast growth assay appears to capture pathogenic variants, but there is likely some nuance of human ASS functionality that is not being assessed here. I would make the language more specific, e.g. "complementary to Arg1" or "growth-compatible".

      We agree that the label "functionally unimpaired" could be misinterpreted if read as implying clinical benignity. We have therefore clarified within the manuscript that this designation refers strictly to variant behavior in the yeast growth assay (i.e., wild-type-like growth under assay conditions) and does not imply absence of pathogenicity. We also expanded the Discussion to explicitly address the subset of clinically pathogenic variants with high growth scores (>0.85), consistent with a ceiling effect of the assay and a key limitation for interpretation. See response to reviewer #3 point 1. Relevant revisions are also discussed in our responses to Reviewers #1 and #2.

      10. Lines 345-355 - It is interesting that there are variants that appear functional at the substrate interfacing sites. Is there anything common across these variants? Are they maintaining the polarity or hydrophobicity of the WT residue? Are any of these variants included in ClinVar or gnomAD? Are pathogenic variants found at any of these sites?

      Yes. For highly sensitive active-site residues that have few permissible variants, the vast majority of amino acid substitutions that do retain activity preserve key physicochemical properties of the wild-type residue, such as hydrophobicity or charge. We have added this important observation to the manuscript:

      "Any variants at these sensitive residues that are permissive for activity in our assay retain hydrophobicity or charged states relative to the original amino acid side chain (Figure 5A & Table S5)."

      None of these variants are present in ClinVar. Only L15V and E191D are present in gnomAD (Table S4).

      11. Lines 423-430 - The OddsPath calculation would seem to rely heavily on the thresholds of .85 for normalized growth. The OddsPath calculation could be bolstered with some additional analysis that emphasizes the robustness to alternative thresholds.

      We agree that the sensitivity of the OddsPath calculation to the choice of growth thresholds is an important consideration. In our assay, benign ClinVar variants and non-human primate variants are observed exclusively within the peak centered on wild-type growth, whereas clinically annotated variants falling below this peak are exclusively pathogenic. On this basis, we defined the upper boundary of the assay range interpreted as supporting pathogenicity as the lower boundary of the wild-type-centered peak in the growth distribution (as defined in Figure 3), rather than selecting a cutoff by direct optimization of the OddsPath. This choice reflects the observed concordance, in our dataset, between the onset of measurable functional impairment in the assay and clinical pathogenic annotation. Importantly, in practice the OddsPath value is locally robust to the precise placement of this boundary, remaining invariant across the range 0.82-0.88. Supporting our chosen threshold of 0.85, the lowest-growth benign or primate variant observed has a normalized growth value of 0.88, while the lowest growth observed among variants present as homozygotes in gnomAD was 0.86. We have clarified this rationale and analysis in the revised manuscript.
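
      The kind of threshold-robustness check described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used in the study: the control set is hypothetical, the function names are invented for the example, and the formula OddsPath = [P2 x (1 - P1)] / [(1 - P2) x P1] is used as we understand the commonly applied formulation for calibrating functional evidence, with one benign control counted in the abnormal range when none is observed (an assumed convention that keeps the odds finite).

```python
def odds_path(n_path_total, n_benign_total, n_path_abnormal, n_benign_abnormal):
    """OddsPath for the functionally abnormal readout of an assay."""
    # P1: prior proportion of pathogenic variants among all controls.
    p1 = n_path_total / (n_path_total + n_benign_total)
    # Assumption: count at least one benign control as abnormal so that
    # a perfectly separating assay does not yield infinite odds.
    benign_abnormal = max(n_benign_abnormal, 1)
    # P2: proportion of pathogenic variants among abnormal-readout controls.
    p2 = n_path_abnormal / (n_path_abnormal + benign_abnormal)
    return (p2 * (1 - p1)) / ((1 - p2) * p1)


def odds_path_at(threshold, controls):
    """controls: list of (growth, is_pathogenic); abnormal = growth < threshold."""
    n_path = sum(1 for _, is_path in controls if is_path)
    n_benign = len(controls) - n_path
    path_abn = sum(1 for g, is_path in controls if is_path and g < threshold)
    benign_abn = sum(1 for g, is_path in controls if not is_path and g < threshold)
    return odds_path(n_path, n_benign, path_abn, benign_abn)


# Hypothetical control variants (growth, is_pathogenic); not study data.
controls = [
    (0.02, True), (0.10, True), (0.45, True), (0.70, True), (0.93, True),
    (0.90, False), (0.95, False), (1.00, False), (1.02, False),
]

# Scan candidate thresholds 0.82, 0.83, ..., 0.88.
thresholds = [t / 100 for t in range(82, 89)]
values = [odds_path_at(t, controls) for t in thresholds]
for t, v in zip(thresholds, values):
    print(f"threshold={t:.2f} OddsPath={v:.2f}")
```

      In this toy set no control variant has a growth value between 0.82 and 0.88, so the counts on either side of the cutoff, and therefore the OddsPath, do not change across the scanned range. The same logic underlies the local invariance reported above.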

      "Notably, the "Among all nine of the human ASS1 missense variants observed as homozygotes in gnomAD which were tested as amino acid substitutions in our assay, the lowest observed growth value was 0.86 (Ala258Val) consistent with the lower boundary of the PrimateAI variants which was a growth value of 0.87 (Ala81Thr) (Figure 6) and with our use of a 0.85 classification threshold."

      "If we treat PrimateAI variants as benign (solely for OddsPath calculation purposes), the OddsPath for growth

      12. Lines 432-441 - This is an interesting idea to use variants observed in primates, has ACMG weighed in on this? I understand that CTLN1 is an autosomal recessive disorder but I'd still be interested in seeing how the observed ASS1 missense variants in gnomAD perform in your growth assay, possibly a supplemental figure?

      To our knowledge, the ACMG/AMP guidelines do not currently address the use of homozygous missense variants observed in non-human primates. We are currently in discussion with two ClinGen working groups about the possibility of formalizing the use of this data source.

      We agree that comparison with human population data is also important. Accordingly, total gnomAD allele counts and homozygous counts for all applicable ASS1 missense variants are provided in Table S4, and the growth behavior of ASS1 missense variants observed in the homozygous state in gnomAD is shown in Figure 6. These homozygous variants uniformly exhibit high growth in our assay, consistent with the absence of strong loss-of-function effects. We have updated the manuscript text to clarify these points.

      Minor comments 1. Lines 53-59 - This paragraph needs to cite the literature, especially lines 56, 57, and 59 2. Line 61 - no need to repeat "citrullinemia type I", just use the abbreviation as it was introduced in the paragraph above 3. Lines 61-71 - again, this paragraph needs more literature citations 4. Line 62 - change to "results"

      The changes suggested in points 1-4 have all been implemented in the revised manuscript.

      5. Line 74-75 - "RUSP" acronym not needed as it's never used in the manuscript, the same goes for "HHS"

      We agree that the acronyms "RUSP" and "HHS" are not reused elsewhere in the manuscript. We have nevertheless retained them at first mention, alongside the expanded names, because these acronyms are commonly used in newborn screening and public health policy contexts and may be more familiar to some readers than the expanded terms. We would be happy to remove the acronyms if preferred.

      6. Line 86 - "ASS1" I think is referring to the enzyme and should just be "ASS"? If referring to the gene then italicize to "ASS1"
      7. Lines 91-93 - It would be helpful to mention this is a functional screen in yeast
      8. Line 101 - It would be helpful to the readers to define SD before using the acronym, consider changing to "minimal synthetic defined (SD) medium" and afterwards can refer to as "SD medium"
      9. 109-114 - It would be great if you could share your method for designing the codon-harmonized yASS1 gene, consider sharing as a supplemental script or creating a GitHub repository linked to a Zenodo DOI for publication.

      The changes suggested in points 6-9 have all been implemented in the revised manuscript. The codon harmonization script has been provided as File S1.

      10. Lines 135-137 - I think it's helpful to provide a full table of the cassettes ordered from Twist as well as the primers used to amplify them, consider a supplemental table.

      Details of the Twist cassettes and the primer sequences used for amplification have been added to the Materials & Methods.

      11. Line 138 - "standard methods" is a bit vague, I'm guessing this is a Gietz and Schiestl 2007 LiAc/ssDNA protocol (PMID 17401334)? Also, was ClonNAT used to select for natMX colonies?

      The reviewer is correct about which protocol was used, and we have added the citation. We have also clarified that selection was carried out based on resistance to nourseothricin.

      12. Line 150 - change to "sequence the entire open reading frame, as previously described [4]."
      13. Line 222-223 - remove "replace" and just use "complement" (and remove the parenthesis)
      14. Line 249 - It would be great to see a supplemental alignment of the hASS1 and yASS1 sequences.
      15. Line 261 - spelling "citrullemia" should be corrected to "citrullinemia"
      16. Line 280 - "using Oxford Nanopore sequencing" is a bit vague, I suggest specifying the equipment used (e.g. Oxford Nanopore Technologies MinION platform) or simplify to "via long-read sequencing (see Materials & Methods)"

      The changes suggested in points 12-16 have all been implemented in the revised manuscript. An alignment of the ASS and Arg1 protein sequences has been provided as File S2.

      17. Line 287-289 - It would be great to see the average number of isolates per variant, as well as a plot of the variant growth estimate vs individual isolate growth.

      We agree with the reviewer that conveying measurement precision is important. The number of isolates assayed per variant is provided in Table S4, and we have added explicit mention of this in the text. Because variants were assayed with a mixture of 1, 2, or ≥3 independent isolates, a scatterplot of variant-level growth estimates versus individual isolate measurements would be difficult to interpret and potentially misleading. Instead, we report standard error estimates for each variant in Table S4, derived from the linear model used to estimate growth effects, which more appropriately summarizes measurement uncertainty given the experimental design.

      18. Lines 324-25 - consider removing the last sentence of this paragraph, it is redundant as the following paragraph starts with the same statement.

      We have removed this sentence.

      19. Lines 327-335 - This is interesting and would benefit from its own subpanel or plot in which the normalized growth score is plotted against variants that are at conserved or diverse residues in human ASS, and see if there's a statistical difference in score between the two groupings.

      As suggested by the reviewer, we have added Supplemental Figure 2 (Figure S2) in which the normalized growth score of each variant is plotted against the conservation of the corresponding residue, as measured by ConSurf. The manuscript already includes a statistical analysis of the relationship between residue conservation and functional impact, showing that amorphic variants occur significantly more frequently at highly conserved residues than unimpaired variants do (one-sided Fisher's exact test). We now refer to this new supplemental figure in the relevant Results section.
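
      For readers unfamiliar with the test, a one-sided Fisher's exact test on a 2x2 table can be sketched in a few lines. The function name `fisher_one_sided` and the counts below are hypothetical and serve only to show the calculation; they are not the counts from this study.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] under the null.

    Rows could be amorphic vs unimpaired variants; columns could be
    highly conserved vs other residues.
    """
    row1 = a + b            # total in the first row
    col1 = a + c            # total in the first column
    n = a + b + c + d       # grand total
    max_a = min(row1, col1)
    total = comb(n, col1)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, max_a + 1))
    return tail / total


# Hypothetical counts for illustration only.
p_value = fisher_one_sided(3, 1, 1, 3)
print(f"one-sided p = {p_value:.4f}")
```

      A small p-value in this setup would indicate that the first-row category (here, amorphic variants) is enriched in the first-column category (highly conserved residues) beyond what chance alone would produce.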

      20. Lines 339-341 - As written, it is unclear if aspartate interacts with all of the same residues as citrulline or just Asn123 and Thr119.
      21. Lines 345-355 - As with my above comment, I find this interesting and would
      22. Line 353 - add a period to "al" in "Diez-Fernandez et al."

      The issues raised in points 20 and 22 have both been addressed. Point 21 appears to be truncated.

      23. Figure 1 a. Remove "Figure" from the subpanels and show just "A" and "B" (as you do for Figure 4) and combine the two images into a single image. Also make this correction to Figure 5 and Figure 8. b. Panel A - I thought the hASS1 and yASS1 were dropped into FY4, not the arg1 KO strain. This needs clarification. c. Panel A - I'm assuming the natMX cassette contains its own promoter, you could use a right-angled arrow to indicate where the promoters are in your construct. d. Panel B - I'm not sure the bar graph is necessary, it would be more helpful to see calculations of the colony size (or growth curves for each strain) and plot the raw values (maybe pixel counts?) for each replicate rather than normalizing to yeast ARG1. It would be great to have a supplemental figure showing all the replicates side-by-side. e. Panel B - Would be helpful to denote the pathogenic and benign ClinVar variants with an icon or colored text.

      f. Figure 1 Caption - make "A)" and "B)" bold.

      We have implemented the requested changes in Figure 1 with the following exceptions. We have retained panels A and B as separate subfigures because they illustrate distinct experimental concepts. In addition, we respectfully disagree with point (d). The bar graph is intended to provide a clear, high-level comparison of functional complementation by hASS1 versus yASS1 and to illustrate the gross differences in growth between benign and pathogenic proof-of-principle variants. As the bar graph includes error bars for standard deviations, presenting raw colony size measurements or growth curves for individual replicates would substantially complicate the figure without materially improving interpretability for this purpose.

      1. Figure 2 a. "Shown in magenta are amino acid substitutions corresponding to ClinVar pathogenic, pathogenic/likely pathogenic, and likely pathogenic variants" is repeated in the figure caption. b. "Shown in green are amino acid substitutions corresponding to ClinVar benign and likely benign variants." I don't see any green points. c. Identify the colors used for ASS1 substrate binding residues. d. This plot would benefit from a depiction of the human ASS secondary structure and any protein domains (nucleotide-binding domain, synthase domain, and C-terminal helix from Fig4B)

      e. Line 685 675 - "ASS1" is being used in reference to the enzyme, is this correct or should it be "ASS"?

      We have made the requested changes to Figure 2. The repeated caption text has been removed, and references to green points have been corrected to orange points to match the figure. The colors used to indicate ASS substrate-binding residues are explicitly described in the figure key. Secondary structure annotations have been added. References to the enzyme have been corrected to "ASS" rather than "ASS1" where appropriate.

      1. Figure 3 a. Rename the "unimpaired" category as there are several pathogenic ClinVar variants that fall into this category.

      To address this point, we have clarified the labeling by adding "in our yeast assay" to the figure legend, making explicit that the "unimpaired" category refers only to wild-type-like behavior under assay conditions and does not imply clinical benignity. See also response to Reviewer #3, Major Comment 1.

      1. Figure 4 a. List the PDB or AlphaFold accession used for this structure b. Panel A - state which colors are used to depict each monomer. It is confusing to see several shades of pink/purple used to depict a single monomer in Panel A. c. It is very difficult to make out the aspartate and citrulline substrates in the catalytic binding activity, consider making an inset zooming-in on this domain and displaying a ribbon diagram of the structure rather than the surface. d. Generally, it would be more helpful here to label any particular residues that were identified as pathogenic from your screen, or to overlay average growth scores per residue data onto the structure

      We have implemented the requested changes to Figure 4. The relevant PDB/AlphaFold accession is now listed, and the colors used to depict each monomer in Panel A are clarified in the figure legend. An inset focusing on the active site has been added to improve visualization of the citrulline and aspartate substrates. In addition, we have added new panels (Figure 4C and 4D) overlaying pathogenic residues and average growth scores onto the structure to more directly link structural context with functional data.

      1. Figure 5 a. Line 716 - Insert a page break to place Figure 5 on its own page b. I suggest using a heatmap for this type of plot, as it is very difficult to track which color corresponds to which residue.

      c. Fig5A - This plot could be improved by identifying which residue positions interface with which substrate.

      We have placed Figure 5 on its own page and added information to the legend identifying which residue positions interface with each substrate. We have retained the active-site variant strip charts raised in point (b), as we believe they effectively illustrate how the distribution of variant effects differs between residues. In addition, we have provided a supplemental heatmap showing variant growth across the entire protein (Figure S1), and individual variant scores for all residues are provided in Table S4.

      1. Figure 7 a. Line 735 - Insert page break to place figure on a new page

      b. List the PDB accession used for these images. c. For clarity I would mention "human ASS" in the figure title d. State the colors of the substrates e. Panels A and B could be combined into a single panel, making it easier to distinguish the active site and dimerization variants.

      f. Could be interesting to get SASA scores for the ClinVar structural variants to determine if they are surface-accessible

      We have implemented the requested changes in Figure 7 with the following exceptions. For point (e), there is no single orientation of the structure that allows a clear simultaneous view of both active-site and dimerization variants; accordingly, we have retained panels A and B as separate subfigures to preserve clarity. With respect to point (f), we agree that solvent accessibility analysis could be informative in other contexts. However, such an analysis does not integrate naturally with the functional and assay-based framework of the present study and was therefore not included.

      1. Figure 8 a. Panel B - overlay a square frame in the larger protein structure that depicts where the below inset is focused, and frame inset image as well.

      We have framed the inset image as requested. We did not add a corresponding frame to the full protein structure, as doing so obscured structural details in the region of interest.

      Reviewer #3 (Significance (Required)):

      Section 2 - Significance This study represents a substantial technical, functional, and translational advance in the interpretation of missense variation in ASS1, a gene of high clinical relevance for the rare disease citrullinemia type I. Its principal strength lies in the generation of an experimentally validated functional atlas of ASS1 missense variants that covers ~90% of all SNV-accessible substitutions. The scale, internal reproducibility, and careful benchmarking of the yeast complementation assay against known pathogenic and benign variants provide a robust foundation for identifying pathogenic ASS1 variants. Particularly strong aspects include the rigorous quality control of variant identities, the quantitative nature of the functional readout, and the thoughtful integration of results into the ACMG/AMP OddsPath framework. The discovery of intragenic complementation between variants affecting distinct structural regions of the enzyme is a notable conceptual and mechanistic contribution. Limitations include the assay's reduced sensitivity to variants impacting oligomerization or subtle folding defects, and the use of yeast as a heterologous system, which may mask disease-relevant mechanisms as several pathogenic ClinVar variants were found to be "functionally unimpaired". Future work extending functional testing to additional cellular contexts or expanding genotype-level combinatorial analyses would further enhance clinical applicability. Relative to prior studies, which have relied on small numbers of patient-derived variants or low-throughput biochemical assays, this work extends the field decisively by delivering a comprehensive, variant-resolved functional map for ASS1. 
      To the best of my current knowledge, this is the first systematic functional screen of ASS1 at this scale and the first direct experimental demonstration that ASS active sites span multiple subunits, enabling intragenic complementation consistent with Crick and Orgel's classic variant sequestration model. As such, the advance is simultaneously technical (high-throughput functional genomics), mechanistic (defining structural contributors to catalysis and epistasis), and clinical (enabling evidence-based reclassification of VUS). I find the use of homozygous non-human primate variants as an orthogonal benign calibration set both creative and controversial; my hope would be that this manuscript will prompt a productive discussion.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #3

      Evidence, reproducibility and clarity

      Summary

      This manuscript presents a comprehensive functional profiling of 2,193 ASS1 missense variants using a yeast complementation assay, providing valuable data for variant interpretation in the rare disease citrullinemia type I. The dataset is extensive, technically sound, and clinically relevant. The demonstration of intragenic complementation in ASS1 is novel and conceptually important. Overall, the study represents a substantial contribution to functional genomics and rare disease variant interpretation.

      Major comments

      This is an exciting paper as it can provide support to clinicians to make actionable decisions when diagnosing infants. I have a few major comments, but I want to emphasize that the label of "functionally unimpaired" variants is misleading. The authors explain that there are several pathogenic ClinVar variants that fall into this category (above the >.85 growth threshold) but I think this category needs a more specific name and I would ask the authors to reiterate the shortcomings of the assay again in the Discussion section. I think there's an important discussion to be had here: is the assay detecting variants that alter the function of ASS or is it detecting a complete ablation of enzymatic activity? The results might be strengthened with a follow-up experiment that identifies stably expressed ASS1 variants. At the very least, it would be great to see the authors replicate some of their interesting results from the high-throughput screen by down-selecting to ~12 variants of uncertain significance that could be newly considered pathogenic. I would ask the authors to provide more citations of the literature in the introduction of the manuscript. I would be especially interested in knowing more about human ASS being identified as a homolog of yeast ARG1, as they share little sequence similarity (27.5%) at the protein level. That said, I find the yeast complementation assay exciting. I appreciate the efforts made by the authors to share their work and make this study more reproducible, such as sharing the hASS1 and yASS1 plasmids on NCBI GenBank (Line 121) and publishing the ONT reads on SRA (Line 154). I have made requests for additional data to be shared, such as the custom method/code for codon optimization and a table of Twist variant cassettes that were ordered. I would also love to see these results shared on MaveDB.org.
      I find this manuscript very exciting as the authors have a compelling assay that identifies pathogenic variants, but I was generally disappointed by the quality and organization of the figures. For example, Figure 4 provides very little insight, but could be dramatically improved with an overlay of the normalized growth score data or highlighting variants surrounding the substrate or ATP interfaces. There are some very interesting aspects of this manuscript that could shine through with some polished figures. I would also encourage the authors to generate a heatmap of the data represented in Figure 2 (see Fowler and Fields 2014 PMID 25075907, Figure 2); this would be a more helpful reference for the readers.

      My major comments are as follows:

      1. Citations needed - especially in the introduction and for establishing that hASS is a homolog of yARG1
      2. Generally, the authors do a nice job distinguishing the ASS1 gene from the ASS enzyme, though I found some ambiguities (Line 685). Please double-check the use of each throughout the manuscript
      3. Generally, I'm confused about what strain was used for integrating all these variants: was it the arg1 knock-out strain from the yeast knockout collection or was it FY4? I think FY4 was used for the preliminary experiments, then the KO collection strain was used for making the variant library, but I think this could be made clearer in the text and figures. Lines 226-229 describe introducing the hASS1 and yASS1 sequences into the native ARG1 locus in strain FY4, but the Fig1A image depicts the ASS1 variants going into the arg1 KO locus. Fig1A should be moved to Fig2.
      4. Line 303 - "We classify these variants as 'functionally unimpaired'", this is not an accurate description of these variants as Figure 2 highlights 24 pathogenic ClinVar variants that would fall into this category of "functionally unimpaired". The yeast growth assay appears to capture pathogenic variants, but there is likely some nuance of human ASS functionality that is not being assessed here. I would make the language more specific, e.g. "complementary to Arg1" or "growth-compatible".
      5. Lines 345-355 - It is interesting that there are variants that appear functional at the substrate interfacing sites. Is there anything common across these variants? Are they maintaining the polarity or hydrophobicity of the WT residue? Are any of these variants included in ClinVar or gnomAD? Are pathogenic variants found at any of these sites?
      6. Lines 423-430 - The OddsPath calculation would seem to rely heavily on the thresholds of <.05 and >.85 for normalized growth. The OddsPath calculation could be bolstered with some additional analysis that emphasizes the robustness to alternative thresholds.
      7. Lines 432-441 - This is an interesting idea to use variants observed in primates, has ACMG weighed in on this? I understand that CTLN1 is an autosomal recessive disorder but I'd still be interested in seeing how the observed ASS1 missense variants in gnomAD perform in your growth assay, possibly a supplemental figure?
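
      To make the robustness check in comment 6 concrete, the following is a minimal sketch of how OddsPath could be recomputed across alternative thresholds. It assumes the OddsPath definition from the ClinGen functional-evidence recommendations as I understand it, OddsPath = [P2 x (1 - P1)] / [(1 - P2) x P1], where P1 is the proportion of pathogenic variants among all controls and P2 is the proportion of pathogenic variants among controls with an abnormal readout. The control scores and labels are hypothetical, and the function names are illustrative rather than taken from the manuscript.

```python
def odds_path(p1, p2):
    """OddsPath = [P2 * (1 - P1)] / [(1 - P2) * P1]."""
    if p2 >= 1.0:
        # No misclassified controls in the abnormal bin: unbounded as written.
        return float("inf")
    return (p2 * (1.0 - p1)) / ((1.0 - p2) * p1)


def odds_path_pathogenic(controls, threshold):
    """controls: iterable of (normalized_growth, is_pathogenic) pairs.

    Variants scoring below `threshold` are treated as functionally abnormal.
    Assumes the control set contains both pathogenic and benign variants.
    Returns None if no control falls below the threshold.
    """
    controls = list(controls)
    p1 = sum(1 for _, is_path in controls if is_path) / len(controls)
    abnormal = [is_path for score, is_path in controls if score < threshold]
    if not abnormal:
        return None
    p2 = sum(abnormal) / len(abnormal)
    return odds_path(p1, p2)


# Hypothetical control variants: (normalized growth, ClinVar pathogenic?)
CONTROLS = [
    (0.01, True), (0.02, True), (0.03, True), (0.10, True), (0.88, True),
    (0.04, False), (0.90, False), (0.92, False), (0.95, False), (1.00, False),
]

for threshold in (0.03, 0.05, 0.10, 0.15):
    print(threshold, odds_path_pathogenic(CONTROLS, threshold))
```

      Reporting how the resulting evidence strength changes (or does not change) over a plausible range of thresholds would address the concern directly; the same sweep could be applied to the upper threshold for the benign direction.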

      Minor comments

      1. Lines 53-59 - This paragraph needs to cite the literature, especially lines 56, 57, and 59
      2. Line 61 - no need to repeat "citrullinemia type I", just use the abbreviation as it was introduced in the paragraph above
      3. Lines 61-71 - again, this paragraph needs more literature citations
      4. Line 62 - change to "results"
      5. Line 74-75 - "RUSP" acronym not needed as it's never used in the manuscript, the same goes for "HHS"
      6. Line 86 - "ASS1" I think is referring to the enzyme and should just be "ASS"? If referring to the gene then italicize to "ASS1"
      7. Lines 91-93 - It would be helpful to mention this is a functional screen in yeast
      8. Line 101 - It would be helpful to the readers to define SD before using the acronym, consider changing to "minimal synthetic defined (SD) medium" and afterwards can refer to as "SD medium"
      9. 109-114 - It would be great if you could share your method for designing the codon-harmonized yASS1 gene, consider sharing as a supplemental script or creating a GitHub repository linked to a Zenodo DOI for publication.
      10. Lines 135-137 - I think it's helpful to provide a full table of the cassettes ordered from Twist as well as the primers used to amplify them, consider a supplemental table
      11. Line 138 - "standard methods" is a bit vague, I'm guessing this is a Gietz and Schiestl 2007 LiAc/ssDNA protocol (PMID 17401334)? Also, was ClonNAT used to select for natMX colonies?
      12. Line 150 - change to "sequence the entire open reading frame, as previously described [4]."
      13. Line 222-223 - remove "replace" and just use "complement" (and remove the parenthesis)
      14. Line 249 - It would be great to see a supplemental alignment of the hASS1 and yASS1 sequences
      15. Line 261 - spelling "citrullemia" should be corrected to "citrullinemia"
      16. Line 280 - "using Oxford Nanopore sequencing" is a bit vague, I suggest specifying the equipment used (e.g. Oxford Nanopore Technologies MinION platform) or simplify to "via long-read sequencing (see Materials & Methods)"
      17. Line 287-289 - It would be great to see the average number of isolates per variant, as well as a plot of the variant growth estimate vs individual isolate growth
      18. Lines 324-25 - consider removing the last sentence of this paragraph, it is redundant as the following paragraph starts with the same statement
      19. Lines 327-335 - This is interesting and would benefit from its own subpanel or plot in which the normalized growth score is plotted against variants that are at conserved or diverse residues in human ASS, and see if there's a statistical difference in score between the two groupings
      20. Lines 339-341 - As written, it is unclear if aspartate interacts with all of the same residues as citrulline or just Asn123 and Thr119.
      21. Lines 345-355 - As with my above comment, I find this interesting and would
      22. Line 353 - add a period to "al" in "Diez-Fernandex et al."
      23. Figure 1

      a. Remove "Figure" from the subpanels and show just "A" and "B" (as you do for Figure 4) and combine the two images into a single image. Also make this correction to Figure 5 and Figure 8

      b. Panel A - I thought the hASS1 and yASS1 were dropped into FY4, not the arg1 KO strain. This needs clarification

      c. Panel A - I'm assuming the natMX cassette contains its own promoter, you could use a right-angled arrow to indicate where the promoters are in your construct

      d. Panel B - I'm not sure the bar graph is necessary, it would be more helpful to see calculations of the colony size (or growth curves for each strain) and plot the raw values (maybe pixel counts?) for each replicate rather than normalizing to yeast ARG1. It would be great to have a supplemental figure showing all the replicates side-by-side

      e. Panel B - Would be helpful to denote the pathogenic and benign ClinVar variants with an icon or colored text

      f. Figure 1 Caption - make "A)" and "B)" bold

      24. Figure 2

      a. "Shown in magenta are amino acid substitutions corresponding to ClinVar pathogenic, pathogenic/likely pathogenic, and likely pathogenic variants" is repeated in the figure caption

      b. "Shown in green are amino acid substitutions corresponding to ClinVar benign and likely benign variants." I don't see any green points

      c. Identify the colors used for ASS1 substrate binding residues

      d. This plot would benefit from a depiction of the human ASS secondary structure and any protein domains (nucleotide-binding domain, synthase domain, and C-terminal helix from Fig4B)

      e. Line 685 - "ASS1" is being used in reference to the enzyme, is this correct or should it be "ASS"?

      25. Figure 3

      a. Rename the "unimpaired" category as there are several pathogenic ClinVar variants that fall into this category

      26. Figure 4

      a. List the PDB or AlphaFold accession used for this structure

      b. Panel A - state which colors are used to depict each monomer. It is confusing to see several shades of pink/purple used to depict a single monomer in Panel A

      c. It is very difficult to make out the aspartate and citrulline substrates in the catalytic binding activity, consider making an inset zooming-in on this domain and displaying a ribbon diagram of the structure rather than the surface.

      d. Generally, it would be more helpful here to label any particular residues that were identified as pathogenic from your screen, or to overlay average growth scores per residue data onto the structure

      27. Figure 5

      a. Line 716 - Insert a page break to place Figure 5 on its own page

      b. I suggest using a heatmap for this type of plot, as it is very difficult to track which color corresponds to which residue

      c. Fig5A - This plot could be improved by identifying which residue positions interface with which substrate

      28. Figure 7

      a. Line 735 - Insert page break to place figure on a new page

      b. List the PDB accession used for these images

      c. For clarity I would mention "human ASS" in the figure title

      d. State the colors of the substrates

      e. Panels A and B could be combined into a single panel, making it easier to distinguish the active site and dimerization variants

      f. Could be interesting to get SASA scores for the ClinVar structural variants to determine if they are surface-accessible

      29. Figure 8

      a. Panel B - overlay a square frame in the larger protein structure that depicts where the below inset is focused, and frame inset image as well.

      Significance

      This study represents a substantial technical, functional, and translational advance in the interpretation of missense variation in ASS1, a gene of high clinical relevance for the rare disease citrullinemia type I. Its principal strength lies in the generation of an experimentally validated functional atlas of ASS1 missense variants that covers ~90% of all SNV-accessible substitutions. The scale, internal reproducibility, and careful benchmarking of the yeast complementation assay against known pathogenic and benign variants provide a robust foundation for identifying pathogenic ASS1 variants. Particularly strong aspects include the rigorous quality control of variant identities, the quantitative nature of the functional readout, and the thoughtful integration of results into the ACMG/AMP OddsPath framework. The discovery of intragenic complementation between variants affecting distinct structural regions of the enzyme is a notable conceptual and mechanistic contribution. Limitations include the assay's reduced sensitivity to variants impacting oligomerization or subtle folding defects, and the use of yeast as a heterologous system, which may mask disease-relevant mechanisms as several pathogenic ClinVar variants were found to be "functionally unimpaired". Future work extending functional testing to additional cellular contexts or expanding genotype-level combinatorial analyses would further enhance clinical applicability.

      Relative to prior studies, which have relied on small numbers of patient-derived variants or low-throughput biochemical assays, this work extends the field decisively by delivering a comprehensive, variant-resolved functional map for ASS1. To the best of my current knowledge, this is the first systematic functional screen of ASS1 at this scale and the first direct experimental demonstration that ASS active sites span multiple subunits, enabling intragenic complementation consistent with Crick and Orgel's classic variant sequestration model. As such, the advance is simultaneously technical (high-throughput functional genomics), mechanistic (defining structural contributors to catalysis and epistasis), and clinical (enabling evidence-based reclassification of VUS). I find the use of homozygous non-human primate variants as an orthogonal benign calibration set both creative and controversial; my hope would be that this manuscript will prompt a productive discussion.

    1. You should always say, ma’am and sir. You should never say, ma’am and sir.

      Points like this remind us that what we consider as "right" or "proper" or even kind can come across as offensive or blatantly wrong to others. What does it look like for us to be humble and open enough to the fact that our conceptions of what is acceptable may not be as objective as we think?

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      (1) Legionella effectors are often activated by binding to eukaryote-specific host factors, including actin. The authors should test the following: a) whether Lfat1 can fatty acylate small G-proteins in vitro; b) whether this activity is dependent on actin binding; and c) whether expression of the Y240A mutant in mammalian cells affects the fatty acylation of Rac3 (Figure 6B), or other small G-proteins.

      We were not able to express and purify full-length recombinant Lfat1 to perform fatty acylation of small GTPases in vitro. However, when overexpressed in cellulo, the Y240A mutant retained the ability to fatty acylate Rac3 and another small GTPase, RheB (see Figure 6-figure supplement 2). We postulate that under infection conditions, actin binding might be required to fatty acylate certain GTPases because only a small amount of effector protein is secreted into the host cell.

      (2) It should be demonstrated that lysine residues on small G-proteins are indeed targeted by Lfat1. Ideally, the functional consequences of these modifications should also be investigated. For example, does fatty acylation of G-proteins affect GTPase activity or binding to downstream effectors?

      We have mutated K178 on RheB and shown that this mutation abolishes its fatty acylation by Lfat1 (see Author response image 1 below). We were not able to test whether fatty acylation by Lfat1 affects downstream effector binding.

      Author response image 1.

      (3) Line 138: Can the authors clarify whether the Lfat1 ABD induces bundling of F-actin filaments or promotes actin oligomerization? Does the Lfat1 ABD form multimers that bring multiple filaments together? If Lfat1 induces actin oligomerization, this effect should be experimentally tested and reported. Additionally, the impact of Lfat1 binding on actin filament stability should be assessed. This is particularly important given the proposed use of the ABD as an actin probe.

      The ABD domain does not form oligomers, as evidenced by its gel filtration profile. However, we do see F-actin bundling in our in vitro F-actin polymerization experiment when both actin and ABD are at high concentration (data not shown). At low concentrations of ABD, there is no aggregation or bundling of F-actin.

      (4) Line 180: I think it's too premature to refer to the interaction as having "high specificity and affinity." We really don't know what else it's binding to.

      We have revised the text and reworded the sentence by removing "high specificity and affinity."

      (5) The authors should reconsider the color scheme used in the structural figures, particularly in Figures 2D and S4.

      We are not sure what the comment on the color scheme of the structural figures refers to.

      (6) In Figure 3E, the WT curve fits the data poorly, possibly because the actin concentration exceeds the Kd of the interaction. It might fit better to a quadratic.

      We have performed quadratic fitting and replaced Figure 3E.
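
      For context on why a quadratic model can fit better in this regime, the following is a minimal sketch comparing the standard quadratic (ligand-depletion) binding equation with the simple hyperbolic isotherm. It assumes 1:1 binding between a labeled species held at fixed total concentration and a titrated partner; the variable names and concentrations are illustrative and are not taken from the manuscript or from Figure 3E.

```python
from math import sqrt


def fraction_bound_quadratic(p_total, l_total, kd):
    """Fraction of P in complex for 1:1 binding, without assuming free L = total L.

    [PL] = ((Pt + Lt + Kd) - sqrt((Pt + Lt + Kd)^2 - 4 * Pt * Lt)) / 2
    All concentrations must share the same units; p_total must be > 0.
    """
    s = p_total + l_total + kd
    complex_conc = (s - sqrt(s * s - 4.0 * p_total * l_total)) / 2.0
    return complex_conc / p_total


def fraction_bound_hyperbolic(l_total, kd):
    """Simple isotherm, valid only when P is far below Kd (no ligand depletion)."""
    return l_total / (kd + l_total)


# Illustrative values (arbitrary units): fixed species at a concentration equal to Kd.
kd = 1.0
p_total = 1.0
for l_total in (0.25, 0.5, 1.0, 2.0, 4.0):
    quad = fraction_bound_quadratic(p_total, l_total, kd)
    hyp = fraction_bound_hyperbolic(l_total, kd)
    print(f"L={l_total:<5} quadratic={quad:.3f} hyperbolic={hyp:.3f}")
```

      When the fixed species is present at a concentration comparable to or above Kd, the hyperbolic form overestimates binding at a given total concentration of the titrated partner, which is one plausible reason a quadratic fit would describe such data better.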

      (7) The authors propose that the individual helices of the Lfat1 ABD could be expressed on separate proteins and used to target multi-component biological complexes to F-actin by genetically fusing each component to a split alpha-helix. This is an intriguing idea, but it should be tested as a proof of concept to support its feasibility and potential utility.

      It is a good suggestion. We plan to thoroughly test the feasibility of this idea as one of our future directions.

      (8) The plot in Figure S2D appears cropped on the X-axis or was generated from a ~2× binned map rather than the deposited one (pixel size ~0.83 Å, plot suggests ~1.6 Å). The reported pixel size is inconsistent between the Methods and Table 1; please clarify whether 0.83 Å refers to super-resolution.

      Yes, 0.83 Å is the super-resolution pixel size. We have updated the cryo-EM table accordingly.

      Reviewer #2:

      Weaknesses:

      (1) The authors should use biochemical reactions to analyze the KFAT activity of Lfat1 on one or two small GTPases shown to be modified by this effector in cellulo. Such reactions may allow them to determine the role of actin binding in its biochemical activity. This notion is particularly relevant in light of recent studies showing that actin is a co-factor for the activity of LnaB and Ceg14 (PMID: 39009586; PMID: 38776962; PMID: 40394005). In addition, the study should be discussed in the context of these recent findings on the role of actin in the activity of L. pneumophila effectors.

      We have new data showing that actin binding does not affect Lfat1 enzymatic activity (see response to Reviewer #1). We have added these data to the paper as Figure S7. Accordingly, we also revised the Discussion by adding the following paragraph.

      “The discovery of Lfat1 as an F-actin–binding lysine fatty acyl transferase raised the intriguing question of whether its enzymatic activity depends on F-actin binding. Recent studies have shown that other Legionella effectors, such as LnaB and Ceg14, use actin as a co-factor to regulate their activities. For instance, LnaB binds monomeric G-actin to enhance its phosphoryl-AMPylase activity toward phosphorylated residues, resulting in unique ADPylation modifications in host proteins  (Fu et al, 2024; Wang et al, 2024). Similarly, Ceg14 is activated by host actin to convert ATP and dATP into adenosine and deoxyadenosine monophosphate, thereby modulating ATP levels in L. pneumophila–infected cells (He et al, 2025). However, this does not appear to be the case for Lfat1. We found that Lfat1 mutants defective in F-actin binding retained the ability to modify host small GTPases when expressed in cells (Figure S7). These findings suggest that, rather than serving as a co-factor, F-actin may serve to localize Lfat1 via its actin-binding domain (ABD), thereby confining its activity to regions enriched in F-actin and enabling spatial specificity in the modification of host targets.”

      (2) The development of the ABD domain of Lfat1 as an F-actin probe is a nice extension of the biochemical and structural experiments. The authors need to compare the new probe to commonly used ones, such as Lifeact, in labeling of the actin cytoskeleton structure.

      We fully agree with the reviewer’s insightful suggestion. However, a direct comparison of the Lfat1 ABD domain with commonly used actin probes such as Lifeact, as well as evaluation of the split α-helix probe (as suggested by Reviewer #1), would require extensive and technically demanding experiments. These are important directions that we plan to pursue in future studies.

      For all other minor points, we have made corrections/changes in our revised text and figures.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript by Yamamoto et al. presents a model by which the four main axes of the limb are required for limb regeneration to occur in the axolotl. A longstanding question in regeneration biology is how existing positional information is used to regenerate the correct missing elements. The limb provides an accessible experimental system by which to study the involvement of the anteroposterior, dorsoventral, and proximodistal axes in the regenerating limb. Extensive experimentation has been performed in this area using grafting experiments. Yamamoto et al. use the accessory limb model and some molecular tools to address this question. There are some interesting observations in the study. In particular, the potent induction of accessory limbs in the dorsal axis with BMP2+Fgf2+Fgf8 is very interesting. However, the study makes bold claims about determining the molecular basis of DV positional cues, but the experimental evidence is not definitive and does not take into account the previous work on DV patterning in the amniote limb. Also, testing the hypothesis on blastemas after limb amputation would be needed to support the strong claims in the study.

      Strengths:

      The manuscript presents some novel phenotypes generated in axolotl limbs due to Wnt signaling. This is generally the first example in which Wnt signaling has provided a gain of function in the axolotl limb model. They also present a potent way of inducing limb patterning in the dorsal axis by the addition of just beads loaded with Bmp2+Fgf8+Fgf2.

      Comments on revised version:

      Re-evaluation: The authors have significantly improved the manuscript and their conclusions reflect the current state of knowledge in DV patterning of tetrapod limbs. My only point of consideration is their claim of mesenchymal and epithelial expression of Wnt10b and the finding that Fgf2 and Wnt10b are lowly expressed. It is based upon the failed ISH, but this doesn't mean they aren't expressed. In interpreting the Li et al. scRNAseq dataset, conclusions depend heavily on how one analyzes and interprets it. The 7 dpa sample shows a very low representation of epithelial cells compared to other time points, but this is likely a technical issue. Even the epithelial marker, Krt17, and the CT/fibroblast marker show some expression elsewhere. If other time points are included in the analysis, Wnt10b would be interpreted as relatively highly expressed almost exclusively in the epithelium. By selecting the 7 dpa timepoint, which may or may not represent the MB stage as it wasn't shown in the paper, the conclusions may be based upon incomplete data. I don't expect the authors to do more work, but it is worth mentioning this possibility. The authors have considered and made efforts to resolve previous concerns.

      We are grateful for the constructive comments. As Reviewer #1 suggested, we noted that clearer expression patterns of Wnt10b and Fgf2 may be detectable in scRNA-seq analyses at other stages, and we also clarified that low-level signals of epithelial and CT/fibroblast markers outside their expected clusters may reflect technical bias in the Discussion section. In addition, we agree with the reviewer’s point that our unsuccessful ISH experiments and the low abundance detected by RT-qPCR do not demonstrate absence of expression, and that conclusions from reanalyzing the Li et al. scRNA-seq dataset can depend strongly on analytical choices; therefore, while we focused on the 7 dpa sample because our RT-qPCR data suggested that Wnt10b and Fgf2 may be most enriched around the MB stage (the original study refers to 7 dpa as MB), we explicitly acknowledged that analyzing a single time point—especially one with a low representation of epithelial cells—may yield incomplete or stage-biased interpretations, and that inclusion of additional datasets could reveal clearer and potentially different expression patterns in the Discussion section. We also tempered our wording regarding the inferred cellular sources to avoid over-interpretation based on the current data in the Results section.

      Reviewer #2 (Public review):

      Summary:

      This study explores how signals from all sides of a developing limb, front/back and top/bottom, work together to guide the regrowth of a fully patterned limb in axolotls, a type of salamander known for its impressive ability to regenerate limbs. Using a model called the Accessory Limb Model (ALM), the researchers created early staged limb regenerates (called blastemas) with cells from different sides of the limb. They discovered that successful limb regrowth only happens when the blastema contains cells from both the top (dorsal) and bottom (ventral) of the limb. They also found that a key gene involved in front/back limb patterning, called Shh (Sonic hedgehog), is only turned on when cells from both the dorsal and ventral sides come into contact. The study identified two important molecules, Wnt10B and FGF2, that help activate Shh when dorsal and ventral cells interact. Finally, the authors propose a new model that explains how cells from all four sides of a limb, dorsal, ventral, anterior (front), and posterior (back), contribute at both the cellular and molecular level to rebuilding a properly structured limb during regeneration.

      Strengths:

      The techniques used in this study, like delicate surgeries, tissue grafting, and implanting tiny beads soaked with growth factors, are extremely difficult, and only a few research groups in the world can do them successfully. These methods are essential for answering important questions about how animals like axolotls regenerate limbs with the correct structure and orientation. To understand how cells from different sides of the limb communicate during regeneration, the researchers used a technique called in situ hybridization, which lets them see where specific genes are active in the developing limb. They clearly showed that the gene Shh, which helps pattern the front and back of the limb, only turns on when cells from both the top (dorsal) and bottom (ventral) sides are present and interacting. The team also took a broad, unbiased approach to figure out which signaling molecules are unique to dorsal and ventral limb cells. They tested these molecules individually and discovered which could substitute for actual dorsal and ventral cells, providing the same necessary signals for proper limb development. Overall, this study makes a major contribution to our understanding of how complex signals guide limb regeneration, showing how different regions of the limb work together at both the cellular and molecular levels to rebuild a fully patterned structure.

      Weaknesses:

      Because the expressional analyses are performed on thin sections of regenerating tissue, in the original manuscript, they provided only a limited view of the gene expression patterns in their experiments, opening the possibility that they could be missing some expression in other regions of the blastema. Additionally, the quantification method of the expressional phenotypes in most of the experiments did not appear to be based on a rigorous methodology. The authors' inclusion of an alternate expression analysis, qRT-PCR, on the entire blastema helped validate that the authors are not missing something in the revised manuscript.

      Overall, the number of replicates per sample group in the original manuscript was quite low (sometimes as low as 3), which was especially risky with challenging techniques like the ones the authors employ. The authors have improved the rigor of the experiment in the revised manuscript by increasing the number of replicates. The authors have not performed a power analysis to calculate the number of animals used in each experiment that is sufficient to identify possible statistical differences between groups. However, the authors have indicated that there was not sufficient preliminary data to appropriately make these quantifications.
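
      To make the power-analysis point concrete, the following is a minimal sketch of how a per-group sample size could be estimated once pilot data exist. It is not the authors' or the reviewer's method. It assumes a two-sided, two-sample comparison of means under a normal approximation, and the function name `animals_per_group` and the default `alpha` and `power` values are illustrative choices only.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate animals needed per group for a two-sample comparison.

    effect_size: assumed standardized difference (Cohen's d), i.e. the
                 difference in group means divided by the pooled SD.
    alpha:       two-sided significance level (hypothetical default).
    power:       desired probability of detecting the effect (hypothetical default).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = standard_normal.inv_cdf(power)           # quantile for the target power
    # Normal-approximation formula: n = 2 * ((z_alpha + z_beta) / d)^2 per group
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
```

      The required input is an estimate of the effect size, which is exactly what is unavailable without preliminary data. Note that the normal approximation tends to be optimistic for very small groups, so a t-distribution-based calculation would normally be preferred in practice.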

      Likewise, in the original manuscript, the authors used an AI-generated algorithm to quantify symmetry on the dorsal/ventral axis, and my concern was that this approach doesn't appear to account for possible biases due to tissue sectioning angles. They also seem to arbitrarily pick locations in each sample group to compare symmetry measurements. There are other methods, which include using specific muscle groups and nerve bundles as dorsal/ventral landmarks, that would more clearly show differences in symmetry. The authors have now sufficiently addressed this concern by including transverse sections of the limbs and have explained the limitations of using a landmark-based approach in their quantification strategy.

      We are grateful for the careful evaluation of the technical rigor and quantification. We have benefited from the reviewer’s earlier feedback, which guided revisions that improved the manuscript’s rigor and presentation.

      Reviewer #3 (Public review):

      Summary:

      After salamander limb amputation, the cross-section of the stump has two major axes: anterior-posterior and dorsal-ventral. Cells from all axial positions (anterior, posterior, dorsal, ventral) are necessary for regeneration, yet the molecular basis for this requirement has remained unknown. To address this gap, Yamamoto et al. took advantage of the ALM assay, in which defined positional identities can be combined on demand and their effects assessed through the outgrowth of an ectopic limb. They propose a compelling model in which dorsal and ventral cells communicate by secreting Wnt10b and Fgf2 ligands respectively, with this interaction inducing Shh expression in posterior cells. Shh was previously shown to induce limb outgrowth in collaboration with anterior Fgf8 (PMID: 27120163). Thus, this study completes a concept in which four secreted signals from four axial positions interact for limb patterning. Notably, this work firmly places dorsal-ventral interactions upstream of anterior-posterior, which is striking for a field that has been focussed on anterior-posterior communication. The ligands identified (Wnt10b, Fgf2) are different to those implicated in dorsal-ventral patterning in the non-regenerative mouse and chick models. The strength of this study is in the context of ALM/ectopic limb engineering. Although the authors attempt to assay the expression of Wnt10b and Fgf2 during limb regeneration after amputation, they were unable to pinpoint the precise expression domains of these genes beyond 'dorsal' and 'ventral' blastema. Given that experimental perturbations were not performed in regenerating limbs - almost exclusively under ALM conditions - this author finds the title "Dorsoventral-mediated Shh induction is required for axolotl limb regeneration" a little misleading.

      Strengths:

      (1) The ALM and use of GFP grafts for lineage tracing (Figures 1-3) take full advantage of the salamander model's unique ability to outgrow patterned limbs under defined conditions. As far as I am aware, the ALM has not been combined with precise grafts that assay 2 axial positions at once, as performed in Figure 3. The number of ALMs performed in this study deserves special mention, considering the challenging surgery involved.

      (2) The authors identify that posterior Shh is not expressed unless both dorsal and ventral cells are present. This echoes previous work in mouse limb development models (AER/ectoderm-mesoderm interaction) but this link between axes was not known in salamanders. The authors elegantly reconstitute dorsal-ventral communication by grafting, finding that this is sufficient to trigger Shh expression (Figure 3 - although see also section on Weaknesses).

      (3) Impressively, the authors discovered two molecules sufficient to substitute dorsal or ventral cells through electroporation into dorsal- or ventral- depleted ALMs (Figure 5). These molecules did not change the positional identity of target cells. The same group previously identified the ventral factor (Fgf2) to be a nerve-derived factor essential for regeneration. In Figure 6, the authors demonstrate that nerve-derived factors, including Fgf2, are alone sufficient to grow out ectopic limbs from a dorsal wound. Limb induction with a 3-factor cocktail without supplementing with other cells is conceptually important for regenerative engineering.

      (4) The writing style and presentation of results is very clear.

      Overall appraisal:

      This is a logical and well-executed study that creatively uses the axolotl model to advance an important framework for understanding limb patterning. The relevance of the mechanisms to normal limb regeneration are not yet substantiated, in the opinion of this reviewer. Additionally, Wnt10b and Fgf2 should be considered molecules sufficient to substitute dorsal and ventral identity (solely in terms of inducing Shh expression). It is not yet clear whether these molecules are truly necessary (loss of function would address this).

      Comments on revisions:

      Congratulations - I still find this an elegant and easy-to-read study with significant implications for the field! Linking your mechanisms to normal limb regeneration (i.e. regenerating blastema, not ALM), as well as characterising the cell populations involved, will be interesting directions for the future.

      We are grateful for the constructive comments. To mitigate the concerns raised by Reviewer #3, we cited previous studies in which the ALM was used as an alternative experimental system for studying limb regeneration (Nacu et al., 2016, Nature, PMID: 27120163; Satoh et al., 2007, Developmental Biology, PMID: 17959163) in the Introduction section. We are confident that our ALM-based data provide a reasonable basis for understanding limb regeneration. We agree that there are important remaining questions, such as which cell populations express Wnt10b and Fgf2 and how endogenous WNT10B and FGF2 signals induce Shh expression in normal regeneration, which should be investigated in future studies to deepen our understanding of limb regeneration.


      The following is the authors’ response to the original reviews.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The authors should be commended for addressing this gap - how cues from the DV axis interact with the AP axis during limb regeneration. Overall, the concept presented in this manuscript is extremely interesting and could be of high value to the field. However, the manuscript in its current form is lacking a few important data and resolution to fully support their conclusions, and the following needs to be addressed before publication:

      (1) ISH data on Wnt10b and FGF2 from various regeneration time points are essential to derive the conclusion. Preferably multiplex ISH of Wnt10b/Fgf2/Shh or at least canonical ISH on serial sections to demonstrate their expression in dermis/epidermis and order of gene expression i.e. Shh is only expressed after expression of Wnt10b/FGF2. It would certainly help if this can also be shown in regular blastema.

      We are grateful for the constructive suggestion on assessing Wnt10b and Fgf2 expression during regular regeneration, and we agree that clarifying their expression patterns in regular blastemas is important for strengthening the conclusions of our study. Because we cannot currently ensure sufficient sensitivity with multiplex FISH in our laboratory (partly due to high background), we conducted conventional ISH on serial sections of regular blastemas at several time points (Fig. S5A). However, the expression patterns of Wnt10b and Fgf2 were not clear. To complement the ISH results, we performed RT-qPCR on microdissected dorsal and ventral halves of regular blastemas at the MB stage (Fig. S5B). We found that Wnt10b and Fgf2 were expressed at significantly higher levels in the dorsal and ventral halves, respectively, compared to the opposite half. This dorsal/ventral biased expression of Wnt10b/Fgf2 is consistent with our RNA-seq data. We further quantified expression levels of Wnt10b, Fgf2, and Shh across stages (intact, EB, MB, LB, and ED) and found that Wnt10b and Fgf2 peaked at the MB stage, whereas Shh peaked at the LB stage, which addresses the editor’s request regarding the order of gene expression (Fig. S5C). This temporal offset in upregulation supports our model. These results are now included in the revised manuscript (Line 294‒306).

      To identify the cell types expressing Wnt10b or Fgf2, we analyzed published single-cell RNA-seq data (7 dpa blastema (MB), Li et al., 2021). As a result, Fgf2 expression was observed in the mesenchymal cluster, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. The apparent low abundance likely contributes to the weak ISH signals and reflects current technical limitations. In addition, Wnt10b and Fgf2 expression did not follow Lmx1b expression (Fig. S6J, K), and Wnt10b and Fgf2 themselves were not exclusive (Fig. S6L). These results are now included in the revised manuscript (Line 307‒321). Together with the RT-qPCR data (Fig. S5B), these results suggest that Wnt10b and Fgf2 are not exclusively confined to purely dorsal or ventral cells at the single-cell level, even though they show dorsoventral bias when assessed in bulk tissue. These results suggest that Wnt10b/Fgf2 expression is not restricted to dorsal/ventral cells but mediated by dorsal/ventral cells, and co-existence of both signals should provide a permissive environment for Shh induction. Defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will therefore be an important goal for future work.  

      (2) Validation of the absence of gene expression via qRT PCR in the given sample will increase the rigor, as suggested by reviewers.

      We thank the editor for this important suggestion and agree that validation by qRT-PCR increases the rigor of our study. Accordingly, we performed RT-qPCR on AntBL, PostBL, DorBL, and VentBL to corroborate the ISH results. The results are now included in Fig. 2. We also verified Shh expression following electroporation by RT-qPCR, and the quantitative results are now provided in Fig. 5.

      (3) Please increase n for experiments where necessary and mention n values in the figures.

      We thank the editor for this helpful comment and agree on the importance of providing sufficient sample sizes. Accordingly, we increased the n for the relevant experiments and have indicated the n values in the corresponding figure legends.

      (4) Most comments by all three reviewers are constructive and largely focus on improving the tone and language of the manuscript, and I expect that the authors should take care of them.

      We thank the reviewers for their constructive feedback on the tone and language of the manuscript. We have carefully revised the text according to each comment, and we hope these modifications have improved both clarity and readability.

      In addition, in revising the manuscript we also refined the conceptual framework. Our new analysis of Wnt10b and Fgf2 expression during normal regeneration suggests that these genes are not expressed in a strictly dorsal- or ventral-specific manner at the single-cell level. When these observations are considered together with (i) the RNA-seq comparison of dorsally and ventrally induced ALM blastemas, (ii) RT-qPCR of microdissected dorsal and ventral halves of regenerating blastemas, and (iii) the functional electroporation experiments, our interpretation is that Wnt10b and Fgf2 act as dorsal- and ventral-mediated signals, respectively: their production is regulated by dorsal or ventral cells, and the presence of both signals is required to induce Shh expression. Given these observations, we now think our conclusion can be explained without using the confusing term “positional cue”. Because the distinction between “positional cue” and “positional information” could be confusing, as noted by the reviewers, we rewrote our manuscript without using “positional cue.”

      Reviewer #1 (Recommendations for the authors):

      (1) Line 61: More explanation for what a double-half limb means is needed.

      We thank the reviewer for this suggestion. We have revised the manuscript (Line 73‒76). Specifically, we now explain that a double-dorsal limb, for example, is a chimeric limb generated by excising the ventral half and replacing it with a dorsal half from the contralateral limb while preserving the anteroposterior orientation.

      (2) Line 63-65: "Such blastemas form hypomorphic, spike-like structures or fail to regenerate entirely." This statement does not represent the breadth of work on the APDV axis in limb regeneration. The cited Bryant 1976 reference tested only double-posterior and double-anterior newt limbs, demonstrating the importance of disposition along the AP axis, not DV. Others have shown that the regeneration of double-half limbs depends upon the age of the animal and the length of time between the grafting of double-half limbs and amputation. Also, some double-dorsal or double-ventral limbs will regenerate complete AP axes with symmetrical DV duplications (Burton, Holder, and Jesani, 1986). Also, sometimes half dorsal stylopods regenerate half dorsal and half ventral, or regenerate only half ventral, suggesting there are no inductive cues across the DV axis as there are along the AP axis. Considering this is the basis of the study under question, more is needed to convince that the DV axis is necessary for the generation of the AP axis.

      We thank the reviewer for this detailed and constructive comment. We acknowledge that previous studies have reported a range of outcomes for double-half limbs. For example, Burton et al. (1986) described regeneration defects in double-dorsal (DD) and double-ventral (VV) limbs, although limb patterning did occur in some cases (Burton et al., 1986, Table 1). As the reviewer notes, regenerative outcomes depend on variables such as animal age and the interval between construction of the double-half limb and amputation, sometimes called the effect of healing time (Tank and Holder, 1978). Moreover, variability has been reported not only in DD/VV limbs but also in double-anterior (AA) and double-posterior (PP) limbs (e.g., Bryant, 1976; Bryant and Baca, 1978; Burton et al., 1986). In the revised manuscript, we have therefore modified the statement to avoid over-generalization and to emphasize that regeneration can be incomplete under these conditions (Line 76‒82). Importantly, in order to provide the additional evidence requested and to directly re-evaluate whether dorsal and ventral cells are required for limb patterning, we performed the ALM experiments shown in Fig. 1. The ALM system allows us to assess this question in a binary manner (regeneration vs. non-regeneration), thereby strengthening the rationale for our conclusions regarding the necessity of the APDV orientations. We also revised a sentence at the beginning of the Results section to emphasize this point (Line 139‒140).

      (3) Line 71: "These findings suggest that specific signals from all four positional domains must be integrated for successful limb patterning, such that the absence of any one of them leads to failure." I was under the impression that half posterior limbs can grow all elements, but half anterior can only grow anterior elements.

      We thank the reviewer for this helpful clarification. As summarized by Stocum, half-limb experiments show that while some digit formation can occur, limb patterning remains incomplete in both anterior-half and posterior-half limbs in some cases (Stocum, 2017). We see this point as closely related to the broader question of whether proper limb patterning requires the integration of signals from all four positional domains. As noted in our response above, our ALM experiments in Fig. 1 were designed to test this point directly, and our data support the interpretation that cells from all four orientations are necessary for correct limb patterning.

      (4) Line 79-81: This is stated later in lines 98-105. I suggest expanding here or removing it here.

      We thank the reviewer for this suggestion. In the original version, lines 79–81 introduced our use of the terms “positional cue” and “positional information,” and this content partially overlapped with what later appeared in lines 98–105. In the revised manuscript, we have substantially rewritten this section (Line 82‒84), including the sentences corresponding to lines 79–81 in the original version, to remove the term “positional cue,” as explained in our response to the Editor’s comment (4); our revision reflects new analyses indicating that Wnt10b and Fgf2 appear not to be strictly restricted to dorsal or ventral cell populations, and we now describe these factors as dorsal- or ventral-mediated signals that act across dorsoventral domains to induce Shh expression. Accordingly, we no longer maintain the original use of “positional cue” and “positional information.”

      (5) Line 92 - 93: "Similarly, an ALM blastema can be induced in a position-specific manner along the limb axes. In this case, the induced ALM blastema will lack cells from the opposite side." This sentence is difficult to follow. Isn't it the same thing stated in lines 88-90?

      We thank the reviewer for this comment. We revised the sentence to improve readability and to avoid redundancy with original Lines 88–90 (Line 104‒106).

      (6) Line 107: I think the appropriate reference is McCusker et al., 2014 (Position-specific induction of ectopic limbs in non-regenerating blastemas on axolotl forelimbs), although Vieira et al., 2019 can be included here. In addition, Ludolph et al 1990 should be cited.

      We thank the reviewer for this suggestion. We have added McCusker et al. (2014) and Ludolph et al. (1990) as references in the revised manuscript (Line 120‒121).

      (7) Line 107-109: A missing point is how the ventral information is established in the amniote limb. From what I remember, it is the expression of Engrailed 1, which inhibits the ventral expression of Wnt7a, and hence Lmx1b. This would suggest that there is no secreted ventral cue. This is a relatively large omission in the manuscript.

      We thank the reviewer for this comment. We agree that ventral fate in amniotes is specified by En1 in the ventral ectoderm, which represses Wnt7a and thereby prevents induction of Lmx1b; accordingly, a secreted ventral morphogen analogous to dorsal Wnt7a has not been established. We added this point to the revised Introduction (Line 61‒64).

      By contrast, in axolotl limb regeneration, our previous work on Lmx1b expression suggests that DV identities reflect the original positional identity rather than being re-specified during regeneration (Yamamoto et al., 2022). Within this framework, our original use of the term “ventral positional cue” does not imply a ventral patterning morphogen in the amniote sense; rather, it denotes downstream signals induced by cells bearing ventral identity that are required for the blastema to form a patterned limb. This interpretation is consistent with classic studies on double-half chimeras and ectopic contacts between opposite regions (Iten & Bryant, 1975; Bryant & Iten, 1976; Maden, 1980; Stocum, 1982) as well as with our ALM data (Fig. 1). For this reason, we intentionally used the term “positional cues” to refer to signals provided by cells bearing ventral identity, which can be considered separable from the DV patterning mechanism itself, in the original text. As explained in our response to the Editor’s comment (4), we describe these signals as “signals mediated by dorsal/ventral cells,” rather than “positional cues” in the revised manuscript.

      The necessity of dorsal- and ventral-mediated signals is supported by classic studies on the double-half experiment. In the non-regenerating cases, structural patterns along the anteroposterior axis appear to be lost even though both anterior and posterior cells should, in principle, be present in a blastema induced from a double-dorsal or double-ventral limb. In limb development of amniotes, Wnt7a/Lmx1b or En1 mutants show that limbs can exhibit anteroposterior patterning even when tissues are dorsalized or ventralized, that is, in the relative absence of ventral or dorsal cells, respectively (Riddle et al., 1995; Chen et al., 1998; Loomis et al., 1996). Taken together, axolotl limb regeneration, in which the presence of both dorsal and ventral cells plays a role in anteroposterior patterning, should differ from other model organisms. It is therefore reasonable to predict the existence of dorsal- and ventral-mediated signals in axolotl limb regeneration. We included this point in the revised manuscript (Line 82‒89). However, there is no evidence that these signals are secreted molecules. For this reason, we have carefully used the term “dorsal-/ventral-mediated signals” in the Introduction without implying secretion.

      (8) Introduction - In general, the argument is a bit misleading. It is written as if it is known that a ventral cue is necessary, but the evidence from other animal models is lacking, from what I know. I may be wrong, but further argument would strengthen the reasoning for the study.

      We thank the reviewer for this thoughtful comment. We agree that it should not read as if it is known that a ventral cue is necessary. In the revised Introduction, we have addressed this in several ways. First, as described in our response to comment (7), we now explicitly note that in amniote limb development ventral identity is specified by En1-mediated repression of Wnt7a, and that a secreted ventral morphogen equivalent to dorsal Wnt7a has not been established. Second, we removed the term “positional cue” and no longer present “ventral positional cue” as a defined entity. Instead, we use mechanistic phrasing such as “signals mediated by ventral cells” and “signals mediated by dorsal cells,” which does not assume that such signals are secreted morphogens or universally conserved. Third, we have reframed the role of dorsal- and ventral-mediated signals as a working hypothesis specific to axolotl limb regeneration, rather than as a general conclusion across model systems.

      (9) Line 129: Remove "As mentioned before".

      We thank the reviewer for this suggestion. We have removed the phrase “As mentioned before” in the revised manuscript (Line 143).

      (10) Figure 1: Are Lmx1, Fgf8, and Shh mutually exclusive? Multiplexed FISH would provide this information, and is a relatively important question considering the strong claims in the study.

      We thank the reviewer for raising this important point. As noted in our response to the editor’s comment, we cannot currently ensure sufficiently high detection sensitivity with multiplex FISH in our laboratory. However, based on previous reports (Nacu et al., 2016), Fgf8 and Shh should be mutually exclusive. In contrast, our analysis suggests that Lmx1b expression is not mutually exclusive with either Fgf8 or Shh, at least in terms of their expression domains. To confirm this, we analyzed the published scRNA-seq data, and the results were added to Supplemental Figure 6. Fgf8 and Shh were expressed in both Lmx1b-positive and Lmx1b-negative cells (Fig. S6H, I), but Fgf8 and Shh themselves were mutually exclusive (Fig. S6M). This point is now included in the revised manuscript (Line 314‒317).

      (11) Results section and Figure 2: More evidence is needed for the lack of Shh expression ISH in tissue sections. Demonstrating the absence of something needs some qPCR or other validation to make such a claim.

      We thank the reviewer for this suggestion. We performed qRT-PCR on ALM blastemas to complement the ISH data (Fig. 2).

      (12) Line 179: I think they are likely leucistic d/d animals and not wild-type animals based upon the images.

      We thank the reviewer for this observation. In the revised manuscript, we have corrected the description to “leucistic animals” (Line 194).

      (13) Line 183-186: I'm a bit confused about this interpretation. If Shh turns on in just a posterior blastema, wouldn't it turn on in a grafted posterior tissue into a dorsal or ventral region? Isn't this independent of environment, meaning Shh turns on if the cells are posterior, regardless of environment?

      Our interpretation is that only posterior-derived cells possess the competency to express Shh. In other words, whether a cell is capable of expressing Shh depends on its original positional identity (Iwata et al., 2020), but whether it actually expresses Shh depends on the environment in which the cell is placed. The results of Fig. 3E and G indicate that Shh activation is dependent on environment and that the posterior identity is not sufficient to activate Shh expression. We have revised the manuscript to emphasize this distinction more clearly (Line 198‒203).

      (14) Figure 4: Do the limbs have an elbow, or is it just a hand?

      We thank the reviewer for this thoughtful question. From the appearance, an elbow-like structure can occasionally be seen; however, we did not examine the skeletal pattern in detail because all regenerated limbs used for this analysis were sectioned for the purpose of symmetry evaluation, and we therefore cannot state this conclusively. While this is indeed an important point, analyzing proximodistal patterning would require a very large number of additional experiments, which falls outside the main focus of the present study. For this reason, and also to minimize animal use in accordance with ethical considerations, we did not pursue further experiments here. In response to this point, we have added a description of the skeletal morphology of ectopic limbs induced by BMP2+FGF2+FGF8 bead implantation (Fig. 6). In these experiments, multiple ectopic limbs were induced along the same host limb. In most cases, these ectopic limbs did not show fusion with the proximal host skeleton, similar to standard ALM-induced limbs, although in one case we observed fusion at the stylopod level. We now note this observation in the revised manuscript (Line 347‒354).

      We regard the relationship between APDV positional information and proximodistal patterning as an important subject for future investigation.

      (15) Line 203 - 237: I appreciate the symmetry score to estimate the DV axis. Are there landmarks that would better suggest a double-dorsal or double-ventral phenotype, like was done in the original double-half limb papers?

      We thank the reviewer for this thoughtful comment. In most cases, the limbs induced by the ALM exhibit abnormal and highly variable morphologies compared to normal limbs, making it difficult to apply consistent morphological landmarks as used in the original double-half limb studies. For this reason, we focused our analysis on “morphological symmetry” as a quantitative measure of DV axis patterning, and we have added this explanation to the manuscript (Line 232‒235). Additionally, we provided transverse sections along the proximodistal axis as supplemental figures (Figs. S2 and S4). In addition to reporting the symmetry score, we have explicitly stated in the text that symmetry was also assessed by visual inspection of these sections.

      (16) Line 245-247: The experiment was done using bulk sequencing, so both the epithelium and mesenchyme were included in the sample. The posterior (Shh) and anterior (Fgf8) patterning cues are mesenchymally expressed. In amniotes, the dorsal cue has been thought to be Wnt7a from the epithelium. Can ISH, FISH, or previous scRNAseq data be used to identify genes expressed in the mesenchyme versus epithelium? This is very important if the authors want to make the claim for defining "The molecular basis of the dorsal and ventral positional cues" as was stated by the authors.

      We thank the reviewer for highlighting this important point. As the reviewer notes, our bulk RNA-seq data do not distinguish between epithelial and mesenchymal expression domains. As noted in our response to the editor’s comment, we performed ISH and qPCR on regular blastemas. However, these approaches did not provide definitive information regarding the specific cell types expressing Wnt10b and Fgf2. To complement this, we re-analyzed publicly available single-cell RNA-seq data (from Li et al., 2021). In this reanalysis, Fgf2 was expressed mainly by mesenchymal cells, and Wnt10b expression was observed in both mesenchymal and epithelial cells. These results are now included in the revised manuscript (Line 294‒321) and in supplemental figures (Fig. S6, S7).

      (17) Was engrailed 1, lmx1b, or Wnt7a differentially expressed along the DV axis, suggesting similar signaling between? Are these expressed in mesenchyme? Previous work suggests Wnt7a is expressed throughout the mesenchyme, but publicly available scRNAseq suggests that it is expressed in the epithelium.

      We thank the reviewer for this important comment. As noted, the reported expression patterns of DV-related genes are not consistent across studies, which likely reflects the technical difficulty of detecting these genes with high sensitivity. In our own experiments, expression of DV markers other than Lmx1b has been very weak or unclear by ISH. Whether these genes are expressed in the epithelium or mesenchyme also appears to vary depending on the detection method used. In our RNA-seq dataset, Wnt7a expression was detected at very low levels and showed no significant difference along the DV axis, while En1 expression was nearly absent. We have clarified these results in the revised manuscript (Line 437‒441). Our reanalysis of the published scRNA-seq likewise detected Wnt7a in only a very small fraction of cells. Accordingly, we consider it premature to reach a definitive conclusion—such as whether Wnt7a is broadly mesenchymal or restricted to epithelium—as suggested in prior reports. We also note that whether Wnt7a is epithelial or mesenchymal does not affect the conclusions or arguments of the present study. Although the roles of Wnt7a and En1 in axolotl DV patterning are certainly important, we feel that drawing a definitive conclusion on this issue lies beyond the scope of the present study, and we have therefore limited our description to a straightforward presentation of the data.

      (18) Line 247-249: The sentence suggests that all the ligands were tried. This should be included in the supplemental data.

      We thank the reviewer for this clarification. In fact, we tested only Wnt4, Wnt10b, Fgf2, Fgf7, and Tgfb2, and all of these results are presented in the figures. To avoid misunderstanding, we have revised the text to explicitly state that our analysis focused on these five genes (Line 272‒274).

      (19) Line 249: An n =3 seems low and qPCR would be a more sensitive means of measuring gene induction compared to ISH. The ISH would confirm the qPCR results. Figure 5C is also not the most convincing image of Shh induction without support from a secondary method.

      We have increased the sample size for these experiments (Line 277‒280). To complement the ISH results, we also confirmed Shh induction by qPCR following electroporation of Wnt10b and Fgf2 (Fig. 5D, E). Furthermore, because the Shh signal in the Wnt10b-electroporated VentBL images was particularly weak and difficult to discern, we replaced that panel with a representative example in which the Shh signal is more clearly visible. These data are now included in the revised manuscript (Line 280‒282).

      (20) Line 253: It is confusing why Wnt10b, but not Wnt4 would work? As far as I know, both are canonical Wnt ligands. Was Wnt7a identified as expressed in the RNAseq, but not dorsally localized? Would electroporation of Wnt7a do the same thing as Wnt10b and hence have the same dorsalizing patterning mechanisms as amniotes?

      We thank the reviewer for raising this challenging but important question. Wnt10b was identified directly from our bulk RNA-seq analysis, as was Wnt4. The difference in the ability of Wnt10b and Wnt4 to induce Shh expression in VentBL may reflect differences in how these ligands activate downstream WNT signaling programs. WNT10B is a potent activator of the canonical WNT/β-catenin pathway (Bennett et al., 2005), although WNT10B has also been reported to trigger a β-catenin–independent pathway (Lin et al., 2021). By contrast, WNT4 can signal through both canonical and non-canonical (β-catenin–independent) pathways, and the balance between these outputs is known to depend on cellular context (Li et al., 2013; Li et al., 2019). Consistent with a requirement for canonical WNT signaling, we found that pharmacological activation of canonical WNT signaling with BIO (a GSK3 inhibitor) was also sufficient to induce Shh expression in VentBL. Nevertheless, it remains unclear why Wnt10b, but not Wnt4, was able to induce Shh under our experimental conditions. One possible explanation is that different WNT ligands can engage the same receptors (e.g., Frizzled/LRP6) yet drive distinct downstream transcriptional programs, depending on the state of the responding cells, resulting in ligand-specific outputs (Voss et al., 2025). This point is now included in the revised discussion section (Line 402‒412). At present, we cannot distinguish between these possibilities experimentally, and we therefore refrain from making a stronger mechanistic claim.

      With respect to Wnt7a, we detected Wnt7a expression at very low levels, and without a clear dorsoventral bias, in our RNA-seq analysis of ALM blastemas (we describe this point in Line 437‒440). This is consistent with previous work suggesting that axolotl Wnt7a is not restricted to the dorsal region in regeneration. Because of this low and unbiased expression, and because our data already implicated Wnt10b as a dorsal-mediated signal that can act across dorsoventral domains to permit Shh induction, we did not prioritize Wnt7a electroporation in the present study. We therefore cannot conclude whether Wnt7a would behave similarly to Wnt10b in this context.

      Importantly, these uncertainties about ligand-specific mechanisms do not alter our main conclusion. Our data support the idea that a dorsal-mediated WNT signal (represented here by WNT10B and canonical WNT activation) and a ventral-mediated FGF signal (FGF2) must act together to permit Shh induction, and that the coexistence of these dorsal- and ventral-mediated signals is required for patterned limb formation in axolotl limb regeneration.

      (21) Is canonical Wnt signaling induced after electroporation of Wnt10b or Wnt4? qPCR of Lef1 and axin is the most common way of showing this.

      We thank the reviewer for this helpful suggestion. In addition to examining Shh expression, we assessed canonical WNT signaling by qPCR analysis of Axin2 and Lef1 following Wnt10b electroporation. These data are now included in Fig. 5.

      (22) Line 255-256: qPCR was presented for Figure 5D, but ISH was used for everything else. Is there a technical reason that just qPCR was used for the bead experiments?

      We thank the reviewer for this helpful comment. In the original submission, our goal was to test whether treatment with commercial FGF2 protein or BIO could reproduce the results obtained by electroporation. In the revised manuscript, to avoid confusion between distinct experimental aims, we removed the FGF2–bead data from this section and instead used RT-qPCR to quantitatively corroborate Shh induction after electroporation (Fig. 5D–E). RT-qPCR provided a sensitive, whole-blastema readout and allowed a paired design (left limb: factor; right limb: GFP control) that increased statistical power while minimizing animal use. To address the reviewer’s point more directly, we additionally performed ISH for the BIO treatment and now include those results in Supplementary Figure 3 (Line 287‒288).

      (23) Line 261-263: The authors did not show where Wnt10B or Fgf2 is expressed in the limb as claimed. The RNAseq was bulk, so ISH of these genes is needed to make this claim. Where are Wnt10b and Fgf2 expressed in the amputated limb? Do they show a dorsal (Wnt10b) and ventral (Fgf2) expression pattern?

      We thank the reviewer for raising this important point. As noted in our response to the editor’s comment, we performed ISH on serial sections of regular blastemas at several time points (Fig. S5A). However, the expression patterns of Wnt10b and Fgf2 along the dorsoventral axis were not clear. To complement the ISH results, we performed RT-qPCR on microdissected dorsal and ventral halves of regular blastemas at the MB stage (Fig. S5B). We found that Wnt10b and Fgf2 were expressed at significantly higher levels in the dorsal and ventral halves, respectively, compared to the opposite half. This dorsal/ventral biased expression of Wnt10b/Fgf2 is consistent with our RNA-seq data. To identify the cell types expressing Wnt10b or Fgf2, we analyzed published single-cell RNA-seq data (7 dpa blastema (MB), Li et al., 2021). In this analysis, Fgf2 expression was observed in the mesenchymal cluster, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. The apparent low abundance likely contributes to the weak ISH signals and reflects current technical limitations. In addition, Wnt10b and Fgf2 expression did not follow Lmx1b expression (Fig. S6J, K), and Wnt10b and Fgf2 themselves were not exclusive (Fig. S6L). Together with the RT-qPCR data (Fig. S5B), these results indicate that Wnt10b and Fgf2 show dorsoventral bias when assessed in bulk tissue but are not confined to purely dorsal or ventral cells at the single-cell level; in other words, Wnt10b/Fgf2 expression is not dorsal-/ventral-specific but is mediated by dorsal/ventral cells. Defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will therefore be an important goal for future work. These points are now included in the revised manuscript (Line 485‒501).

      (24) Line 266-288: The formation of multiple limbs is impressive. Do these new limbs correspond to the PD location they are generated?

      We thank the reviewer for this interesting question. From our observations, there does appear to be a tendency for the induced limbs to vary in length depending on their PD location. The skeletal patterns of the induced multiple limbs are now included in Fig. 6. However, as noted earlier, the supernumerary limbs exhibit highly variable morphologies, and a rigorous analysis of PD correlation would require a large number of induced limbs. Since this lies outside the main focus of the present study, we have not pursued this point further in the manuscript.

      (25) Line 288: The minimal requirement for claiming that the molecular basis for DV signaling was identified is to perform ISH or multiplexed FISH for Wnt10b and Fgf2 in amputated limb blastemas to show whether they are expressed in the mesenchyme or epithelium and that they are dorsally and ventrally expressed, respectively. In addition, the current understanding of DV patterning through Wnt7a, Lmx1b, and En1 would need to be shown not to be important in this model.

      We thank the reviewer for this comment and fully agree with the point raised. We would like to clarify that we are not claiming to have identified the molecular basis of DV patterning. As the reviewer notes, molecules such as Lmx1b, Wnt7a, and En1 are well established in other animal models as key regulators of DV positional identity. There is no doubt that these molecules play central roles in DV patterning. However, in axolotl limb regeneration, clear DV-specific expression has not been demonstrated for these genes except for Lmx1b. Therefore, further studies will be required to elucidate the molecular basis of DV patterning in axolotls.

      Our focus here is more limited: we aim to identify the molecular basis of the mechanisms by which positional domain-mediated signals (FGF8, SHH, WNT10B, and FGF2) regulate the limb patterning process, rather than the molecular basis of DV patterning. In fact, our results on Wnt10b and Fgf2 suggest that these genes do not affect dorsoventral identities.

      We recognize that this distinction was not sufficiently clear in the original text, and we have revised the manuscript to describe DV patterning mechanisms in other animals and clarify that the dorsal- and ventral-mediated signals are distinct from DV patterning (Line 444‒450). At a minimum, we now avoid claiming that the molecular basis for DV signaling was identified.

      (26) Line 335: References are needed for this statement. From what I found, Wnt4 can be canonical or non-canonical.

      We thank the reviewer for this helpful comment. We have revised the manuscript (Line 404‒407): we added the relevant citations at this location and adjusted nearby wording to avoid implying pathway exclusivity, in alignment with our response to comment (20).

      (27) Line 337-338: The authors cannot claim "that canonical, but not non-canonical, WNT signaling contributes to Shh induction" as this was not thoroughly tested and is based upon the negative result that Wnt4 electroporation did not induce Shh expression.

      We thank the reviewer for this important clarification. We agree that our data do not allow us to conclude that non-canonical WNT signaling in general does not contribute to Shh induction. Accordingly, we have removed the phrase “but not non-canonical” and revised the text to emphasize that, within the scope of our experiments, Shh induction was not observed following Wnt4 electroporation, whereas it was observed with Wnt10b.

      (28) Line 345: Claiming that "WNT10B via the canonical WNT pathway...appears to regulate Shh expression" requires at least qPCR to show that WNT10B induces canonical signaling.

      We thank the reviewer for this comment. As noted in our response to comment (21), we also assessed canonical WNT signaling by qPCR analysis of Axin2 and Lef1 following Wnt10b electroporation (Line 282‒285).

      (29) Lines 361-372: A few studies have been performed on DV patterning of the mouse digit regeneration in regards to Lmx1b and En1. It may be good to discuss how the current study aligns with these findings.

      We appreciate the reviewer’s suggestion. As the reviewer notes, several studies have been performed on dorsoventral (DV) patterning in mouse digit tip regeneration in relation to Lmx1b and En1 (e.g., Johnson et al., 2022; Castilla-Ibeas et al., 2023). In the present study, however, our main conclusion differs in scope from those of studies on mouse digit tip regeneration. We show that, in the axolotl, pre-existing dorsal and ventral identities (as reflected by dorsally derived and ventrally derived cells in the ALM blastema) are required together to induce Shh expression, and that this Shh induction in turn supports anteroposterior interaction at the limb level. This mechanism (dorsal-mediated and ventral-mediated signals acting in combination to permit Shh expression) does not have a clear direct counterpart in the mouse digit tip literature. Moreover, even with respect to Lmx1b, the two systems behave differently. In mouse digit tip regeneration, loss of Lmx1b during regeneration does not grossly affect DV morphology of the regenerate (Johnson et al., 2022). By contrast, in our axolotl ALM system, the presence or absence of Lmx1b-positive dorsal tissue correlates with the final dorsoventral organization of the induced limb-like structures (e.g., production of double-dorsal or double-ventral symmetric structures in the absence of appropriate dorsoventral contact). Thus, the role of dorsoventral identity in our model is directly tied to patterned limb outgrowth at the whole-limb scale, whereas in the mouse digit tip it has been reported primarily in the context of digit tip regrowth and bone regeneration competence, not robust DV repatterning (Johnson et al., 2022).

      For these reasons, we believe that an extended discussion of mouse digit tip regeneration would risk implying a mechanistic equivalence between axolotl limb regeneration and mouse digit tip regeneration that is not supported by current data. Because the regenerative contexts differ, and because Lmx1b does not appear to re-establish DV patterning in the mouse regenerates (Johnson et al., 2022), we have chosen not to include an explicit discussion of mouse digit tip regeneration in the main text.

      (30) Line 408-433: Although I appreciate generating a model, this section takes some liberties to tell a narrative that is not entirely supported by previous literature or this study. For example, lines 415-416 state "Wnt10b and Fgf2 are expressed at higher levels in dorsal and the ventral blastemal cells, respectively" which were not shown in the study or other studies.

      We thank the reviewer for this important comment. We agree that the original model based on RNA-seq data overstated the evidence. To address this point experimentally, we examined Wnt10b and Fgf2 expression in regular blastemas (Supplemental Figures 5 and 6). Accordingly, our model is now framed as an inductive mechanism for Shh expression, supported by results in ALM (WNT10B in VentBL; FGF2 in DorBL) and by DV-biased expression. Concretely, the sentence previously paraphrased as “Wnt10b and Fgf2 are expressed at higher levels in dorsal and ventral blastemal cells, respectively” has been replaced with wording that (i) avoids single-cell DV specificity and (ii) emphasizes dorsal-/ventral-mediated regulation and the requirement for both signals to allow Shh induction (Line 510‒511).

      Reviewer #2 (Recommendations for the authors):

      (1) Introduction:

      The authors' definitions of positional cues vs positional information are a little hard to follow, and do not appear to be completely accurate. From my understanding of what the authors explain, "positional information" is defined as a signal that generates positional identities in the regenerating tissue. This is a somewhat different definition than what I previously understood, which is the intrinsic (likely epigenetic) cellular identity associated with specific positional coordinates. On the other hand, the authors define "positional cues" as signals that help organize the cells according to the different axes, but don't actually generate positional identities in the regenerating cells. The authors provide two examples: Wnt7a as an example of positional information, and FGF8 as a positional cue. I think that according to the authors' definitions, FGF8 (and probably Shh) are bona fide positional cues, since both signals work together to organize the regenerating limb cells - yet do not generate positional identities, because ectopic limbs formed from blastemas where these pathways have been activated do not regenerate (Nacu et al 2016). However, I am not sure Wnt7a constitutes an example of a "positional information" signal, since as far as I know, it has not been shown to generate stable dorsal limb identities (that remain after the signal has stopped) - at least yet. If it has, the authors should cite the paper that showed this. I think that some sort of diagram to help define these visually will be really helpful, especially to people who do not study regenerative patterning.

      We thank the reviewer for this thoughtful comment. We now agree with the reviewer that our use of “positional cue” and “positional information” may have been confusing. In the revision—and as noted in our response to the Editor’s comment (4)—we have removed the term “positional cue” and no longer attempt to contrast it with “positional information.” Instead, we adopt phrasing that reflects our data and hypothesis: during limb patterning, dorsal-mediated signals act on ventral cells and ventral-mediated signals act on dorsal cells to induce Shh expression. This wording avoids implying that these signals specify dorsoventral identity.

      Regarding WNT7A, we agree it has not been shown to generate a stable dorsal identity after signal withdrawal. In the revised Introduction we therefore describe WNT7A in amniote limb development as an extracellular regulator that induces Lmx1b in dorsal mesenchyme (with En1 repressing Wnt7a ventrally), rather than labeling it as “positional information” in a strict, identity-imprinting sense. We highlight this contrast because, in our axolotl experiments, WNT10B and FGF2 did not alter Lmx1b expression or dorsal–ventral limb characteristics when overexpressed, consistent with the idea that they act downstream of DV identity to enable Shh induction, not to establish DV identity.

      (2) Results:

      It would be helpful if the number of replicates per sample group were reported in the figure legends.

      We thank the reviewer for this suggestion. In accordance with the comment, we have added the number of replicates (n) for each sample group in the figure legends.

      Figure 2 shows ISH for A/P and D/V transcripts in different-positioned blastemas without tissue grafts. The images show interesting patterns, including the lack of Shh expression in all blastemas except in posterior-located blastemas, and localization of the dorsal transcript (Lmx1b) to the dorsal half of A or P located blastemas. My only concern about this data is that the expression patterns are described in only a small part of the ectopic blastema (how representative is it?) and the diagrams infer that these expression patterns are reflective of the entire blastema, which can't be determined by the limited field of view. It is okay if the expression patterns are not present in the entire blastema -in fact, that might be an important observation in terms of who is generating (and might be receiving) these signals.

      We thank the reviewer for this insightful comment. Because Fgf8 and Shh expression was detectable only in a limited subset of cells, the original submission included only high-magnification images. In response to the reviewer’s valid concern about representativeness, we have now added low-magnification overviews of the entire blastema as a supplemental figure (Fig. S1) and clarified in the figure legend that these expression patterns can be focal rather than pan-blastemal (Line 795‒796).

      In Figure 3, they look at all of these expression patterns in the grafted blastemas, showing that Shh expression is only visible when both D and V cells are present in the blastema. My only concern about this data is that the number of replicates is very low (some groups having only an N=3), and it is unclear how many sections the authors visualized for each replicate. This is especially important for the sample groups where they report no Shh expression -I agree that it is not observable in the single example sections they provide, but it is uncertain what is happening in other regions of the blastema.

      We thank the reviewer for this important comment. To increase the reliability of the results, we have increased the number of biological replicates in groups where n was previously low. For all samples, we collected serial sections spanning the entire blastema. For blastemas in which Shh expression was observed, we present representative sections showing the signal. For blastemas without detectable Shh expression, we selected a section from the central region that contains GFP-positive cells for the figure. To make these points explicit, we have added a clarification to the Fig. 3 legend (Line 811‒815).

      Figure 4: Shh overexpression in A/P/D/V blastemas - expression induces ectopic limbs in A/D/V locations. They analyzed the symmetry of these regenerates (assuming that D- and V-located blastemas will exhibit D/V symmetry because they only contain cells from one side of that axis). I am a little concerned about how the symmetry assay is performed, since oblique sections through the digits could look asymmetric, while they are actually symmetric. It is also unclear how the angle of the boxes that the symmetry scores were based on was decided - I imagine that the score would change depending on the angle. It also appears that the authors picked different digits to perform this analysis on the different sample groups. I also admit that the logic of the AI-based classification scheme that the authors used to perform their symmetry scoring analysis (both in Figures 4 and 5) is elusive to me. I think it would have been more informative if the authors had leveraged structural landmarks, like the localization of specific muscle groups (if this experiment were performed in WT animals, the authors could have used pigment cell localization), or generated more proximal sections to look at landmarks in the zeugopod.

      We thank the reviewer for these detailed comments regarding the symmetry analysis. Because reliance on a computed symmetry score alone could raise the concerns noted by the reviewer, we now provide transverse sections along the proximodistal axis as supplemental figures (Figs. S2 and S4). These include levels corresponding to the distal end of the zeugopod and the proximal end of the autopod. In addition to reporting the symmetry score, we have explicitly stated in the text that symmetry was also assessed by visual inspection of these sections.

      As also noted in our response to Reviewer #1 (comment 15), ALM-induced limbs frequently exhibit abnormal and highly variable morphologies, which makes it difficult to use consistent anatomical landmarks such as particular digits or muscle groups. For this reason, we focused our analysis on morphological symmetry rather than landmark-based metrics, and we emphasize this rationale in the revised text (Line 232‒235).

      Regarding the use of bounding boxes, this procedure was chosen to minimize the effects of curvature or fixation-induced distortion. For each section, the box angle was adjusted so that the outer contour (epidermal surface) was aligned symmetrically; this procedure was applied uniformly across all conditions to avoid bias. We analyzed multiple biological replicates in each group, which helps mitigate potential artifacts due to oblique sectioning. To further reduce bias, we increased the number of fields included in the analysis to n = 24 per group in the revised version.

      In addition, staining intensity varied among samples, such that a region identified as “muscle” in one sample could be assigned differently in another if classification were based solely on color. To avoid this problem, we used a machine-learning classifier trained separately for each sample, allowing us to group the same tissues consistently within that sample irrespective of intensity differences. In the context of ALM-induced limbs, where stable anatomical landmarks are not available, we consider this strategy the most appropriate. We have added this rationale to the revised manuscript for clarity (Line 239‒247).

      Figure 5: The number of replicates in sample groups is relatively low and is quite variable between groups (ranging between 3 and 7 replicates). The zoomed-in window used to visualize Shh expression is small relative to the blastema, and it is difficult to discern why the authors positioned the window where they did, and how they maintained consistency among their different sample groups. In the examples of positive Shh expression, the signal is low and hard to see. Validating these expression patterns using a quantitative transcriptional assay (like qRT-PCR) would increase the rigor of this experiment, especially given that they would be able to analyze gene expression in the entire blastema as opposed to sections that might not capture localized expression.

      We thank the reviewer for this important comment. To increase the rigor of these experiments, we have increased the number of biological replicates in groups where n was previously low. In addition, because Shh signal in the Wnt10b-electroporated VentBL images was particularly weak and difficult to discern, we replaced that panel with a representative example in which Shh signal is more clearly visible. We also validated the Shh expression for Wnt10b–electroporated VentBL and Fgf2–electroporated DorBL by RT-qPCR, which assesses gene expression across the entire blastema. These results are now included in Fig. 5 and Line 280‒282. Finally, we clarified in the figure legend how the “window” for imaging was chosen: for samples with detectable Shh expression, the window was placed in the region where the signal was observed; for conditions without detectable Shh expression, the window was positioned in a comparable region containing GFP-positive cells (Line 836‒839). These revisions are included in the revised manuscript.

      Figure 6: They treat dorsal and ventral wounds with gelatin beads soaked in a combination of BMP2+FGF8 (nerve factors) and FGF2 (proposed ventral factor). Remarkably, they observe ectopic limb formation in only dorsal wounds, further supporting the idea that FGF2 provides the "ventral" signal. They show examples of this impressive phenotype on limbs with multiple ectopic structures that formed along the Pr/Di axis. Including images of tubulin staining (as they have in Figures 1 and 2) would help to confirm that the blastemas (or final regenerates) are devoid of nerves. The authors' whole-mount skeletal staining shows fusion of the ectopic humerus with the host humerus, a phenotype associated with deep wounding, which could provide an opportunity for more cellular contribution from different limb axes.

      We thank the reviewer for these constructive comments. As noted in the prior study, when beads are used to induce blastemas without surgical nerve orientation, fine nerve ingrowth can still occur (Makanae et al., 2014), and the induced blastemas are not completely devoid of nerves. While it is still uncertain whether these recruited nerves are functional after blastema induction, it is an important point, and we added sentences about this in the revised manuscript (Line 341‒345).

      Regarding the skeletal phenotype, despite careful implantation to avoid injuring deep tissues, bead-induced ectopic limbs on the dorsal side occasionally displayed fusion of the stylopod with the host humerus—a phenotype associated with deep wounding, as the reviewer notes. This observation suggests that contributions from a broader cellular population cannot be excluded. However, because fusion was observed in only 1 of 16 induced limbs analyzed, and because ectopic limbs induced at the forearm (zeugopod) level did not exhibit such fusion (n=1/6 for stylopod-level inductions; n=0/10 for zeugopod-level inductions), we believe that our main conclusion remains valid. Because fusion is not a typical outcome, we now present representative non-fusion cases—including zeugopod-origin examples—in the figure (Fig. 6L1, L2), and we report the fusion incidence explicitly in the text (Line 350‒354). We also note in the revised manuscript that stylopod fusion can occur in a minority of cases (Line 347‒349).

      Figure 7 nicely summarizes their findings and model for patterning.

      We thank the reviewer for this positive comment.

      The table is cut off in the PDF, so it cannot be evaluated at this time.

      In our copy of the PDF, the table appears in full, so this may have been a formatting issue. We have carefully checked the file and ensured that the table is completely included in the revised submission.

      There is a supplemental figure that doesn't seem to be referenced in the text.

      The supplemental figure (Fig. S1 of the original manuscript) is referenced in the text, but it may have been overlooked. To improve clarity, we have expanded the description in the manuscript so that the supplemental figure is more clearly referenced (Line 285‒291).

      (3) Materials and Methods:

      No power analysis was performed to calculate sample group sizes. The authors have used these experimental techniques in the past and could have easily used past data to inform these calculations.

      We thank the reviewer for this important comment. We did not include a power analysis in the manuscript because this was the first time we compared Shh and other gene expression levels among ALM blastemas of different positional origins using RT-qPCR in our experimental system. As we did not have prior knowledge of the expected variability under these specific conditions, it was difficult to predetermine appropriate sample sizes.

      Reviewer #3 (Recommendations for the authors):

      General:

      Congratulations - I found this an elegant and easy-to-read study with significant implications for the field! If possible, I would urge you to consider adding some more characterisation of Wnt10b and Fgf2- which cell types are they expressed in? If you can link your mechanisms to normal limb regeneration too (i.e., regenerating blastema, not ALM), this would significantly elevate the interest in your study.

      We sincerely thank the reviewer for these encouraging comments. As also noted in our response to the editor’s comment, we have analyzed the expression patterns of Wnt10b and Fgf2 in regular blastemas (Line 294‒306). Although clear specific expression patterns along the dorsoventral axis were not detected by ISH, likely due to technical limitations of sensitivity, RT-qPCR revealed significantly higher expression levels of Wnt10b in the dorsal half and Fgf2 in the ventral half of a regular blastema (Fig. S5). In addition, we analyzed published single-cell RNA-seq data (7 dpa blastema, Li et al., 2021) (Line 307‒321). As a result, Fgf2 expression was observed in the mesenchymal clusters, whereas Wnt10b expression was observed in both mesenchymal and epithelial clusters (Fig. S6). However, because only a small fraction of cells expressed Wnt10b, the principal cellular source of WNT10B protein remains unclear. Therefore, defining the precise spatial patterns of Wnt10b and Fgf2 in regular regeneration will be an important goal for future work.

      Data availability:

      I assume that the RNA-sequencing data will be deposited at a public repository.

      RNA-seq FASTQ files have been deposited in the DNA Data Bank of Japan (DDBJ; https://www.ddbj.nig.ac.jp/) under BioProject accession PRJDB38065. We have added a Data availability section to the revised manuscript.

      References

      Castilla-Ibeas, A., Zdral, S., Oberg, K. C., & Ros, M. A. (2024). The limb dorsoventral axis: Lmx1b’s role in development, pathology, evolution, and regeneration. Developmental Dynamics, 253(9), 798–814. https://doi.org/10.1002/dvdy.695

      Johnson, G. L., Glasser, M. B., Charles, J. F., Duryea, J., & Lehoczky, J. A. (2022). En1 and Lmx1b do not recapitulate embryonic dorsal-ventral limb patterning functions during mouse digit tip regeneration. Cell Reports, 41(8), 111701. https://doi.org/10.1016/j.celrep.2022.111701

      Stocum, D. (2017). Mechanisms of urodele limb regeneration. Regeneration, 4. https://doi.org/10.1002/reg2.92

      Tank, P. W., & Holder, N. (1978). The effect of healing time on the proximodistal organization of double-half forelimb regenerates in the axolotl, Ambystoma mexicanum. Developmental Biology, 66(1), 72–85. https://doi.org/10.1016/0012-1606(78)90274-9

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #3 (Public review):

      To summarize: The authors' overfilling hypothesis depends crucially on the premise that the very quickly reverting paired-pulse depression seen after unusually short rest intervals of << 50 ms is caused by depletion of release sites whereas Dobrunz and Stevens (1997) concluded that the cause was some other mechanism that does not involve depletion. The authors now include experiments where switching extracellular Ca2+ from 1.2 to 2.5 mM increases synaptic strength on average, but not by as much as at other synapse types. They contend that the result supports the depletion hypothesis. I didn't agree because the model used to generate the hypothesis had no room for any increase at all, and because a more granular analysis revealed a mixed population with a subset where: (a) synaptic strength increased by as much as at standard synapses; and yet (b) the quickly reverting depression for the subset was the same as the overall population.

      The authors raise the possibility of additional experiments, and I do think this could clarify things if they pre-treat with EGTA as I recommended initially. They've already shown they can do this routinely, and it would allow them to elegantly distinguish between pv and pocc explanations for both the increases in synaptic strength and the decreases in the paired pulse ratio upon switching Ca2+ to 2.5 mM. Plus/minus EGTA pre-treatment trials could be interleaved and done blind with minimal additional effort.

      Showing reversibility would be a great addition too, because, in our experience, this does not always happen in whole-cell recordings in ex-vivo tissue even when electrical properties do not change. If the goal is to show that L2/3 synapses are less sensitive to changes in Ca2+ compared to other synapse types - which is interesting but a bit off point - then I would additionally include a positive control, done by the same person with the same equipment, at one of those other synapse types using the same kind of presynaptic stimulation (i.e. ChRs).

      Specific points (quotations are from the Authors' rebuttal)

      (1) Regarding the Author response image 1, I was instead suggesting a plot of PPR in 1.2 mM Ca2+ versus the relative increase in synaptic strength in 2.5 versus in 1.2 mM. This continues to seem relevant.

      Complying with your suggestion, we studied the effects of external [Ca<sup>2+</sup>] ([Ca<sup>2+</sup>]<sub>o</sub>) after pre-incubating the slice in aCSF containing 50 μM EGTA-AM, and added the results as Figure 3—figure supplement 3C-D. Elevation of [Ca<sup>2+</sup>]<sub>o</sub> from 1.3 to 2.5 mM produced no significant change in either baseline EPSC amplitude or PPR, supporting the view that p<sub>v</sub> is already saturated at 1.3 mM [Ca<sup>2+</sup>]<sub>o</sub> and implying that the modest Ca<sup>2+</sup> dependence of baseline EPSCs and PPR in the absence of EGTA (Figure 3—figure supplement 3A-B) is mediated by the change in baseline vesicular occupancy of release sites (p<sub>occ</sub>) rather than fusion probability of docked vesicles (p<sub>v</sub>).

      We found some correlation of high Ca<sup>2+</sup>-induced relative increase in synaptic strength with the PPR at low Ca<sup>2+</sup> (Author response image 1-A). But this correlation was abolished by pre-incubating the slices in EGTA-AM too (Author response image 1-B). It should be noted that high PPR does not always mean low p<sub>v</sub>. For example, when the replenishment is equal between high and low baseline p<sub>occ</sub> synapses, the PPR would be higher at low p<sub>occ</sub> synapses than that at high p<sub>occ</sub> synapses, even if p<sub>v</sub> is close to unity. Therefore, high baseline release probability (Pr), whether it is attributed to high p<sub>v</sub> or high p<sub>occ</sub>, can result in low PPR, considering that Pr = p<sub>occ</sub> × p<sub>v</sub>.
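      The relationship Pr = p<sub>occ</sub> × p<sub>v</sub> and its consequence for PPR can be illustrated with a minimal two-pulse sketch. This is a toy depletion model written for illustration only, not the model used in the manuscript: it assumes that a hypothetical fraction `refill` of empty release sites is refilled between the two pulses, and it ignores facilitation. The function name and all parameter values are assumptions.

```python
def paired_pulse_ratio(p_occ: float, p_v: float, refill: float) -> float:
    """Toy two-pulse depletion model (illustrative assumption only).

    p_occ  : baseline fraction of release sites occupied by a vesicle
    p_v    : fusion probability of a docked vesicle
    refill : hypothetical fraction of empty sites refilled between pulses
    """
    first_release = p_occ * p_v                # Pr = p_occ * p_v
    occ_after = p_occ * (1.0 - p_v)            # sites still occupied after pulse 1
    occ_second = occ_after + refill * (1.0 - occ_after)  # partial replenishment
    second_release = occ_second * p_v
    return second_release / first_release


if __name__ == "__main__":
    # Same high p_v and same refill fraction, different baseline occupancy.
    for p_occ in (0.4, 0.9):
        ppr = paired_pulse_ratio(p_occ, p_v=0.95, refill=0.2)
        print(f"p_occ={p_occ:.1f}  Pr={p_occ * 0.95:.3f}  PPR={ppr:.3f}")
```

      Under these assumptions, the synapse with the lower baseline occupancy shows the higher PPR even though p<sub>v</sub> is the same and close to unity in both cases, and with `refill=0` the ratio reduces to 1 − p<sub>v</sub>.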

      As we have already mentioned in our previous letter, the relationship of PPR with refilling rate is complicated and can be bidirectional, whereas an increase in p<sub>v</sub> always results in a reduction of PPR. For example, PPR can be reduced by both a decrease and an increase in the refilling rate (Figure 2—figure supplement 1 and Lin et al., 2025). Therefore, the PPR analysis alone is insufficient to differentiate the contributions of p<sub>v</sub> and p<sub>occ</sub>. Thanks to your suggestion, we could resolve this ambiguity by the EGTA-AM pre-incubation study (Figure 3—figure supplement 3C-D).

      Author response image 1.

      Plot of PPR at low [Ca<sup>2+</sup>]<sub>o</sub> (1.3 mM) as a function of the baseline EPSC at high [Ca<sup>2+</sup>]<sub>o</sub> (2.5 mM) normalized to that at low [Ca<sup>2+</sup>]<sub>o</sub> measured at recurrent excitatory synapses in L2/3 of the prelimbic cortex under the conditions without EGTA-AM (A) and after pre-incubating the slices in EGTA-AM (50 μM) (B)

      (2) "Could you explain in detail why two-fold increase implies pv < 0.2?"

      (a) start with power((2.5/(1 + (2.5/K1) + 1/2.97)),4) = 2*power((1.3/(1 + (1.3/K1) + 1/2.97)),4);

      (b) solve for K1 (this turns out to be 0.48);

      (c) then implement the premise that pv -> 1.0 when Ca2+ is high by calculating Max = power((C/(1 + (C/K1) + 1/2.97)),4) where C is [Ca] -> infinity.

      (d) pv when [Ca] = 1.3 mM must then be power((1.3/(1 + (1.3/K1) + 1/2.97)),4)/Max, which is <0.2. Note that modern updates of Dodge and Rahamimoff typically include a parameter that prevents pv from approaching 1.0; this is the gamma parameter in the versions from Neher group.
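      Steps (a) to (d) can be checked numerically with a short sketch. It uses the fourth-power expression exactly as written in step (a); the function names are hypothetical, and the closed-form solution for K1 is obtained here by taking the fourth root of both sides of (a) and rearranging, which is an algebraic shortcut rather than anything stated in the review.

```python
MG_TERM = 1 / 2.97  # constant term appearing in the expression in step (a)


def dr_release(ca: float, k1: float) -> float:
    """Fourth-power expression from step (a): (C / (1 + C/K1 + 1/2.97))^4."""
    return (ca / (1 + ca / k1 + MG_TERM)) ** 4


def solve_k1(ca_lo: float = 1.3, ca_hi: float = 2.5, fold: float = 2.0) -> float:
    """Step (b): K1 such that release at ca_hi is `fold` times release at ca_lo."""
    r = fold ** 0.25          # fourth root removes the power of 4
    a = 1 + MG_TERM
    return (r - 1) * ca_lo * ca_hi / (a * (ca_hi - r * ca_lo))


def pv_at(ca: float, k1: float) -> float:
    """Steps (c)-(d): normalize by the limit for [Ca] -> infinity, which is K1^4."""
    max_release = k1 ** 4
    return dr_release(ca, k1) / max_release


if __name__ == "__main__":
    k1 = solve_k1()
    print(f"K1 = {k1:.3f}")
    print(f"fold change 2.5 vs 1.3 mM = {dr_release(2.5, k1) / dr_release(1.3, k1):.3f}")
    print(f"pv at 1.3 mM = {pv_at(1.3, k1):.4f}")
```

      With these inputs the sketch recovers K1 close to 0.48 and a pv at 1.3 mM just below 0.2, under the premise in step (c) that pv approaches 1.0 at saturating Ca2+.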

      Thank you very much for your kind explanation. This interpretation, however, is based on the premise that p<sub>v</sub> is not saturated at low [Ca<sup>2+</sup>]<sub>o</sub>, and that Pr = p<sub>v</sub>. In the present study, we presented multiple convergent lines of evidence supporting that p<sub>v</sub> is already saturated at 1.3 mM [Ca<sup>2+</sup>]<sub>o</sub>, as follows: (1) little effect of EGTA-AM on the baseline EPSCs (Figure 2—figure supplement 1); (2) high double failure rates (Figure 3—figure supplement 2); (3) little effect of high [Ca<sup>2+</sup>]<sub>o</sub> on baseline EPSC (Figure 3—figure supplement 3). Therefore, our results suggest that the classical Dodge-Rahamimoff fourth-power relationship cannot be applied to estimate p<sub>v</sub> at the L2/3 recurrent excitatory synapses.

      (3) "If so, we can not understand why depletion-dependent PPD should lead to PPF." When PPD is caused by depletion and pv < 0.2, the number of occupied release sites should not be decreased by more than one-fifth at the second stimulus so, without facilitation, PPR should be > 0.8. The EGTA results then indicate there should be strong facilitation, driving PPR to something like 1.2 with conservative assumptions. And yet, a value of < 0.4 is measured, which is a large miss.

      As mentioned above, the framework used for inferring that p<sub>v</sub> < 0.2, the Dodge-Rahamimoff equation, is not applicable to our experimental system. Consequently, the subsequent deduction that depletion-dependent PPD should logically lead to PPF is based on a model that is not compatible with the aforementioned multiple convergent lines of evidence, which support high p<sub>v</sub> rather than the low p<sub>v</sub> facilitation model.

      (4) Despite the authors' suggestion to the contrary, I continue to think there is a substantial chance that Ca2+-channel inactivation is the mechanism underlying the very quickly reverting paired-pulse depression. However, this is only one example of a non-depletion mechanism among many, with the main point being that any non-depletion mechanism would undercut the reasoning for overfilling. And, this is what Dobrunz and Stevens claimed to show; that the mechanism - whatever it is - does not involve depletion. The most effective way to address this would be affirmative experiments showing that the quickly reverting depression is caused by depletion after all. Attempting to prove that Ca2+-channel inactivation does not occur does not seem like a worthwhile strategy because it would not address the many other possibilities.

      We have systematically ruled out alternative possibilities that may underlie the strong PPD observed at our synapses and demonstrated that it arises from high p<sub>v</sub>-induced vesicle depletion through multiple independent lines of evidence. First, we excluded (1) AMPAR desensitization or saturation (Figure 1—figure supplement 5), (2) Ca<sup>2+</sup> channel inactivation (Figure 2—figure supplement 2), (3) channelrhodopsin inactivation (Figure 1—figure supplement 2), (4) artificial bouton stimulation (Figure 1—figure supplement 4), and (5) transient vesicle undocking (Figure 5; addressed in our previous rebuttal). Second, EGTA-AM experiments (Figure 2, Figure 2—figure supplement 1) revealed that release sites are tightly coupled to Ca<sup>2+</sup> channels, and that EGTA further exacerbates PPD. Third, we validated high baseline p<sub>v</sub> through analysis of double failure rates (Figure 3—figure supplement 2). Fourth, the minimal increase in baseline EPSCs upon elevation of external [Ca<sup>2+</sup>] (Figure 3—figure supplement 3) further supports that baseline p<sub>v</sub> is already saturated at low [Ca<sup>2+</sup>]<sub>o</sub>. Additionally, to further validate our hypothesis, we performed the specific experiment suggested by the reviewer. We have now added EGTA pre-incubation experiments (Figure 3—figure supplement 3C-D) and have revised the manuscript. Specifically, when slices were pre-incubated with 50 μM EGTA-AM, elevation of extracellular [Ca<sup>2+</sup>] from 1.3 to 2.5 mM produced no significant change in either baseline EPSC amplitude or PPR, strongly supporting that the high [Ca<sup>2+</sup>]<sub>o</sub> effects in the absence of EGTA are primarily mediated by changes in p<sub>occ</sub> rather than p<sub>v</sub>.

      (5) True that Kusick et al. observed morphological re-docking, but then vesicles would have to re-prime and Mahfooz et al. (2016) showed that re-priming would have to be slower than 110 ms (at least during heavy use at calyx of Held).

      As previously discussed, Kusick et al. (2020) demonstrated that the transient destabilization of the docked vesicle pool recovers very rapidly within 14 ms after stimulation. This implies that any post-stimulation undocking events are likely recovered before the 20 ms ISI used in our PPR experiments. Consequently, transient undocking/re-docking events are unlikely to significantly influence the PPR measured at this interval. Furthermore, regarding the slow re-priming kinetics (>100 ms) reported by Mahfooz et al. (2016) and Kusick et al. (2020), our 20 ms ISI effectively falls into a time window that avoids the potential confounds of both processes: it is long enough for the rapid morphological recovery (~14 ms) of docked vesicles to occur, yet too short for the slow re-priming process to make a substantial contribution. In addition, Vevea et al. (2021) showed that post-stimulus undocking is facilitated in synaptotagmin-7 (Syt7) knockout synapses. In our study, however, Syt7 knockdown did not affect PPR at 20 ms ISI, suggesting that the undocking process described in Kusick et al. (2020) is not a major contributor to the PPD observed at 20 ms intervals in our experiments. Therefore, we conclude that the 20 ms ISI used in our experiments falls within a time window that is influenced neither by the rapid undocking (<14 ms) nor by the slow re-priming process (>100 ms).

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The revised manuscript presents an interesting and technically competent set of experiments exploring the role of the infralimbic cortex (IL) in extinction learning. The inclusion of histological validation in the supplemental material improves the transparency and credibility of the results, and the overall presentation has been clarified. However, several key issues remain that limit the strength of the conclusions.

      We thank the Reviewer for their positive assessment of our revised manuscript. We discuss the issues raised by the Reviewer below.

      The behavioral effects reported are modest, as evident from the trial-by-trial data included in the supplemental figures. Although the authors interpret their findings as evidence that IL stimulation facilitates extinction only after prior inhibitory learning, this conclusion is not directly supported by their data. The experiments do not include a condition in which IL stimulation is delivered during extinction training alone, without prior inhibitory experience. Without this control, the claim that prior inhibitory memory is necessary for facilitation remains speculative.

      The manuscript provides evidence across five experiments (Figures 2-6) that IL stimulation fails to facilitate extinction training in the absence of prior inhibitory experience. We therefore remain confident that the data support our conclusion: prior inhibitory learning enables IL stimulation to facilitate subsequent inhibitory learning.

      The electrophysiological example provided shows that IL stimulation induces a sustained inhibition that outlasts the stimulation period. This prolonged suppression could potentially interfere with consolidation processes following tone presentation rather than facilitating them. The authors should consider and discuss this alternative interpretation in light of their behavioral data.

      The possibility that IL stimulation exerted its effects by interfering with consolidation processes is inconsistent with the literature. Disrupting consolidation processes in the IL impairs extinction learning (1), even when animals have prior inhibitory learning experience (2). Yet our experiments found that IL stimulation failed to interfere with initial extinction learning but instead facilitated subsequent learning. Furthermore, the electrophysiological example demonstrates that the inhibitory effect is transient: the cell returned to firing properties similar to those observed pre-stimulation, making it unlikely that inhibition persists during the consolidation window.

      It is unfortunate that several animals had to be excluded after histological verification, but the resulting mismatch between groups remains a concern. Without a power analysis indicating the number of subjects required to achieve reliable effects, it is difficult to determine whether the modest behavioral differences reflect genuine biological variability or insufficient statistical power. Additional animals may be needed to properly address this imbalance.

      As noted in the revised manuscript, we are confident about the reliability of the findings reported. The manuscript provides evidence across five experiments that IL stimulation fails to facilitate brief extinction in the absence of prior inhibitory experience, replicating previous findings (3, 4). The manuscript also replicates these prior studies by demonstrating that experience with either fear or appetitive extinction enables IL stimulation to facilitate subsequent fear extinction. Furthermore, the present experiments replicate the facilitative effects of IL stimulation following fear or appetitive backward conditioning.

      Overall, while the manuscript is improved in clarity and methodological detail, the behavioral effects remain weak, and the mechanistic interpretation requires stronger experimental support and consideration of alternative explanations.

      We respectfully disagree with the assertion that the reported results are weak. The manuscript replicates all main findings internally or reproduces findings from previously published studies. While alternative explanations cannot be entirely excluded, we are not aware of any competing account that predicts the pattern of results reported here.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors examine the mechanisms by which stimulation of the infralimbic cortex (IL) facilitates the retention and retrieval of inhibitory memories. Previous work has shown that optogenetic stimulation of the IL suppresses freezing during extinction but does not improve extinction recall when extinction memory is probed one day later. When stimulation occurs during a second extinction session (following a prior stimulation-free extinction session), freezing is suppressed during the second extinction as well as during the tone test the following day. The current study was designed to further explore the facilitatory role of the IL in inhibitory learning and memory recall. The authors conducted a series of experiments to determine whether recruitment of IL extends to other forms of inhibitory learning (e.g., backward conditioning) and to inhibitory learning involving appetitive conditioning. Further, they assessed whether their effects could be explained by stimulus familiarity. The results of their experiments show that backward conditioning, another form of inhibitory learning, also enabled IL stimulation to enhance fear extinction. This phenomenon was not specific to aversive learning as backward appetitive conditioning similarly allowed IL stimulation to facilitate extinction of aversive memories. Finally, the authors ruled out the possibility that IL facilitated extinction merely because of prior experience with the stimulus (e.g., reducing the novelty of the stimulus). These findings significantly advance our understanding of the contribution of IL to inhibitory learning. Namely, they show that the IL is recruited during various forms of inhibitory learning and its involvement is independent of the motivational value associated with the unconditioned stimulus.

      We thank the Reviewer for their positive assessment.

      Strengths to highlight:

      (1) Transparency about the inclusion of both sexes and the representation of data from both sexes in figures

      We thank the Reviewer for their positive assessment.

      (2) Very clear representation of groups and experimental design for each figure

      We thank the Reviewer for their positive assessment.

      (3) The authors were very rigorous in determining the neurobehavioral basis for the effects of IL stimulation on extinction. They considered multiple interpretations and designed experiments to address these possible accounts of their data.

      We thank the Reviewer for their positive assessment.

      (4) The rationale for and the design of the experiments in this manuscript are clearly based on a wealth of knowledge about learning theory. The authors leveraged this expertise to narrow down how the IL encodes and retrieves inhibitory memories.

      We thank the Reviewer for their positive assessment.

      Reviewer #3 (Public review):

      Summary:

      This is a really nice manuscript with different lines of evidence to show that the IL encodes inhibitory memories that can then be manipulated by optogenetic stimulation of these neurons during extinction. The behavioral designs are excellent, with converging evidence using extinction/re-extinction, backwards/forwards aversive conditioning, and backwards appetitive/forwards aversive conditioning. Additional factors, such as nonassociative effects of the CS or US, also are considered, and the authors evaluate the inhibitory properties of the CS with tests of conditioned inhibition. The authors have addressed the prior reviews. I still think it is unfortunate that the groups were not properly balanced in some of the figures (as noted by the authors, they were matched appropriately in real time, but some animals had to be dropped after histology, which caused some balancing issues). I think the overall pattern of results is compelling enough that more subjects do not need to be added, but it would still be nice to see more acknowledgement and statistical analyses of how these pre-existing differences may have impacted test performance.

      We thank the Reviewer for their positive assessment of our revised manuscript. We discuss the comments regarding group balancing below.

      Strengths:

      The experimental designs are very rigorous with an unusual level of behavioral sophistication.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      The various group differences in Figure 2 prior to any manipulation are still problematic. There was a reliable effect of subsequent group assignment in Figure 2 (p<0.05, described as "marginal" in multiple places). Then there are differences in extinction (nonsignificant at p=.07). The test difference between ReExt OFF/ON is identical to the difference at the end of extinction and the beginning of Forward 2, in terms of absolute size. I really don't think much can be made of the test result. The authors state in their response that this difference was not evident during the forward phase, but there clearly is a large ordinal difference on the first trial. I think it is appropriate to only focus on test differences when groups are appropriately matched, but when there are pre-existing differences (even when not statistically significant) then they really need to be incorporated into the statistical test somehow.

      We carefully considered the Reviewer's suggestion, but it is not possible to adjust the statistical analyses at test because these analyses do not directly compare the two ReExt groups. Any scaling of performance would require including the two Ext groups, which is not feasible since these groups did not receive initial extinction. Moreover, the analyses provide no conclusive evidence of pre-existing differences between the two ReExt groups: the difference was not significant during initial extinction and was absent during the Forward 2 stage. We acknowledge that closer performance between the two ReExt groups during initial extinction would have been preferable. However, we remain confident in the results obtained because they replicate previous experiments in which the two ReExt groups displayed identical performance during initial extinction.

      The same problem is evident in Figure 4B, but here the large differences in the Same groups are opposite to the test differences. It's hard to say how those large differences ultimately impacted the test results. I suppose it is good that the differences during Forward conditioning did not ultimately predict test differences, but this really should have been addressed with more subjects in these experiments. The authors explore the interactions appropriately but with n=6 in the various subgroups, it's not surprising that some of these effects were not detected statistically.

      As the Reviewer noted, the unexpected differences in Figure 4B are opposite in direction to the test differences. Importantly, Figure 4B replicates the main findings from Figure 3, which did not show these unexpected differences.

      It is useful to see the trial-by-trial test data now presented in the supplement. I think the discussion does a good job of addressing the issues of retrieval, but the ideas of Estes about session cues that the authors bring up in their response haven't really held up over the years (e.g., Robbins, 1990, who explicitly tested this; other demonstrations of within-session spontaneous recovery), for what it's worth.

      We thank the Reviewer for bringing Robbins’ work on session cues to our attention. We understand that the issue of retrieval is important but, as we noted before, our manuscript and its conclusions do not claim to differentiate retrieval from additional learning.

      References

      (1) K. E. Nett, R. T. LaLumiere, Infralimbic cortex functioning across motivated behaviors: Can the differences be reconciled? Neurosci Biobehav Rev 131, 704–721 (2021).

      (2) V. Laurent, R. F. Westbrook, Inactivation of the infralimbic but not the prelimbic cortex impairs consolidation and retrieval of fear extinction. Learn Mem 16, 520–529 (2009).

      (3) N. W. Lingawi, R. F. Westbrook, V. Laurent, Extinction and Latent Inhibition Involve a Similar Form of Inhibitory Learning that is Stored in and Retrieved from the Infralimbic Cortex. Cereb Cortex 27, 5547–5556 (2017).

      (4) N. W. Lingawi, N. M. Holmes, R. F. Westbrook, V. Laurent, The infralimbic cortex encodes inhibition irrespective of motivational significance. Neurobiol Learn Mem 150, 64–74 (2018).


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The manuscript reports a series of experiments designed to test whether optogenetic activation of infralimbic (IL) neurons facilitates extinction retrieval and whether this depends on animals' prior experience. In Experiment 1, rats underwent fear conditioning followed by either one or two extinction sessions, with IL stimulation given during the second extinction; stimulation facilitated extinction retrieval only in rats with prior extinction experience. Experiments 2 and 3 examined whether backward conditioning (CS presented after the US) could establish inhibitory properties that allowed IL stimulation to enhance extinction, and whether this effect was specific to the same stimulus or generalized to different stimuli. Experiments 5 - 7 extended this approach to appetitive learning: rats received backward or forward appetitive conditioning followed by extinction, and then fear conditioning, to determine whether IL stimulation could enhance extinction in contexts beyond aversive learning and across conditioning sequences. Across studies, the key claim is that IL activation facilitates extinction retrieval only when animals possess a prior inhibitory memory, and that this effect generalizes across aversive and appetitive paradigms.

      Strengths:

      (1) The design attempts to dissect the role of IL activity as a function of prior learning, which is conceptually valuable.

      We thank the Reviewer for their positive assessment.

      (2) The experimental design of probing different inhibitory learning approaches to probe how IL activation facilitates extinction learning was creative and innovative.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) Non-specific manipulation.

      ChR2 was expressed in IL without distinction between glutamatergic and GABAergic populations. Without knowing the relative contribution of these cell types or the percentage of neurons affected, the circuit-level interpretation of the results is unclear.

      ChR2 was intentionally expressed in the infralimbic cortex (IL) without distinction between local neuronal populations for two reasons. First, the primary aim of this study was to uncover some of the features characterizing the encoding of inhibitory memories in the IL, and this encoding likely engages interactions among various neuronal populations within the IL. Second, the hypotheses tested in the manuscript derived from findings that indiscriminately stimulated the IL using the GABA<sub>A</sub> receptor antagonist picrotoxin, which is best mimicked by the approach taken. We agree that it is also important to determine the respective contributions of distinct IL neuronal populations to inhibitory encoding; however, the global approach implemented in the present experiments represents a necessary initial step. These matters have been incorporated in the Discussion of the revised manuscript.

      (2) Extinction retrieval test conflates processes

      The retrieval test included 8 tones. Averaging across this many tone presentations conflates extinction retrieval/expression (early tones) with further extinction learning (later tones). A more appropriate analysis would focus on the first 2-4 tones to capture retrieval only. As currently presented, the data do not isolate extinction retrieval.

      It is unclear when retrieval of what has been learned across extinction ceases and additional extinction learning occurs. In fact, it is only the first stimulus presentation that unequivocally permits a distinction between retrieval and additional extinction learning, as the conditions for this additional learning have not been fulfilled at that presentation. However, confining evidence for retrieval to the first stimulus presentation introduces concerns that other factors could influence performance. For instance, processing of the stimulus present at the start of the session may differ from that present at the end of the previous session, thereby affecting what is retrieved. Such differences between the stimuli present at the start and end of an extinction session have been long recognized as a potential explanation for spontaneous recovery (Estes, 1955). More importantly, whether the test data presented confound retrieval and additional extinction learning or not, the interpretation remains the same with respect to the effects of a prior history of inhibitory learning on enabling the facilitative effects of IL stimulation. Finally, it is unclear how these facilitative effects could occur in the absence of the subjects retrieving the extinction memory formed under the stimulation. Nevertheless, the revised manuscript now provides the trial-by-trial performance (see Supplemental Figure 3) during the post-extinction retrieval tests and addresses this issue in the Discussion.

      (3) Under-sampling and poor group matching.

      Sample sizes appear small, which may explain why groups are not well matched in several figures (e.g., 2b, 3b, 6b, 6c) and why there are several instances of unexpected interactions (protocol, virus, and period). This baseline mismatch raises concerns about the reliability of group differences.

      Efforts were made to match group performance upon completion of each training stage and before IL stimulation. Unfortunately, these efforts were not completely successful due to exclusions following post-mortem analyses. This has been made explicit in the revised manuscript (Materials and Methods, Subjects section). However, we acknowledge that the unexpected interactions deserve further discussion, and this has been incorporated into the revised manuscript (see also comment from Reviewer 2). Although we cannot exclude the possibility that sample sizes may have contributed to some of these interactions, we remain confident about the reliability of the main findings reported, especially given their replication across the various protocols. Overall, the manuscript provides evidence that IL stimulation does not facilitate brief extinction in the absence of prior inhibitory experience in five different experiments, replicating previous findings (Lingawi et al., 2018; Lingawi et al., 2017). It also replicates these previous findings by showing that prior experience with either fear or appetitive extinction enables IL stimulation to facilitate subsequent fear extinction. Furthermore, the facilitative effects of such stimulation following fear or appetitive backward conditioning are replicated in the present manuscript. This is discussed in the Discussion of the revised manuscript.

      (4) Incomplete presentation of conditioning data

      Figure 3 only shows a single conditioning session despite five days of training. Without the full dataset, it is difficult to evaluate learning dynamics or whether groups were equivalent before testing.

      We apologize, as we incorrectly labeled the X axis for the backward conditioning data in Figures 3B, 4B, 4D and 5B. It should have indicated “Days” instead of “Trials”. This error has been corrected in the revised manuscript (see also second comment from Reviewer 2).

      (5) Interpretation stronger than evidence.

      The authors conclude that IL activation facilitates extinction retrieval only when an inhibitory memory has been formed. However, given the caveats above, the data are insufficient to support such a strong mechanistic claim. The results could reflect nonspecific facilitation or disruption of behavior by broad prefrontal activation. Moreover, there is compelling evidence that optogenetic activation of IL during fear extinction does facilitate subsequent extinction retrieval without prior extinction training (Do-Monte et al., 2015; Chen et al., 2021), which the authors do not directly test in this study.

      As noted above, the interpretations of the main findings stand whether the test data confound retrieval with additional extinction learning or not. The revised manuscript also clarifies the plotting of the data for the backward conditioning stages. We do agree that further discussion of the unexpected interactions is necessary, and this has been incorporated into the revised manuscript. However, the various replications of the core findings provide strong evidence for their reliability and the interpretations advanced in the original manuscript. The proposal that the results reflect non-specific facilitation or disruption of behavior seems highly unlikely. Indeed, the present experiments and previous findings (Lingawi et al., 2018; Lingawi et al., 2017) provide multiple demonstrations that IL stimulation fails to produce any facilitation in the absence of prior inhibitory experience with the target stimulus. Although these demonstrations appear inconsistent with previous studies (Do-Monte et al., 2015; Chen et al., 2021), this inconsistency is likely explained by the fact that these studies manipulated activity in specific IL neuronal populations. Previous work has already revealed differences between manipulations targeting discrete IL neuronal populations as opposed to general IL activity (Kim et al., 2016). Importantly, as previously noted, the present manuscript aimed to generally explore inhibitory encoding in the IL that is likely to engage several neuronal populations within the IL. Adequate statements on these matters have been included in the Discussion of the revised manuscript.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors examine the mechanisms by which stimulation of the infralimbic cortex (IL) facilitates the retention and retrieval of inhibitory memories. Previous work has shown that optogenetic stimulation of the IL suppresses freezing during extinction but does not improve extinction recall when extinction memory is probed one day later. When stimulation occurs during a second extinction session (following a prior stimulation-free extinction session), freezing is suppressed during the second extinction as well as during the tone test the following day. The current study was designed to further explore the facilitatory role of the IL in inhibitory learning and memory recall. The authors conducted a series of experiments to determine whether recruitment of IL extends to other forms of inhibitory learning (e.g., backward conditioning) and to inhibitory learning involving appetitive conditioning. Further, they assessed whether their effects could be explained by stimulus familiarity. The results of their experiments show that backward conditioning, another form of inhibitory learning, also enabled IL stimulation to enhance fear extinction. This phenomenon was not specific to aversive learning, as backward appetitive conditioning similarly allowed IL stimulation to facilitate extinction of aversive memories. Finally, the authors ruled out the possibility that IL facilitated extinction merely because of prior experience with the stimulus (e.g., reducing the novelty of the stimulus). These findings significantly advance our understanding of the contribution of IL to inhibitory learning. Namely, they show that the IL is recruited during various forms of inhibitory learning, and its involvement is independent of the motivational value associated with the unconditioned stimulus.

      Strengths:

      (1) Transparency about the inclusion of both sexes and the representation of data from both sexes in figures.

      We thank the Reviewer for their positive assessment.

      (2) Very clear representation of groups and experimental design for each figure.

      We thank the Reviewer for their positive assessment.

      (3) The authors were very rigorous in determining the neurobehavioral basis for the effects of IL stimulation on extinction. They considered multiple interpretations and designed experiments to address these possible accounts of their data.

      We thank the Reviewer for their positive assessment.

      (4) The rationale for and the design of the experiments in this manuscript are clearly based on a wealth of knowledge about learning theory. The authors leveraged this expertise to narrow down how the IL encodes and retrieves inhibitory memories.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) In Experiment 1, although not statistically significant, it does appear as though the stimulation groups (OFF and ON) differ during Extinction 1. It seems like this may be due to a difference between these groups after the first forward conditioning. Could the authors have prevented this potential group difference in Extinction 1 by re-balancing group assignment after the first forward conditioning session to minimize the differences in fear acquisition (the authors do report a marginally significant effect between the groups that would undergo one vs. two extinction sessions in their freezing during the first conditioning session)?

      Efforts were made daily to match group performance across the training stages, but these efforts were ultimately hampered by the necessary exclusions following postmortem analyses. This has been made explicit in the revised manuscript (Materials and Methods, Subjects section). Regarding freezing during Extinction 1, as noted by the Reviewer, the difference, which was not statistically significant, was absent across trials during the subsequent forward fear conditioning stage. Likewise, the protocol difference observed during the initial forward fear conditioning was absent in subsequent stages. We are therefore confident that these initial differences (significant or not) did not impact the main findings at test. Importantly, these findings replicate previous work using identical protocols in which no differences were present during the training stages. These considerations have been addressed in the revised manuscript (see Results for Experiment 1).

      (2) Across all experiments (except for Experiment 1), the authors state that freezing during the initial conditioning increased across "days". The figures that correspond to this text, however, show that freezing changes across trials. In the methods, the authors report that backward conditioning occurred over 5 days. It would be helpful to understand how these data were analyzed and collated to create the final figures. Was the freezing averaged across the five days for each trial for analyses and figures?

      We apologize, as noted above, for having incorrectly labeled the X axis across the backward conditioning data sets in Figures 3B, 4B, 4D and 5B. It should have indicated “Days” instead of “Trials”. The data shown in these Figures use the average of all trials on a given day. This has been clarified in the methods section of the revised manuscript (Statistical Analyses section). The labeling errors on the Figures have been corrected.

      (3) In Experiment 3, the authors report a significant Protocol X Virus interaction. It would be useful if the authors could conduct post-hoc analyses to determine the source of this interaction. Inspection of Figure 4B suggests that freezing during the two different variants of backward conditioning differs between the virus groups. Did the authors expect to see a difference in backward conditioning depending on the stimulus used in the conditioning procedure (light vs. tone)? The authors don't really address this confounding interaction, but I do think a discussion is warranted.

      We agree with the Reviewer that further discussion of the Protocol x Virus interaction that emerged during the backward conditioning and forward conditioning stages of Experiment 3 is warranted. This discussion has been provided in the revised manuscript (see Results section). Briefly, during both stages, follow-up analyses did not reveal any differences (main effects or interactions) between the two groups trained with the light stimulus (Diff-EYFP and Diff-ChR2). By contrast, the ChR2 group trained with the tone (Back-ChR2) froze more overall than the EYFP group (Back-EYFP), but there were no other significant differences between the two groups. Based on these analyses, the Protocol x Virus interaction appears to be driven by greater freezing in the ChR2 group trained with the tone rather than a difference in the backward conditioning performance based on stimulus identity. Consistent with this, the statistical analyses did not reveal a main effect of Protocol during either the backward conditioning stage or the stimulus trials during the forward conditioning stage. Nevertheless, during this latter stage, a main effect of Protocol emerged during baseline performance, but once again, this seems to be driven by the Back-ChR2 group. Critically, it is unclear how greater stimulus freezing in the Back-ChR2 group during forward conditioning would lead to lower freezing during the post-extinction retrieval test.

      We note that an unexpected Protocol x Period interaction was found during appetitive backward conditioning in Experiment 5. For consistency, we conducted additional analyses to determine the source of this interaction (see Results section). As previously noted, performance during appetitive backward conditioning is noisy and cannot be taken as a failure to generate inhibitory learning. It is therefore unlikely that this interaction implied a difference in such learning.

      (4) In this same experiment, the authors state that freezing decreased during extinction; however, freezing in the Diff-EYFP group at the start of extinction (first bin of trials) doesn't look appreciably different than their freezing at the end of the session. Did this group actually extinguish their fear? Freezing on the tone test day also does not look too different from freezing during the last block of extinction trials.

      We confirm that, overall, there was a significant decline in freezing across the extinction session shown in Figure 4B. The Reviewer is correct to point out that this decline was modest (if not negligible) in the Diff-EYFP group, which was receiving its first inhibitory training with the target tone stimulus. It is worth noting that across all experiments, most groups that did not receive infralimbic stimulation displayed a modest decline in freezing during the extinction session since it was relatively brief, involving only 6 or 8 tone-alone presentations. This was intentional, as we aimed for the brief extinction session to generate minimal inhibitory learning and thereby to detect any facilitatory effect of infralimbic stimulation. This has been clarified and explained in the revised version of the manuscript (see Results section, description of Experiment 1).

      (5) The Discussion explored the outcomes of the experiments in detail, but it would be useful for the authors to discuss the implications of their findings for our understanding of circuits in which the IL is embedded that are involved in inhibitory learning and memory. It would also be useful for the authors to acknowledge in the Discussion that although they did not have the statistical power to detect sex differences, future work is needed to explore whether IL functions similarly in both sexes.

      In line with the Reviewer’s suggestion (see also Reviewer 3), the Discussion section has been substantially altered in the revised manuscript. Among other things, it does mention that future studies will need to examine the role of additional brain regions in the effects reported and it acknowledges the need to further explore sex differences and IL functions.

      Reviewer #3 (Public review):

      Summary:

      This is a really nice manuscript with different lines of evidence to show that the IL encodes inhibitory memories that can then be manipulated by optogenetic stimulation of these neurons during extinction. The behavioral designs are excellent, with converging evidence using extinction/re-extinction, backwards/forwards aversive conditioning, and backwards appetitive/forwards aversive conditioning. Additional factors, such as nonassociative effects of the CS or US, are also considered, and the authors evaluate the inhibitory properties of the CS with tests of conditioned inhibition.

      Strengths:

      The experimental designs are very rigorous with an unusual level of behavioral sophistication.

      We thank the Reviewer for their positive assessment.

      Weaknesses:

      (1) More justification for parametric choices (number of days of backwards vs forwards conditioning) could be provided.

      All experimental parameters were based on previously published experiments showing the capacity of the backward conditioning protocols to generate inhibitory learning and the forward conditioning protocols to produce excitatory learning. Although this was mentioned in the methods section, we acknowledge that further explanation was required to justify the need for multiple days of backward training. This has been provided in the revised manuscript (see Results section and description of the backward parameters).

      (2) The current discussion could be condensed and could focus on broader implications for the literature.

      The discussion has been substantially condensed, and broader implications have been discussed with respect to the existing literature on the neural circuitry underlying inhibitory learning.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Re-analyze extinction retrieval, focusing only on the first 2-4 tones to capture extinction expression.

      This recommendation corresponds to the second public comment made by the Reviewer, and we have replied to this comment.

      (2) Directly test whether activation of IL during fear extinction is insufficient to facilitate extinction retrieval without prior extinction training.

      The manuscript provides five separate demonstrations that the optogenetic approach to stimulate IL activity did not facilitate the initial brief extinction session. This reproduces what had been found with indiscriminate pharmacological stimulation in our previous research (Lingawi et al., 2018; Lingawi et al., 2017). We appreciate that other work that stimulated specific IL neuronal populations has observed facilitation of extinction, but the present manuscript focuses on the role of all IL neuronal populations in encoding inhibitory memories. The Reviewer’s request would imply contrasting the role of various neuronal populations, which is beyond the scope of this manuscript. Nevertheless, we have modified our discussion to indicate that future research should establish which IL neuronal population(s) contribute to the effects reported here.

      (3) Show the percentage of neurons that exhibit excitatory or inhibitory responses in IL after non-specific optogenetic activation to better understand how this manipulation is affecting IL circuitry.

      All electrophysiological recordings (n = 10 cells) are presented in Figure 1C. ChR2 excitation was substantial and overwhelming. Based on the physiological and morphological characteristics of the recorded cells, one was non-pyramidal and was excited by LED light delivery. The remaining 9 cells were pyramidal. One did not respond to LED delivery, but we cannot exclude the possibility that this was due to a lack of ChR2 expression in the somatic compartment. Another cell showed a mild reduction in activity following LED stimulation, while the remaining 7 cells displayed clear excitation upon LED stimulation. We have modified our manuscript to reflect these observations. We did not include percentages since only 10 recordings are shown.

      (4) Present data from all five conditioning sessions, not just one, to allow evaluation of learning history.

      This recommendation corresponds to the fourth public comment made by the Reviewer, and we have replied to this comment.

      (5) Address the issue of small and poorly matched groups, particularly in Figures 2b, 3b, 6b, and 6c.

      This recommendation corresponds to the third public comment made by the Reviewer, and we have replied to this comment.

      (6) Temper the conclusions to reflect the limitations of sampling, group matching, and the lack of specificity in the manipulation.

      We have modified our Discussion to address potential issues related to sampling and group matching. However, we are unsure how the lack of specificity of the IL stimulation has any impact on the interpretations made, since no statement is made about neuronal specificity. That said, as noted above, “we have modified our discussion to indicate that future research should establish which IL neuronal population(s) contribute to the effects reported here”.

      Reviewer #2 (Recommendations for the authors):

      Nothing additional to include beyond what is written for public view.

      Reviewer #3 (Recommendations for the authors):

      This is a really nice manuscript with different lines of evidence to show that the IL encodes inhibitory memories that can then be manipulated by optogenetic stimulation of these neurons during extinction. The behavioral designs are excellent, with converging evidence using extinction/re-extinction, backwards/forwards aversive conditioning, and backwards appetitive/forwards aversive conditioning. Additional factors, such as nonassociative effects of the CS or US, are also considered, and the authors evaluate the inhibitory properties of the CS with tests of conditioned inhibition. I only have a couple of comments that the authors may want to consider.

      We thank the Reviewer for their positive assessment.

      First, in Figure 2, it is unfortunate that there is a general effect of the LED assignment before the LED experience (p=.07 during that first extinction session). This is in the same direction as the difference during the test, so it is not clear that the test difference really reflects differences due to Extinction 2 treatment or to preexisting differences based on group assignments.

      The Reviewer’s comment is identical to the first public comment of Reviewer 2, which has been addressed.

      Second, it is notable that the backwards fear conditioning phase was conducted over 5 days, but the forward conditioning phase was conducted over one day. The rationale for these differences should be presented. There is an old idea going back to Konorski that backwards conditioning may lead to excitation initially, and it is only after more extensive trials that inhibitory conditioning occurs (a finding supported by Heth, 1976). Some discussion of the potential biphasic nature of backwards conditioning would be useful, especially for people who want to run this type of experiment but with only a single session of backwards conditioning.

      In line with the Reviewer’s suggestion, the revised manuscript (see Results section) provides an explanation for conducting backward conditioning across multiple days.

      Third, as written, each paragraph of the discussion is mostly a recapitulation of the findings from each experiment. This could be condensed significantly, and it would be nice to see more integration with the current literature and how these results challenge or suggest nuance in current thinking about IL function.

      We have significantly condensed the recapitulation of our findings in the Discussion of the revised manuscript. The Discussion now dedicates space to address comments from the other Reviewers and integrate the present findings with the current literature.

      References

      Chen, Y.-H., Wu, J.-L., Hu, N.-Y., Zhuang, J.-P., Li, W.-P., Zhang, S.-R., Li, X.-W., Yang, J.-M., & Gao, T.-M. (2021). Distinct projections from the infralimbic cortex exert opposing effects in modulating anxiety and fear. J Clin Invest, 131(14), e145692. https://doi.org/10.1172/JCI145692

      Do-Monte, F. H., Manzano-Nieves, G., Quiñones-Laracuente, K., Ramos-Medina, L., & Quirk, G. J. (2015). Revisiting the role of infralimbic cortex in fear extinction with optogenetics. J Neurosci, 35(8), 3607-3615. https://doi.org/10.1523/JNEUROSCI.3137-14.2015

      Estes, W. K. (1955). Statistical theory of spontaneous recovery and regression. Psychol Rev, 62(3), 145-154. https://doi.org/10.1037/h0048509

      Kim, H.-S., Cho, H.-Y., Augustine, G. J., & Han, J.-H. (2016). Selective Control of Fear Expression by Optogenetic Manipulation of Infralimbic Cortex after Extinction. Neuropsychopharmacology, 41(5), 1261-1273. https://doi.org/10.1038/npp.2015.276

      Lingawi, N. W., Holmes, N. M., Westbrook, R. F., & Laurent, V. (2018). The infralimbic cortex encodes inhibition irrespective of motivational significance. Neurobiol Learn Mem, 150, 64-74. https://doi.org/10.1016/j.nlm.2018.03.001

      Lingawi, N. W., Westbrook, R. F., & Laurent, V. (2017). Extinction and Latent Inhibition Involve a Similar Form of Inhibitory Learning that is Stored in and Retrieved from the Infralimbic Cortex. Cereb Cortex, 27(12), 5547-5556. https://doi.org/10.1093/cercor/bhw322

    1. Abstract: Advances in spatial omics enable measurement of genes (spatial transcriptomics) and peptides, lipids, or N-glycans (mass spectrometry imaging) across thousands of locations within a tissue. While detecting spatially variable molecules is a well-studied problem, robust methods for identifying spatially varying co-expression between molecule pairs remain limited. We introduce SpaceBF, a Bayesian fused modeling framework that estimates co-expression at both local (location-specific) and global (tissue-wide) levels. SpaceBF enforces spatial smoothness via a fused horseshoe prior on the edges of a predefined spatial adjacency graph, allowing large, edge-specific differences to escape shrinkage while preserving overall structure. In extensive simulations, SpaceBF achieves higher specificity and power than commonly used methods that leverage geospatial metrics, including bivariate Moran’s I and Lee’s L. We also benchmark the proposed prior against standard alternatives, such as intrinsic conditional autoregressive (ICAR) and Matérn priors. Applied to spatial transcriptomics and proteomics datasets, SpaceBF reveals cancer-relevant molecular interactions and patterns of cell–cell communication (e.g., ligand–receptor signaling), demonstrating its utility for principled, uncertainty-aware co-expression analysis of spatial omics data.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag006), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Daniel Domovic

      Dear authors,

      I read your manuscript "SpaceBF: Spatial coexpression analysis using Bayesian Fused approaches in spatial omics datasets" with interest.

      The manuscript presents SpaceBF, a Bayesian method for detecting spatial co-expression between pairs of molecules in spatial omics data. The topic is relevant because new technologies such as spatial transcriptomics, mass spectrometry imaging, and multiplex immunofluorescence produce large datasets, but current tools for co-expression analysis are limited. The authors address this gap with a new model and also test it on real datasets. The paper is technical, but it also gives biological examples, which is helpful for readers.

      The paper has many strong points. First, the idea of using a Bayesian fused horseshoe prior together with an MST spatial structure is new and well explained. Second, the authors apply their method to three real datasets and show interesting biology, for example the IGF2-IGF1R relation, keratin isoform consistency, and stromal ECM peptides. Third, I appreciate that the code is open on GitHub. The paper also compares SpaceBF with other methods and avoids the common problem of variance-stabilizing transforms by modeling UMI counts directly with a negative binomial distribution.

      Overall, the work is clear and well organized, but there are some points where more explanation or clarification would help. In my review I give major and minor remarks that I hope will improve the paper.

      Major remarks
      1. Were you worried choosing MST may oversimplify spatial relationships, since many meaningful local neighborhoods may be excluded? Would the results of SpaceBF be significantly different if a different spatial graph, such as kNN, Delaunay triangulation, or kernel-based, was used instead of MST?
      2. Since MST edges depend a lot on pairwise L2 distances, how stable are the results if spatial coordinates are a little noisy, or if there are tissue registration errors?
      3. The model puts one molecule as outcome and the other as predictor. Are the co-expression estimates still the same if you switch roles?
      4. In the Results you mention "FDR < 0.1." Can you explain which method you used for FDR? Also, are the discoveries robust if you change the threshold (for example 0.05 vs 0.1)?
      5. Do the simulation parameters (lengthscale, slope, dispersion) correspond to realistic biological signal strengths and spatial scales observed in real datasets? Three values of the lengthscale l are considered, l = 3.6, 7.2, 18. Why exactly these values? What does ν=0.75 mean in terms of effect size? How does l=18 compare to real tissue lengthscales?
      6. Can you describe runtime and memory for larger datasets, like 10X Visium with 5,000-20,000 spots? Is the current MCMC practical for this scale, or do you think approximate inference (like variational Bayes or INLA) is needed?
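      To make major remarks 1 and 2 concrete, the following is a minimal sketch of the kind of check being asked for. It is not taken from SpaceBF: the function names, the 50 random spots, the choice of k = 4, and the jitter size are all assumptions made for illustration. It builds an MST and a kNN graph over the same coordinates and measures what fraction of MST edges survive a small perturbation of the coordinates.

```python
import math
import random

def mst_edges(coords):
    """Prim's algorithm on the complete graph weighted by Euclidean (L2) distance."""
    n = len(coords)
    # best[j] = (distance from j to the growing tree, tree vertex achieving it)
    best = {j: (math.dist(coords[0], coords[j]), 0) for j in range(1, n)}
    edges = set()
    while best:
        j = min(best, key=lambda v: best[v][0])
        _, i = best.pop(j)
        edges.add((min(i, j), max(i, j)))
        for v in best:
            d = math.dist(coords[j], coords[v])
            if d < best[v][0]:
                best[v] = (d, j)
    return edges

def knn_edges(coords, k):
    """Undirected edges linking each spot to its k nearest neighbours."""
    edges = set()
    for i, p in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: math.dist(p, coords[j]))
        edges.update((min(i, j), max(i, j)) for j in others[:k])
    return edges

random.seed(0)
# Hypothetical tissue: 50 spots placed uniformly at random in a 10 x 10 square.
coords = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(50)]
# Hypothetical registration noise: small Gaussian jitter on every coordinate.
noisy = [(x + random.gauss(0, 0.05), y + random.gauss(0, 0.05)) for x, y in coords]

mst = mst_edges(coords)
knn = knn_edges(coords, k=4)
# Fraction of the original MST edges that are still MST edges after the jitter.
stability = len(mst & mst_edges(noisy)) / len(mst)
print(len(mst), len(knn), round(stability, 2))
```

      The MST always has one edge fewer than the number of spots, whereas the kNN graph keeps several neighbours per spot, which is the trade-off behind remark 1. Repeating the stability calculation over a range of jitter sizes would give a rough answer to remark 2.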

      Minor remarks
      1. How sensitive are the results to the choice of hyperparameters for the Horseshoe prior?
      2. In the Results you state that keratins "co-express highly, meaning their binding patterns with any specific type 1 keratin should be similar." Please make clear that SpaceBF measures co-expression, not direct binding, so that conclusions are not overstated.
      3. You mention SpatialCorr and Copulacci, but the comparison was not successful. Even if parameters were sensitive, I think one short numerical comparison in the supplement would be helpful.
      4. You filter out genes with fewer than ~59 total reads (0.2 x number of spots). Can you justify the choice of this threshold and show if results are stable for other thresholds (for example 0.1x or 0.5x)? Since many ligands and receptors are lowly expressed, is there a risk of losing meaningful biology? Since the dataset has only 293 spots, thresholds can have a strong effect.
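      The threshold arithmetic in minor remark 4 can be sketched as follows. The spot count (293) and the multipliers (0.1x, 0.2x, 0.5x) come from the remark itself; the gene names, the per-gene totals, and the helper functions are hypothetical and serve only to show how the retained set changes with the cutoff.

```python
import math

def min_total_reads(n_spots, factor):
    """Smallest whole read count that passes a 'factor x number of spots' cutoff."""
    return math.ceil(factor * n_spots)

def keep_genes(total_reads, n_spots, factor):
    """Names of genes whose total reads reach the cutoff."""
    cutoff = factor * n_spots
    return sorted(g for g, reads in total_reads.items() if reads >= cutoff)

n_spots = 293  # spot count quoted in the remark
# Hypothetical per-gene totals, chosen only to straddle the three cutoffs.
total_reads = {"geneA": 20, "geneB": 45, "geneC": 59, "geneD": 120, "geneE": 400}

for factor in (0.1, 0.2, 0.5):
    print(factor, min_total_reads(n_spots, factor),
          keep_genes(total_reads, n_spots, factor))
```

      With 293 spots, 0.2 x 293 = 58.6, so a gene needs at least 59 total reads to be kept, which matches the "~59" in the remark. Moving the multiplier from 0.1 to 0.5 raises the minimum from 30 to 147 reads, so lowly expressed ligands and receptors are the first to drop out.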

    1. 1.2. Kumail Nanjiani’s Reflections on Ethics in Tech Kumail Nanjiani was a star of the Silicon Valley TV Show, which was about the tech industry. He posted these reflections on ethics in tech on Twitter (@kumailn) on November 1, 2017: As a cast member on a show about tech, our job entails visiting tech companies/conferences etc. We meet ppl eager to show off new tech. Often we’ll see tech that is scary. I don’t mean weapons etc. I mean altering video, tech that violates privacy, stuff w obv ethical issues. And we’ll bring up our concerns to them. We are realizing that ZERO consideration seems to be given to the ethical implications of tech. They don’t even have a pat rehearsed answer. They are shocked at being asked. Which means nobody is asking those questions. “We’re not making it for that reason but the way ppl choose to use it isn’t our fault. Safeguard will develop.” But tech is moving so fast. That there is no way humanity or laws can keep up. We don’t even know how to deal with open death threats online. Only “Can we do this?” Never “should we do this?” We’ve seen that same blasé attitude in how Twitter or Facebook deal w abuse/fake news. You can’t put this stuff back in the box. Once it’s out there, it’s out there. And there are no guardians. It’s terrifying. The end. Kumail Nanjiani 1.2.1. Reflection questions: What do you think is the responsibility of tech workers to think through the ethical implications of what they are making? Why do you think the people who Kumail talked with didn’t have answers to his questions?

      I think tech workers have a responsibility to consider the ethical implications of what they create, because technology can shape behavior, privacy, and power in ways that are difficult to reverse. As Kumail Nanjiani points out, once technology is released, it cannot simply be taken back, so ethical thinking should happen before harm occurs.

      I think the people Kumail spoke with lacked answers because ethical reflection is often not prioritized in tech culture. Many developers focus on whether something can be built rather than whether it should be built, and since these questions are rarely asked, they may not be prepared to address them.

    1. R0:

      Reviewer #1: Peer Reviewer’s report for the submission “Reaching the 100 by 2027 target for universal access to rapid diagnostic tests for tuberculosis in Africa: in-sight but out of reach”

      Recommendation: Minor Revisions

      General Comment: This paper addresses a pertinent global health subject, a WHO priority research gap. The methods are sound and innovative. However, the authors need to improve the clarity of the paper.

      Abstract:
      - The authors did fantastic work summarizing the study in this abstract.
      - Kindly break the abstract into the standard sections: background, methods, results, conclusion.
      - Please state the name of the study design clearly. Is this an ecological study with mixed methods, or something else?

      Background:
      - Great job introducing the research gap and the pertinence of the research.
      - A brief perspective on funding gaps for diagnostics might strengthen this section.
      - Do not overestimate potential readers' knowledge of the subject: briefly describe what WRDs are and list them. Why are they so important?

      Methods:
      - This section is a bit too brief and does not present the work in a way that readers can easily reproduce. Use standard sub-headers such as study design, study population, study period, data collection and data analysis for clarity.
      - Again, what is the study design of this study?
      - WRDs were recommended 10 years ago; what is the rationale behind the period 2021-2023? I think the key landmarks for this are 2015 for End-TB, 2018 for the first UNHLM and 2023 for the second UNHLM.
      - Line 98-101: How were these cutoffs decided?
      - The study area is completely absent. It is important to shed more light on the 24 countries. Who are they, what is the burden of TB there, any peculiarities?
      - Benchmarks that needed a secondary calculation following extraction need to be presented clearly, showing the variables used as denominator and numerator.

      Results:
      - Kindly provide the exact number of cases tested for the different years before providing proportions. A standalone table could resolve this.
      - Line 151-161: I find it hard to see trends with just three yearly data points. You probably need to include more years if you want to discuss trends.
      - Did the Table 2 strategies come from the TB staff or the authors? They appear to come from the authors, in which case I do not think they belong in the results; at best they belong in the recommendations.

      Discussions:
      - The authors did a superb job discussing the available findings of the study.
      - Being a study with policy implications, kindly include a sub-header for policy implications of the findings and state them clearly.
      - Include sub-headers for strengths and limitations and outline them clearly.

      Reviewer #2: Review of Title: Reaching the 100 by 2027 target for universal access to rapid diagnostic tests for tuberculosis in Africa: in-sight but out of reach

      Summary of research and overall impression

      This is a well-written and researched article reporting on the availability and use of WHO-recommended rapid diagnostics for TB in African countries where there is a significant burden. The authors use routinely reported data to assess access to WRDs, and a small survey of programme staff from a subset of countries to identify barriers and facilitators to the inclusion of WRDs in diagnostic algorithms. The paper makes an important contribution to the TB literature by mapping the gaps in terms of access to and usage of WRDs, which is needed to strengthen TB control efforts. There are minor comments for the authors to address to strengthen the paper.

      Methods

      1. Include brief details on how/why the 24 countries included in the review were selected.
      2. More details are needed to describe the process for the country stakeholder survey. For example:

      • Specify what the questionnaire consisted of, i.e., closed and open-ended questions? What topic areas/sections were included/asked about? How/by whom was the questionnaire designed/developed, using/adapting an existing framework/questionnaire?
      • How were the questionnaires sent out? Were specific people targeted? How many were sent out? What was the timeframe?
      • Provide details of how/why the 6 countries were selected – e.g., 1-2 from each region? Who inputted on these decisions? The authors mention later that these were also selected based on WRD access, which should be mentioned here in methods.

      • It is unclear under ‘statistical analysis’ if this refers to analysis of all data, or just the data review. Suggest revising to clarify analysis for data review, and analysis for the stakeholder survey. Two things to consider: 1) Provide details on the data extracted and the analysis conducted. 2) It is unclear what is meant here: “The first author used topic guides that reflected content areas such as barriers and contextual factors influencing WRD use and the themes that emerged during the review of the survey responses to manually organise the data into thematic codes.” Is this referring to the stakeholder surveys? Suggest revising for clarity on the analysis process. Were any frameworks used in analysis to categorise barriers into categories and develop mitigation strategies? This process needs to be detailed in the methods to lead into the results.

      • Please clarify/confirm the ethics of surveying country stakeholders without a consent process, even if participants (country stakeholders) are not identifiable.

      Results

      Provide details of how many survey responses were received. Is it only 6 from 6 countries (as in lines 182-186)? How were respondents distributed across the 6 countries? Could they speak to the different country contexts? Later in the text there is mention of 16; please clarify this in the results.

      In lines 163 onwards, when referring to the analysed gaps in the TB diagnostic cascade, please clarify in the text throughout what is meant with ‘countries reported’ – is this a comparison of what is found in the data review with what is reported by country stakeholders?

      As mentioned earlier, the process for categorising the barriers and developing mitigation strategies must be introduced in the methods. “We then distilled the barriers into five categories and developed mitigation strategies (Table 3) to improve the use of WRDs across all 24 LabCoP countries.” Did you use a framework for this to guide at different health system levels? Suggest revising the three theme headings, as they currently read more like recommendation statements than findings, i.e., optimise…, strengthen…. To read as findings of the barriers and facilitators, they should be descriptive of what was found.

      - Theme 1: ‘optimise WRD capacity’ – clarify what ‘capacity’ is referring to. Under this heading there are multiple aspects included, i.e., policies, guidelines, as well as examples of how access to WRD has been improved, so examples of optimising WRD capacity?
      - Theme 2: seems to speak to 2 things: sample transportation and access to testing via active case finding. Clarify if/how these are linked.
      - Theme 3: insufficient financing, staffing, and infrastructure to implement WRD.

      Discussion

      Under strengths and limitations, the authors mention that ‘a planned report from our annual meeting will capture responses from all 24 countries’ – lines 362-363. This statement has limited relevance to the article unless the report is already publicly available and can be referenced. Suggest deleting it.

      The authors also mention ‘only reached out to the selected countries’ – line 361. Suggest phrasing this more positively, i.e., ‘we purposively selected a subset of 6 countries from the 24 within the LabCoP network, which may limit…’

      R1:

      Reviewer #2: Well done on an exceptionally well-written and important paper. I do have one pending comment about the number of survey responses, which I do not see reported in the results. It is important to include the number of respondents and how they were distributed across the 6 countries included in the survey.

  2. Jan 2026
    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work presents an interesting circuit dissection of the neural system allowing a ctenophore to keep its balance and orientation in its aquatic environment by using a fascinating structure called the statocyst. By combining serial-section electron microscopy with behavioral recordings, the authors found a population of neurons that exists as a syncytium and could associate these neurons with specific functions related to controlling the beating of cilia located in the statocyst. The type A ANN neurons participate in arresting cilia beating, and the type B ANN neurons participate in resuming cilia beating and increasing their beating frequency.

      Moreover, the authors found that bridge cells are connected with the ANN neurons, giving them the role of rhythmic modulators.

      From these observations, the authors conclude that the control is coordination instead of feedforward sensory-motor function, a hypothesis that had been put forth in the past but could not be validated until now. They also compare it to the circuitry implementing a similar behavior in a species that belongs to a different phylum, where the nervous system is thought to have evolved separately.

      Therefore, this work significantly advances our knowledge of the circuitry implementing the control of the cilia that participate in statocyst function, which ultimately allows the animal to correct its orientation. It represents an example of systems neuroscience explaining how the nervous system allows an animal to solve a specific problem and puts it in an evolutionary perspective, showing a convincing case of convergent evolution.

      Strengths:

      The evidence for how the circuitry is connected is convincing. Pictures of synapses showing the direction of connectivity are clear, and there are good reasons to believe that the diagram inferred is valid, even though we can always expect that some connections are missing.

      The evidence for how the cilia change their beating frequency is also convincing, and the paradigm and recording methods seem pretty robust.

      The authors achieved their aims, and the results support their conclusions. This work impacts its field by presenting a mechanism by which ctenophores correct their balance, which will provide a template for comparison with other sensory systems.

      Thank you very much for these comments.

      Weaknesses:

      The evidence supporting the claim that the neural circuitry presented here controls the cilia beating is more correlational because it only relies on the fact that the location of the two types of ANN neurons coincides with the quadrants that are affected in the behavioral recordings. Discussing ways by which causality could be established might be helpful.

      We have now added a new “Future Directions” section explaining that, for example, calcium imaging or targeted neuron ablations could be used in future work to establish causality. This would require the development of genetic delivery techniques, e.g. to introduce a GCaMP calcium sensor or transgenic reporters.

      The explanation of the relevance of this work could be improved. The conclusion that the work hints at coordination instead of feedforward sensory-motor control is explained over only a few lines. The authors could provide a more detailed explanation of how the two models compete (coordination vs feedforward sensory-motor control), and why choosing one option over the other could provide advantages in this context.

      We added a more detailed explanation about the two types of model and why we believe that a coordination model is more compatible with our connectome data.

      “An alternative model for the function of the nerve net would be a feedforward sensory-motor system, in which balancer cells provide mechanosensory input to motor effectors via the nerve net, similar to a reflex arc. None of our observations support such a sensory-motor model. There are no synaptic pathways from balancer cells or any other sensory cells to the nerve net. The only synaptic input to ANNs comes from the bridge cells (discussed below) and from each other. The three synaptically interconnected ANNs may generate endogenous rhythm that controls balancer cilia and is influenced by bridge input. ANNs may also be influenced by neuropeptides secreted by other aboral organ neurons. Such chemical inputs may underlie the flexibility of gravitaxis and its modulation by other cues (e.g. light). Overall, the coordination model parsimoniously explains both the ANN wiring topology and the observed dynamics, whereas a simple feedforward reflex does not.”

      Since the fact that the ANN neurons form a syncytium is an important finding of this study, it would be useful to have additional illustrations of it. For instance, pictures showing anastomosing membranes could typically be added in Figure 2.

      We have now included a movie (Video 3) showing a volumetric reconstruction of a segment of an ANN neuron, which highlights the anastomosing morphology in greater detail than static images.

      “Video 3. Volumetric reconstruction of a single ANN Q1-4 neuron showing syncytial soma (cyan) and nuclei (magenta). The rotating view highlights the anastomosing morphology, although not all fine details could be reconstructed due to data limitations.”

      Also, to better establish the importance of the study, it could be useful to explain why the balancers’ cilia spontaneously beat in the first place (instead of being static and just acting as stretch sensors).

      We have discussed in more detail why it may be important for the balancer cilia to beat.

      “The observation that balancer cilia beat spontaneously, even in the absence of external tilt, suggests that they are active sensory oscillators rather than static stretch sensors. Their spontaneous beating could set a dynamic baseline of sensitivity, which can then be modulated by ANN inputs or sensory changes during tilt. Such a dynamic system may be more sensitive to small deflections and be more responsive [@Lowe1997]. Thus, the regulated beating of balancer cilia should not be seen as noise, but as an adaptive feature that enables flexible and robust graviceptive responses. The ctenophore balancer may thus use active ciliary oscillations for enhanced sensorimotor integration similar to other sensory systems [@Wan_2023].”

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, the authors describe the production of a high-resolution connectome for the statocyst of a ctenophore nervous system. This study is of particular interest because of the apparent independent evolution of the ctenophore nervous system. The statocyst is a component of the aboral organ, which is used by ctenophores to sense gravity and regulate the activity of the organ’s balancer cilia. The EM reconstruction of the aboral organ was carried out on a five-day-old larva of the model ctenophore Mnemiopsis leidyi. To place their connectome data in a functional context, the authors used high-speed imaging of ciliary beating in immobilized larvae. With these data, the authors were able to model the circuitry used for gravity sensing in a ctenophore larva.

      Strengths:

      Because it is apparently the sister phylum to all other metazoans, Ctenophora is a particularly important group for studies of metazoan evolution. Thus, this work has much to tell us about how animals evolved. Added to that is the apparent independent evolution of the ctenophore nervous system. This study provides the first high-resolution connectomic analysis of a portion of a ctenophore nervous system, extending previous studies of the ctenophore nervous system carried out by Sid Tamm. As such, it establishes the methodology for high-resolution analysis of the ctenophore nervous system. While the generation of a connectome is in and of itself an important accomplishment, the coupling of the connectome data with analysis of the beating frequency of balancer cell cilia provides a functional context for understanding how the organization of the neural circuitry in the aboral organ carries out gravity sensing. In addition, the authors identified a new type of syncytial neuron in Mnemiopsis. Interestingly, the authors show that the neural circuitry controlling cilia beating in Mnemiopsis shares features with the circuitry that controls ciliary movement in the annelid Platynereis, suggesting convergent evolution of this circuitry in the two organisms. The data in this paper are of high quality, and the analyses have been thoroughly and carefully done.

      Weaknesses:

      The paper has no obvious weaknesses.

      We thank the reviewer for these comments.

      Reviewer #3 (Public review):

      Summary:

      It has been a long time since I enjoyed reviewing a paper as much as this one. In it, the authors generate an unprecedented view of the aboral organ of a 5-day-old ctenophore. They proceed to derive numerous insights by reconstructing the populations and connections of cell types, with up to 150 connections from the main Q1-4 neuron.

      Strengths:

      The strengths of the analysis are the sophisticated imaging methods used, the labor-intensive reconstruction of individual neurons and organelles, and especially the mapping of synapses. The synaptic connections to and from the main coordinating neurons allow the authors to create a polarized network diagram for these components of the aboral organ. These connections give insight into the potential functions of the major neurons. This also gives some unexpected results, particularly the lack of connections from the balancer system to the coordinating system.

      Thank you for these positive comments on the paper.

      Weaknesses:

      There were no significant weaknesses in the paper - only a slate of interesting unanswered questions to motivate future studies.

      Recommendations for the authors:

      Reviewing Editor Comments:

      In consultation, the reviewers recommend that improving the evidence to “exceptional” would require additional perturbation experiments (e.g., ablation of specific neurons), as Reviewer 1 suggests. They also recommend adding a “Future Directions” section to the manuscript, because it opens up so many new experimental directions.

      We have added a new “Future Directions” section at the end of the Discussion. Carrying out the proposed perturbation or calcium imaging experiments would require significant additional work and method development. We are actively working on establishing mRNA and DNA injection into ctenophore zygotes to enable live imaging, cell labelling or ablations in the future.

      Reviewer #1 (Recommendations for the authors):

      Suggestions for improved or additional experiments, data, or analyses:

      To establish causality (neurons control balancer cilia), an important experiment would be to manipulate each of these neuronal populations (e.g., by ablating them) and measure the effect of these ablations on the beating frequency of the balancer cilia of the four quadrants. Moreover, direct observation of neuronal activity (e.g., by using calcium imaging) would also provide more compelling evidence for neuronal control.

      We agree with the reviewer that such perturbation experiments would be needed to establish causality. Such experiments are not yet possible in ctenophores and would require significant technology development. We discuss such experiments in the “Future Directions” section and also place this in the context of the currently available techniques in ctenophores. We are actively working on this, but waiting for such technological breakthroughs and new experiments would significantly delay the publication of a version of record of the paper.

      Recommendations for improving the writing and presentation:

      ANN neurons are described in great detail, though SNN neurons are described more loosely. Perhaps a more detailed description of SNN neurons would be helpful.

      We added the information on SNNs to show that these cells are distinct from the ANN neurons. Since our focus is on the aboral organ, we did not aim for a comprehensive reconstruction of SNNs. Several of the processes of the SNNs are also truncated and outside our EM volume. We have nevertheless added additional details about the morphology and connectivity of SNN neurons.

      “Near the periphery of the aboral organ, we identified four further anastomosing nerve-net neurons. These resembled the previously reported syncytial subepithelial nerve net (SNN) neurons in the body wall of Mnemiopsis (Figure 2–figure supplement 1C–G) and were clearly distinct from the ANN neurons (both in location and morphology). SNN neurons show a blebbed morphology and contain dense core vesicles @Burkhardt2023 but no synapses.”

      Minor corrections to the text and figures:

      (1) Figure 2 C): “mitochondia” instead of “mitochondria”.

      corrected

      (2) Figure 3. Title: “balancer and and bridge”.

      corrected

      (3) Figure 3.C) “shown in xxx color”

      corrected

      Reviewer #2 (Recommendations for the authors):

      Clearer usage of the terms statocyst, aboral organ, aboral nerve net, statolith, dome, and lithocytes would be helpful. For readers not familiar with ctenophore anatomy, things can get a bit confusing. A single schematic with all of these terms would be helpful. In Figure 1E, there is a label “dc”. Should this be “do”?

      We have added an annotated schematic to Figure 1, explaining these terms.

      Figure 1C “The statocyst is a cavity-like organ enclosed by the dome cilia (do), which contains the statolith formed by lithocytes (li) and supported by the balancer cilia (bal).”

      Reviewer #3 (Recommendations for the authors):

      My comments are numerous, but mostly minor suggestions for improving the clarity.

      [Suggested insertions/changes are indicated by square brackets]

      (1) [It would be much easier to review this if there were line numbers, or with a double-spaced manuscript that was more accommodating for markup.]

      Thank you for this comment. We have increased the line spacing in the revised version. (We set the CSS line-height property on the html ‘body’ element to 2em).

      (2) The terms statolith, statocyst, and lithocytes can be confusing, so it would be nice to have an upfront definition of how they relate to each other.

      We have now explained these terms in the Introduction and have also improved the annotation of Figure 1.

      Figure 1C “The statocyst is a cavity-like organ enclosed by the dome cilia (do), which contains the statolith formed by lithocytes (li) and supported by the balancer cilia (bal).”

      (3) Statolith is spelled as statolyth in the early pages, but statolith in the later pages. I think -lith is more common, but in any case, these should be standardized.

      corrected to ‘statolith’

      ABSTRACT:

      (1) Differential load[s] on the balancer cilia [lead] to altered

      changed

      (2) We used volume electron microscopy (vEM) to image the aboral organ.

      changed

      (3) also form reciprocal connections with the bridge cells.

      corrected

      INTRODUCTION:

      (1) “identify conserved neuronal markers in ctenophores” - confusing - does this mean conserved across ctenophores, or conserved in ctenophores and other animals?

      changed to “classical neuronal markers”

      (2) “either increase or decrease their [ciliary] activity, indicating” - otherwise it sounds like the balancers are increasing activity.

      changed to “balancer cells may either increase or decrease their ciliary activity”

      (3) after “matches the setup used in high-speed imaging experiments”, it might be nice to add a statement like “Future studies could potentially investigate activity in the inverted orientation, when the statolith is suspended below the cilia, to see if the response differs.”

      In this sentence we referred to the orientation of the animals in our figures. There is a consensus among ctenophore researchers that when depicting ctenophores, the aboral organ should face downwards. However, for this paper we chose the opposite orientation to better match our experiments and help interpret the results. We changed the text to: “In this study, we represent ctenophores with their aboral organ facing upwards (“balancer-up” posture), as this configuration facilitates intuitive interpretation of balance-like functions and matches the setup used in high-speed imaging experiments.”

      We added the sentences “Future experiments could also explore how orientation affects the response of balancer cilia. For example, when the statolith is suspended below the cilia (the “balancer-down” posture), ciliary beating patterns may differ from what we observed here in the “balancer-up” configuration.” to the “Future Directions” section.

      (4) “abolished by calcium[-]channel inhibitors”

      corrected

      (5) “By functional imaging, we uncovered” - It is not clear what functional imaging is. Maybe add a few-word definition here, and be sure to explain it in the methods.

      changed to “By high-speed ciliary imaging”. The details of the imaging are explained in the Methods section under “Imaging the Activity of Balancer Cilia”.

      RESULTS:

      (1) “five-day-old” - is it worth saying post-fertilization here?

      Thank you for pointing this out. In accordance with Presnell et al. (2022), we use post-hatching as the reference. We have revised the text in the Materials and Methods section to read: “5-day-old (5 days post-hatching)”

      (2) “We classified these cells into cell types [based on …]” - specify a bit about how you classified them based on morphology, the presence of organelles, etc.

      We added a clarification. “Our classification was based on i) ultrastructural features (e.g. number of cilia), ii) cell morphology (e.g. nerve net or bridge cells), iii) unique organelles (e.g. lamellate body, plumose cells), and iv) similarities to cell types previously described by EM. Our classification agrees with the cell types identified in the 1-day-old larva [@ferraioli2025].”

      (3) “CATMAID only supports [bifurcating] skeleton trees” - Correct?

      yes, a node in CATMAID cannot be fused to another node of the same skeleton to represent anastomoses
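
      To illustrate the constraint, here is a minimal sketch assuming a skeleton is stored as a tree in which every node records a single parent. The class and method names are hypothetical and this is not CATMAID's API or data model; it only shows why an anastomosis, which closes a loop, has no place in such a structure.

```python
class TreeSkeleton:
    """Toy skeleton: each node stores exactly one parent (None for the root).

    Hypothetical illustration only; not CATMAID's data model or API.
    """

    def __init__(self, root):
        self.parent = {root: None}

    def add_node(self, node, parent):
        """Attach a new node below an existing one."""
        if parent not in self.parent:
            raise KeyError(f"unknown parent {parent!r}")
        if node in self.parent:
            # The node already has its single parent slot filled (or is the
            # root), so a second link that would close a loop cannot be stored.
            raise ValueError(f"{node!r} is already part of the skeleton")
        self.parent[node] = parent

    def edge_count(self):
        # A tree with n nodes always has n - 1 edges.
        return sum(1 for p in self.parent.values() if p is not None)


skeleton = TreeSkeleton("a")
skeleton.add_node("b", "a")
skeleton.add_node("c", "b")
skeleton.add_node("d", "c")

# An anastomosis would fuse the branch tip "d" back onto "a", closing a loop.
try:
    skeleton.add_node("a", "d")
    loop_stored = True
except ValueError:
    loop_stored = False

print(loop_stored, skeleton.edge_count())
```

      Representing the loop would need a general graph (an extra edge between two nodes of the same skeleton) rather than a parent-pointer tree.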

      FIGURE 1:

      (1) It is not worth redrawing and renumbering everything, but I wish the lateral view in A matched the rotated aboral view in B, instead of having to do two rotations to get the alignment to coincide. (Rotating panel B 90° clockwise would make them match, but then it wouldn’t coincide with all the subsequent figures.)

      Thank you for the suggestion. We have replaced panel A with a lateral view that now matches panel B.

      (2) The labels on Figure 1 are a mix of two typefaces (Helvetica and Myriad?). They should be standardized to all use one typeface (preferably Helvetica).

      we have changed the font to Helvetica

      (3) Panel C legend: arrows are not really arrows. Say “Eye icons” or something like that. Can you show the location of the anal pores in the DIC image?

      Changed to ‘eye icons’. The anal pores are usually closed and open only briefly, so it is not clear exactly where they would be; indicating their position would be misleading.

      (4) Panel F, I cannot see the lines mentioned in the legend at all, except for maybe a tiny wisp in a couple of places. Either omit or make visible.

      changed to “The spheres indicate the position of nuclei in the reconstructed cells.”

      (5) Panel G. “Cells are color coded according to quadrants”… but unfortunately, the color scale is 90° off of what is presented in the rest of the panels and the paper. Q1 and Q3 have been blue, but now Q2+4 are blue/purple, while Q1+3 are orange/yellow. Again, it seems like too much work to recolor panel G, but in future, it would be nice to maintain that consistency, especially since other panels specifically mention the consistent colors.

      We have changed the color code in panels B, C and E to match G and the subsequent panels/figures.

      RESULTS: Aboral synaptic nerve net

      (1) “We reconstructed three aboral nerve-net (ANN) neurons” - out of how many total? Were these three just the first ones traced, or are they likely to be all of the multi-domain neurons? One can’t tell if these are the top 3 (out of X), or if there are other multi-quad neurons that were not traced. Are there any Q1Q4 or Q2Q3 neurons? Specify overall composition.

      There are only three ANN neurons in the aboral organ. These are all completely reconstructed and contained within the volume. We have clarified this in the text. “We identified and reconstructed three aboral nerve-net (ANN) neurons, each exhibiting a syncytial morphology characterized by anastomosing membranes and multiple nuclei (ranging from two to five) (Figure 2A and B, Figure 2–figure supplement 1C). These three neurons are the only fully reconstructed ANN neurons contained within the volume. Several small ANN-like fragments were also observed at the periphery of the aboral organ, but their connectivity to the main ANN remains uncertain.”

      FIGURE 2:

      (1) Panel C: “N > 2 cells for each cell type” - is that supposed to say “N > 2 mitochondria”? More than 2 cells in all the types shown in the graph.

      It is the number of cells for each cell type.

      (2) Panel D: Is this the wrong caption? I can only see green and black circles, not red, yellow, or blue. Make them larger or “flat” (circled, not shaded spheres) if they are supposed to be visible

      Thank you for pointing this out. The caption was incorrect and has been corrected to match the figure.

      (3) Panel E: Amazing to see the cross-network connections!

      Thank you

      (4) Again, it is great to see the three ANN mapped out, but … are there other connections that weren’t mapped in this study? Other high-level coordinating neurons? ANN_Q1Q4 or Q2Q3?

      The reconstruction is complete and there are no other neurons or connections. Given the large size of ctenophore synapses, we are confident that we identified all or most synapses and their connections.

      RESULTS: Synaptic connectome

      (1) “displaying rotational symmetry” - This is one of the things I am most curious about. Where is the evidence of rotational symmetry in the network diagram? Is it the larger number of connections to Q2 and Q4? Any evidence of rotational symmetry, like Q1 and Q3 connect to Q2 and Q4 respectively, but not the other way around?

      changed to “displaying biradial symmetry”, we do not consider the slight difference in synapse number from ANN Q1-4 to the Q1-Q3 vs. Q2-Q4 balancers as significant or strong enough evidence for a single rotational symmetry (i.e. 180 degrees rotation)

      (2) “Surprisingly” - this *was* really surprising. There have to be some afferent neurons connecting from the balancers, don’t there? I can’t remember the connections to the SNN, but is there a tertiary set of ANNs that connect between the balancers and the top 3 ANNs? I would like a little more discussion about this.

      Indeed, this is why this is so surprising. Most people would have expected some output connections from the balancer to the nerve net or elsewhere. There are none. We have the complete balancer network and all balancer cells are ‘sink nodes’ (inputs only) (Figure 3–figure supplement 1).

      We added a short statement at the beginning of the “Bridge Cells as Feedback Regulators of Ciliary Rhythms” section noting that no direct connections from the balancers to the ANN were found and that all balancer cells act as sink nodes (inputs only; Figure 3–figure supplement 1). This highlights that bridge cells are indeed the sole neuronal input to the ANN circuit.

      Figure 3:

      (1) As you know, during development, the diagonally opposite cells have a shared heritage and shared functionality. Are there neuronal signatures that correspond to the rotational symmetry that we see, for example, in the position of the anal pores?

      We did not find any evidence in neuronal complement for a diagonal symmetry, suggesting that neuronal organization does not simply mirror the organism’s rotational body symmetry.

      (2) Do you have the information to say whether there are any diagonal or asymmetric connections? Can’t tell if those would have shown up in the mapping efforts or if you focused on the major ones only.

      Based on our complete mapping, we did not find evidence for a diagonal pattern. The connectivity instead shows a biradial organization.

      (3) “extending across opposite quadrant regions” - to me, opposite would be diagonally opposite, but this looks like a set of cells between Q1 and Q2 is connecting to a sister-set in Q3+Q4. I wonder if, in a more detailed view, you could see whether this is a rotational correspondence, rather than a reflection. There are some subtle hints of this in the aboral view, with some cells on the right of the blue cluster and the left of the magenta cluster.

      changed to “extending across tentacular-axis-symmetric quadrant regions” for clarity

      (4) As with Figure 2, I do not see any circles/spheres that are yellow, red, or blue! There are some traces of what appear to be other neurons that have these colors, but nothing that would suggest the localization of mitochondria.

      Thank you for pointing this out. We have corrected the caption to match the figure, as in the previous item.

      (5) The connectivity map is very cool, but the caption does not seem to correspond to the version included in the manuscript. I don’t see any hexagons; all arrows seem to have the same thickness.

      changed to: “Complete connectivity map of the gravity-sensing neural circuit. Cells belonging to the same group are shown as diamonds, and the number of cells is added to their labels. The number of synapses is shown on the arrows.”

      RESULTS: Dynamics of balancer cilia

      (1) The orientation of the stage+larvae is a bit hard to follow. Maybe say the sagittal or tentacular plane is parallel to the sample stage and the gravity vector?

      we added “Larvae were oriented with their sagittal or tentacular plane parallel to the sample stage.”

      (2) “We could simultaneously image Q1(3) and Q2(4)”. The meaning of the numbers in () is not clear. Whichever way I try to interpret it, it does not match the diagrams. Should this say that when viewing the tentacular plane, you can image Q1 and 4 or Q2 and 3?

      Thank you for spotting this mistake. We have changed the text to: “In larvae with their sagittal plane facing the objective, we could compare balancer-cilia movements between Q1 vs. Q2 or Q3 vs. Q4. In other larvae oriented in the tentacular plane, we could simultaneously image Q1 and Q4 or Q2 and Q3.”

      (3) Typo: episod[e]s were excluded

      Corrected

      DISCUSSION:

      This section is quite clean. Maybe mention some future directions:

      We have added a “Future Directions” section

      (1) Do these networks change during development? Five-days-old is still quite undeveloped - what would it look like in an adult specimen? Would you expect a larger version of the same or more diverse connections?

      As far as we know from work on aboral organs in adult ctenophores, the same structures and cells can be found. We do not know how the network will develop. We know that at 5 days the balancer is fully functional and the animals can orient and their behaviour is coordinated. So the wiring may not change extensively later in development. In the 1-day-old larva, Ferraioli et al. did not distinguish ANN neurons as a separate population, as these were merged with SNNs in their dataset. This suggests that significant cellular and circuit maturation likely occurs between 1 and 5 days.

      METHODS: Imaging the Activity of Balancer Cilia

      (1) “we selected only larvae whose aboral-oral axis was oriented nearly perpendicular to the gravitational vector”. Shouldn’t this be “nearly parallel to the gravity vector” not perpendicular?

      Thank you for spotting this, corrected.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      The authors present exciting new experimental data on the antigenic recognition of 78 H3N2 strains (from the beginning of the 2023 Northern Hemisphere season) against a set of 150 serum samples. The authors compare protection profiles of individual sera and find that the antigenic effect of amino acid substitutions at specific sites depends on the immune class of the sera, differentiating between children and adults. Person-to-person heterogeneity in the measured titers is strong, specifically in the group of children's sera. The authors find that the fraction of sera with low titers correlates with the inferred growth rate using maximum likelihood regression (MLR), a correlation that does not hold for pooled sera. The authors then measure the protection profile of the sera against historical vaccine strains and find that it can be explained by birth cohort for children. Finally, the authors present data comparing pre- and post- vaccination protection profiles for 39 (USA) and 8 (Australia) adults. The data shows a cohort-specific vaccination effect as measured by the average titer increase, and also a virus-specific vaccination effect for the historical vaccine strains. The generated data is shared by the authors and they also note that these methods can be applied to inform the bi-annual vaccine composition meetings, which could be highly valuable.

      We appreciate the reviewer’s clear summary of our work.

      Thanks to the authors for the revised version of the manuscript. A few concerns remain after the revision:

      (1) We appreciate the additional computational analysis the authors have performed on normalizing the titers with the geometric mean titer for each individual, as shown in the new Supplemental Figure 6. We agree with the authors' statement that, after averaging again within specific age groups, "there are no obvious age group-specific patterns." A discussion of this should be added to the revised manuscript, for example in the section "Pooled sera fail to capture the heterogeneity of individual sera," referring to the new Supplemental Figure 6.

      However, we also suggested that after this normalization, patterns might emerge that are not necessarily defined by birth cohort. This possibility remains unexplored and could provide an interesting addition to support potential effects of substitutions at sites 145 and 275/276 in individuals with specific titer profiles, which as stated above do not necessarily follow birth cohort patterns.

      The reviewer is correct that there remains heterogeneity among the serum titers to different strains that we cannot easily explain via age group, and suggests that additional patterns could emerge. We certainly agree that explaining this heterogeneity remains an interesting goal, but as described in the manuscript we have analyzed the possible causes of the heterogeneity as exhaustively as possible given the available metadata. At this point, the most we can say is that the strain-specific neutralization titers are highly heterogeneous in a way that cannot be completely explained by birth cohort. We agree that further analysis of the cause is an area for future work, and have made all of our data available so that others can continue to explore additional hypotheses. It may be that these questions can only be answered by experiments on sera from newer cohorts where more detailed metadata on infection and vaccination history are available.

      (2) Thank you for elaborating further on the method used to estimate growth rates in your reply to the reviewers. To clarify: the reason that we infer from Fig. 5a that A/Massachusetts has a higher fitness than A/Sydney is not because it reaches a higher maximum frequency, but because it seems to have a higher slope. The discrepancy between this plot and the MLR inferred fitness could be clarified by plotting the frequency trajectories on a log-scale.

      For the MLR, we understand that the initial frequency matters in assessing a variant's growth. However, when starting points of two clades differ in time (i.e., in different contexts of competing clades), this affects comparability, particularly between A/Massachusetts and A/Ontario, as well as for other strains. We still think that mentioning these time-dependent effects, which are not captured by the MLR analysis, would be appropriate. To support this, it could be helpful to include the MLR fits as an appendix figure, showing the different starting and/or time points used.

      Multinomial logistic regression is a widely used technique to estimate viral growth rates from sequencing counts (PLoS Computational Biology, 20:e1012443; Nature, 597:703-708; Science, 376:1327-1332). As the reviewer points out, it does assume that the relative viral growth rates are constant over the time period analyzed. However, most of the patterns mentioned by the reviewer are not deviations from this assumption, but rather just due to the fact that frequencies are plotted on a linear scale. More specifically, our multinomial logistic regression implementation defines two parameters per variant: the initial frequency and the growth rate. The absolute variant growth rate is effectively the slope of the logit-transformed variant frequencies. Each variant's relative fitness depends on that variant's growth rate relative to a predefined baseline variant. Plotting frequencies on a logit scale does help emphasize the importance of the slope by showing exponential growth as a linear trajectory. We have added a new Supplemental Figure 9 that plots the frequencies from Figure 5A on a logit scale. As can be seen the frequency trajectories are closer to linear on the logit scale.
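      The relationship between a fixed relative growth rate and a straight line on a log scale can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, intercepts, and growth rates below are hypothetical values chosen only for illustration. Under a multinomial logistic model with one intercept and one growth rate per variant, the log-ratio of a variant's frequency to the baseline's frequency is exactly linear in time, with slope equal to the difference in growth rates. The logit of a single variant's frequency is only approximately linear when more than two variants circulate.

```python
import math


def mlr_frequencies(intercepts, growth_rates, t):
    """Variant frequencies at time t under a multinomial logistic model.

    Each variant i has a score intercepts[i] + growth_rates[i] * t, and
    frequencies are the softmax of those scores.
    """
    scores = [a + r * t for a, r in zip(intercepts, growth_rates)]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def log_ratio_slope(intercepts, growth_rates, variant, baseline, t0, t1):
    """Slope of log(f_variant / f_baseline) between times t0 and t1."""
    f0 = mlr_frequencies(intercepts, growth_rates, t0)
    f1 = mlr_frequencies(intercepts, growth_rates, t1)
    lr0 = math.log(f0[variant] / f0[baseline])
    lr1 = math.log(f1[variant] / f1[baseline])
    return (lr1 - lr0) / (t1 - t0)


# Hypothetical example: variant 0 is the baseline (growth rate 0 by convention).
intercepts = [0.0, -2.0, -4.0]
growth_rates = [0.0, 0.05, 0.12]

# The slope is the same early and late in the period, even though the
# linear-scale frequency curves look very different in the two windows.
early = log_ratio_slope(intercepts, growth_rates, variant=2, baseline=0, t0=0, t1=10)
late = log_ratio_slope(intercepts, growth_rates, variant=2, baseline=0, t0=20, t1=50)
```

      In this sketch, `early` and `late` both recover the relative growth rate of variant 2, which is why a variant that starts at low frequency can have a high inferred fitness without reaching a high maximum frequency within the analysis period.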

      We have updated the results text to clarify the nature of the fixed relative growth rates per strain and to refer to this new supplemental figure as follows:

      To estimate the evolutionary success of different human H3N2 influenza strains during 2023, we used multinomial logistic regression, which uses sequence counts to estimate fixed strain growth rates relative to a baseline strain for the entire analysis time period (in this case, 2023) [50–52]. Relative growth rates estimated by multinomial logistic regression represent relative fitnesses of strains over that time period. There were sufficient sequencing counts to reliably estimate growth rates in 2023 for 12 of the HAs for which we measured titers using our sequencing-based neutralization assay libraries (Figure 5a,b and Supplemental Figure 9). We estimated strain growth rates relative to the baseline strain of A/Massachusetts/18/2022. Note that these growth rates estimate how rapidly each strain grows relative to the baseline strain, rather than the absolute highest frequency reached by each strain. Each strain’s absolute growth rate corresponds to the slope of the strain’s logit-transformed frequencies at the end of the analysis time period (Supplemental Figure 9).

      As the reviewer notes, the multinomial logistic regression implementation assumes a fixed growth rate for each strain over the time period being analyzed. This limitation causes the inferred growth rates to emphasize the latest trends in the analysis time period. For example, at the end of December 2023 in Figure 5A, the A/Ontario/RV00796/2023 strain is growing rapidly and replacing all other variants. Correspondingly, the multinomial logistic regression infers a high growth rate for that Ontario strain relative to the A/Massachusetts/18/2022 baseline strain. At the same time, the A/Massachusetts/18/2022 strain was growing relative to other strains in the first half of 2023 because it has a higher growth rate than they do. There are nonetheless modest deviations from linearity on the logit scale shown in the added supplementary figure, likely because the assumption of a fixed set of relative growth rates over the analyzed time period is an approximation.

      We have added the following text to the discussion to highlight this limitation of the multinomial logistic regression:

      Our comparisons of the neutralization titers to the growth rates of different H3N2 strains were limited by the fact that only a modest number of strains had adequate sequence data to estimate their growth rates. Strains with more sequencing counts tend to be those with moderate-to-high fitness, which therefore limited the dynamic range of growth rates across strains we were able to analyze. Relatedly, the multinomial logistic regression infers a single fixed growth rate per strain for the entire analysis time period of 2023, and cannot represent changes in relative fitness of strains over that relatively short time period. Additionally, because the strains for which we estimated growth rates are phylogenetically related, it is difficult to assess the statistical significance of the correlation [53], so it will be important for future work to reassess the correlations with new neutralization data against the dominant strains in future years.

      (3) Regarding my previous suggestion to test an older vaccine strain than A/Texas/50/2012 to assess whether the observed peak in titer measurements is virus-specific: We understand that the authors want to focus the scope of this paper on the relative fitness of contemporary strains, and that this additional experimental effort would go beyond the main objectives outlined in this manuscript. However, the authors explicitly note that "Adults across age groups also have their highest titers to the oldest vaccine strain tested, consistent with the fact that these adults were first imprinted by exposure to an older strain." This statement gives the impression that imprinting effects increase titers for older strains, whereas this does not seem to be true from their results, but only true for A/Texas. It should be modified accordingly.

      We agree with the reviewer’s suggestion that the specific language describing the potential trend of adults having the highest titers to the oldest strain tested could be further caveated. To this end, we have made the following edits to the portion of the main text that they highlighted:

      Adults across age groups also have their highest titers to the oldest vaccine strain tested (Figure 6), consistent with the fact that these adults were likely first imprinted by exposure to an older strain more antigenically similar to A/Texas/50/2012 (the oldest strain tested here) than more recent strains. Note that a similar trend towards adult sera having higher titers to older vaccine strains was also observed in a more recent study we have performed using the same methodology described here [60].

      Notably, this trend of adults across age groups having the highest titers to the oldest vaccine strains tested has held true in subsequent work we’ve performed with H1N1 viruses (Kikawa et al., 2025 Virus Evolution, DOI: https://doi.org/10.1093/ve/veaf086). In that more recent study, we again saw that adults (cohorts EPIHK, NIID, and UWMC) tended to have their highest titers to the oldest cell-passaged strain tested (A/California/07/2009), whereas children (cohort SCH) had more similar neutralization titers across strains. These additional data therefore support the idea that adults tend to have their highest titers to older vaccine strains, a finding that is also consistent with substantial prior work (e.g., Science, 346:996-1000).

      Reviewer #2 (Public review):

      This is an excellent paper. The ability to measure the immune response to multiple viruses in parallel is a major advancement for the field, that will be relevant across pathogens (assuming the assay can be appropriately adapted). I only had a few comments, focused on maximising the information provided by the sera. These concerns were all addressed in the revised paper.

      We thank this reviewer for the summary of our work and their helpful comments in the first revision.

      Reviewer #3 (Public review):

      The authors use high throughput neutralisation data to explore how different summary statistics for population immune responses relate to strain success, as measured by growth rate during the 2023 season. The question of how serological measurements relate to epidemic growth is an important one, and I thought the authors present a thoughtful analysis tackling this question, with some clear figures. In particular, they found that stratifying the population based on the magnitude of their antibody titres correlates more with strain growth than using measurements derived from pooled serum data. The updated manuscript has a stronger motivation, and there is substantial potential to build on this work in future research.

      Comments on revisions:

      I have no additional recommendations. There are several areas where the work could be further developed, which were not addressed in detail in the responses, but given this is a strong manuscript as it stands, it is fine that these aspects are for consideration only at this point.

      We appreciate this reviewer’s summary of our work, and we are glad they feel the motivation is stronger in the revised manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This valuable study provides insights into the role of Pten mutations in SHH-medulloblastoma, by using mouse models to resolve the effects of heterozygous vs homozygous mutations on proliferation and cell death throughout tumorigenesis. The experiments presented are convincing, with rigorous quantifications and orthogonal experimentation provided throughout, and the models employing sporadic oncogene induction, rather than EGL-wide genetic modifications, represent an advancement in experimental design. However, the study remains incomplete, such that the biological conclusions do not extend greatly from those in the extant literature; this could be addressed with additional experimentation focused on cell cycle kinetic changes at early stages, as well as greater characterization of macrophage phenotypes (e.g., microglia vs circulating monocytes). The work will be of interest to medical biologists studying general cancer mechanisms, as the function of Pten may be similar across tumor types.

      We appreciate the summary of the importance of our work and agree that it provides a foundation for future experiments addressing underlying mechanisms, including the role of macrophages in tumor progression/regression.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This paper investigates how Pten loss influences the development of medulloblastoma using mouse models of Shh-driven MB. Previous studies have shown that Pten heterozygosity can accelerate tumorigenesis in models where the entire GNP compartment has MB-promoting mutations, raising questions about how Pten levels and context interact, especially when cancer-causing mutations are more sporadic. Here, the authors create an allelic series combining sporadic, cell-autonomous induction of SmoM2 with Pten loss in granule neuron progenitors. In their models, Pten heterozygosity does not significantly impact tumor development, whereas complete Pten loss accelerates tumour onset. Notably, Pten-deficient tumours accumulate differentiated cells and show reduced cell death and decreased macrophage infiltration. At early stages, before tumour establishment, they observe EGL hyperplasia and more pre-tumour cells in S phase, leading them to suggest that Pten loss initially drives proliferation but later shifts towards differentiation and accumulation of death-resistant, postmitotic cells. Overall, this is a well-executed and technically elegant study that confirms and extends earlier findings with more refined models. The phenotyping is strong, but the mechanistic insight is limited, especially with respect to dosage effects and macrophage biology.

      Strengths:

      The work is carefully executed, and the models, using sporadic oncogene induction rather than EGL-wide genetic manipulations, represent an advance in experimental design. The deeper phenotyping, including single-cell RNA-seq and target validation, adds rigor.

      Weaknesses:

      The biological conclusions largely confirm findings from previous studies (Castellino et al, 2010; Metcalf et al, 2013), showing that germline or conditional Pten heterozygosity accelerates tumorigenesis, generates tumors with a very similar phenotype, including abundant postmitotic cells, and reduced cell death.

      We respectfully would like to point out that we have added new insights not covered in the previous, more abbreviated studies. First, we are the first to show that in a sporadic model, heterozygous loss of Pten does not lead to accelerated or more aggressive disease. This is an important finding, since this is the case for many patients and only germline PTEN mutant humans are likely to have more aggressive tumors. Also, the previous studies did not examine tumor progression by analyzing neonatal stages, nor did they analyze spinal cord metastasis. We found a different phenotype at some early stages than at end stage, which provides new insights. Our study is also the only one to apply a mosaic analysis to study cell behaviors at early stages of progression, including proliferation and differentiation/survival. We are also the first to demonstrate a reduction in macrophages in Pten mutant SHH-MB.

      The second stated goal - to understand why Pten dosage might matter - remains underdeveloped. The difference between earlier models using EGL-wide SmoA1 or Ptch loss versus sporadic cell-autonomous SmoM2 induction and Pten loss in this study could reflect model-specific effects or non-cell-autonomous contributions from Pten-deficient neighbouring cells in the EGL, for example. However, the study does not explore these possibilities. For instance, examining germline Pten loss in the sporadic SmoM2 context could have provided insight into whether dosage effects are cell-autonomous or dependent on the context.

      We thank the reviewer for suggesting this experiment and agree it would be an informative one for other groups to perform as a follow-up to our work, allowing a direct comparison of mosaic vs germline loss of Pten in the same sporadic SHH-MB model. Also, we would like to point out that we do show a dosage effect of lowering vs removing Pten when only sporadic GCPs also have an activating mutation in SMO. Please see the comments above for additional new mechanistic insight we have provided.

      The observations on macrophages are intriguing but preliminary. The reduction in Iba1+ cells could reflect changes in microglia, barrier-associated macrophages, or infiltrating peripheral macrophages, but these populations are not distinguished. Moreover, the functional relevance of these immune changes for tumor initiation or progression remains unexplored.

      We agree, further studies of the influence of Pten mutations on macrophage phenotypes will be interesting.

      Reviewer #2 (Public review):

      The authors sought to answer several questions about the role of the tumor suppressor PTEN in SHH-medulloblastoma formation. Namely, whether Pten loss increases metastasis, understanding why Pten loss accelerates tumor growth, and the effect of single-copy vs double-copy loss on tumorigenesis. Using an elegant mouse model, the authors found that Pten mutations do not increase metastasis in a SmoD2-driven SHH-medulloblastoma mouse model, based on extensive characterization of the presence of spinal cord metastases. Upon examining the cellular phenotype of Pten-null tumors in the cerebellum, the authors made the interesting and puzzling observation that Pten loss increased the differentiation state of the tumor, with fewer cycling cells, seemingly in contrast to the higher penetrance and decreased latency of tumor growth.

      The authors then examined the rate of cell death in the tumor. Interestingly, Pten-null tumors had fewer dying cells, as assessed by TUNEL. In addition, the tumors expressed differentiation markers NeuN and SyP, which are rare in SHH-MB mouse models. This reduction in dying cells is also evident at earlier stages of tumor growth. By looking shortly after Pten-loss induction, the authors found that Pten loss had an immediate impact on increasing the proliferative state of GCPs, followed by enhancing the survival of differentiated cells. These two pro-tumor features together account for the increased penetrance and decreased latency of the model. While heterozygous loss of Pten also promoted proliferation, it did not protect against cell death.

      Interestingly, loss of Pten alone in GCPs caused an increase in cerebellar size throughout development. The authors suggest that Pten normally constrains GCP proliferation, although they did not check whether reduced cell death is also contributing to cerebellum size.

      Lastly, the authors examined macrophage infiltration and found that there was less macrophage infiltration in the Pten-null tumors. Using scRNA-seq, they suggest that the observed reduction in macrophages might be due to an immunosuppressive tumor microenvironment.

      This mouse model will be of high relevance to the medulloblastoma community, as current models do not reflect the heterogeneity of the disease. In addition, the elegant experimentation into Pten function may be relevant to cancer biologists outside of the medulloblastoma field.

      Strengths:

      The in-depth characterisation of the mouse model is a major strength of the study, including multiple time points and quantifications. The single-cell sequencing adds a nice molecular feature, and this dataset may be relevant to other researchers with specific questions of Pten function.

      Weaknesses:

      One weakness of the study was the examination of the macrophage phenotype, which did not include quantification (only single images), so it is difficult to assess whether this reduction of macrophages holds true across multiple samples. Future studies will also be needed to assess whether Pten-mutated patient medulloblastomas also have a differentiation phenotype, but this is difficult to assess given the low number of samples worldwide.

      We thank the reviewer for highlighting the importance of our sporadic mutant approach and new findings. As stated above, we agree that further studies of the influence of Pten mutations on macrophage phenotypes will be interesting, as will studies of human samples once large numbers can be obtained. All conclusions about macrophages are based on analyzing 3 independent tumors per genotype, which was stated in the Figure legends. For all end-stage tumors, the sections were collected from one lateral edge of the tumor to the midline, and for earlier stages from one side of the brain to the other; thus we believe the reported phenotypes are consistent within tumors and stages.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Minor points 

      (1) The authors should state explicitly that early EGL analyses sample the same cerebellar region across animals (e.g., matched lobule or distance from the midline) because position-dependent effects are possible. 

      We agree this is an important aspect of the rigor of the study and are sorry this was not clear enough. We had stated in the legends to Figures 4 and 5 that midline sections were analyzed and when it was not the entire EGL quantified the region analyzed was shown, but we now include more details in all relevant Figure legends and in the Methods section. 

      (2) It is not clear from Figure 3i-k that TUNEL density in Syp-high regions differs between Pten+/- and Pten-/- tumors. 

      We have added a new graph as Figure 3 Supplemental Figure 1D with this direct comparison. Indeed, there is no difference between the Syp-high regions of Pten+/- and Pten-/- tumors as these regions of Pten+/- tumors have no detectable PTEN protein and thus have the same behavior as Pten-/- tumors (reduced cell death).

      (3) The authors interpret the increase in the %EdU+ GFP+ cells in the EGL as evidence of a faster cell cycle. However, EdU labeling alone does not demonstrate altered cell cycle kinetics; this would require a dedicated assay. It would also be informative to combine EdU with Ki67 staining. This could clarify whether the effect reflects changes in differentiation (for example, if a higher proportion of GFP+ pre-tumor cells remain Ki67+) or whether the increase in EdU simply reflects a greater fraction of cells being in cycle. Such an analysis might even reveal no change in cycling if the proliferation index in controls is lower.

      We are sorry we did not make our analysis sufficiently clear in Figure 5 and Figure 6. The quantification of EdU+ cells was restricted to the outer EGL (region defined by containing GFP+ and EdU+ cells) where all cells should be Ki67+.  We cannot perform co-staining of Ki67 and GFP, since antigen retrieval for Ki67 removes the epitope for our GFP antibody. We have revised the wording in the figure legends and results sections.  

      (4) Some of the stains are unconvincing - for example, Figure 2 E,F, the p27 staining is difficult to distinguish from the background, Figure 7G,E- CD31+ blood vessels are difficult to see. 

      As requested, in Fig. 2A, B, E, F we used Photoshop to adjust the level of the green channel for P27 to reduce the background. In Fig. 7G, H we likewise adjusted the level of the green channel for CD31 to reduce the background.

      (5) Line 158: "unlike a SmoA2 model with germline or broad deletion of Pten in the cerebellum, where heterozygous deletion is sufficient..." That paper refers to the Neuro-D2SmoA1 mouse model. So this statement should be clarified.  

      We have made this edit.

      Reviewer #2 (Recommendations for the authors): 

      (1) I find the final discussion paragraph about Kmt2d does not add much to the study, as it seems obvious that the mechanisms of tumor formation would differ between two different tumor suppressor genes, but this is only my opinion. 

      We respectfully think it is interesting, even if expected, so have left it in the Discussion.

      (2) There is also a typo on line 342 that changes the meaning of the sentence: mTORC1 signaling is significantly 'unregulated'; 

      We thank the reviewer for noticing this mistake. We have changed 'unregulated' to ‘upregulated’.

      (3) Figure 9Q,R mislabeled: not mTORC1, but instead UPR  

      Asns is included in the mTOR pathway in Hallmark MTOR1 signaling as well as in the Unfolded Protein Response gene list. We have made a note of this in the Figure legend.

    1. 7.6.3. Trolling and Nihilism While trolling can be done for many reasons, some trolling communities take on a sort of nihilistic philosophy: it doesn’t matter if something is true or not, it doesn’t matter if people get hurt, the only thing that might matter is if you can provoke a reaction. We can see this nihilism show up in one of the versions of the self-contradictory “Rules of the Internet:” 8. There are no real rules about posting … 20. Nothing is to be taken seriously … 42. Nothing is Sacred Youtuber Innuendo Studios talks about the way arguments are made in a community like 4chan: You can’t know whether they mean what they say, or are only arguing as though they mean what they say. And entire debates may just be a single person stirring the pot [e.g., sockpuppets]. Such a community will naturally attract people who enjoy argument for its own sake, and will naturally trend toward the most extreme version of any opinion. In short, this is the free marketplace of ideas. No code of ethics, no social mores, no accountability. … It’s not that they’re lying, it’s that they just don’t care. […] When they make these kinds of arguments they legitimately do not care whether the words coming out of their mouths are true. If they cared, before they said something is true, they would look it up. The Alt-Right Playbook: The Card Says Moops by Innuendo Studios While there is a nihilistic worldview where nothing matters, we can see how this plays out practically, which is that they tend to protect their group (normally white and male), and tend to be extremely hostile to any other group. They will express extreme misogyny (like we saw in the Rules of the Internet: “Rule 30. There are no girls on the internet. Rule 31. TITS or GTFO - the choice is yours”), and extreme racism (like an invented Nazi My Little Pony character). Is this just hypocritical, or is it ethically wrong? It depends, of course, on what tools we use to evaluate this kind of trolling. 
If the trolls claim to be nihilists about ethics, or indeed if they are egoists, then they would argue that this doesn’t matter and that there’s no normative basis for objecting to the disruption and harm caused by their trolling. But on just about any other ethical approach, there are one or more reasons available for objecting to the disruptions and harm caused by these trolls! If the only way to get a moral pass on this type of trolling is to choose an ethical framework that tells you harming others doesn’t matter, then it looks like this nihilist viewpoint isn’t deployed in good faith. Rather, with any serious (i.e., non-avoidant) moral framework, this type of trolling is ethically wrong for one or more reasons (though how we explain it is wrong depends on the specific framework).

      This section helped me think about trolling in a much more nuanced way, especially the idea that disruption itself isn’t automatically good or bad. I found the discussion about group formation and norm enforcement really useful, because it explains why trolling can feel threatening—it challenges the patterns and signals that groups rely on to define who belongs. The comparison between trolling, protest, and revolution also stood out to me, since it shows how moral judgment often depends on whether we see the existing social order as legitimate. Overall, this section made it clear that evaluating trolling ethically requires looking beyond intent or humor and examining what is being disrupted and who is harmed or protected by that disruption.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      1. General Statements [optional]

      We thank all three Reviewers for appreciating our work and for sharing constructive feedback to further enhance the quality of our study. It is gratifying to read that the Reviewers consider this work interesting, novel and of interest to a broad audience. We therefore believe that it will be suitable for a high-profile journal. Further, the experiments suggested by the Reviewers have added value to the work and have substantiated our findings. We would like to highlight that we have performed all the suggested experiments. Please find below the detailed point-by-point response to the Reviewers’ comments.

      2. Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      The manuscript entitled, "IP3R2 mediated inter-organelle Ca2+ signaling orchestrates melanophagy" is a rather diffuse study of the relationship between IP3R2 and melanin production. While this is an interesting and understudied area, the study lacks a clear focus. The model seems to be that IP3R2 is essential for mitochondrial calcium loading. And that its absence increases lysosomal calcium loading. There are also a number of incomplete and/or unconvincing links to autophagy/melanophagy, TMEM165, TRPML1 and even gene transcription. In this kind of diffuse study, each step needs to be convincing to get to the next one, which is not the case here. There are also references to altered proteasome function, despite the total absence of any direct data on the proteasome. Finally, I felt it was sometimes unclear whether the authors were referring to melanosomes or lysosomes at various points throughout the study.

      While I suspect that, somewhere in here, there are some novel relationships worthy of further investigation, this is a case where the many parts make the overall product less convincing. What effects here are directly relevant to IP3R2? This study should stop there, leaving investigations of peripheral factors for future investigations, as the further you get from where you start, the less clear what you are studying becomes. And the less direct.

      Response: We thank the Reviewer for finding our study interesting and recognizing that this is an understudied area. Further, we appreciate the constructive feedback given by the Reviewer. We have addressed all the Reviewer’s comments. Please find below point-wise responses to the comments.

      Specific Comments:

      Comment 1. The separation of Figures 1F and 1J makes it impossible to assess the effect of αMSH on IP3R2 expression. This presentation makes interpretation difficult; a simple 4 lane Western would be more informative.

      Response: We apologize to the Reviewer for not being clear. We separated these data sets because they come from two independent experimental conditions. Figure 1F shows data from the LD-based pigmentation model, whereas Supplementary Figure 1K (previously Fig 1J) shows data from the α-MSH–induced pigmentation model.

      Comment 2. One of the most attractive points made by this study is that there is a specific link between IP3R2 and melanin production. In my opinion, the null hypothesis is that this is just about the amount of IP3Rs expressed per cell. To reject this concept, the authors should show data demonstrating the relative expression of all 3 IP3Rs. Without this information, the null hypothesis that IP3R2 is the most expressed IP3R isoform and that's why its knockdown has the most dramatic effect cannot be rejected. It would also be helpful to show where the different IP3Rs are expressed within the cell.

      Response: We thank the Reviewer for raising this interesting point and for the constructive comment. We would like to clarify that the relative expression of all three IP₃R isoforms has already been analyzed in our study. Specifically, in Figure 1B, we demonstrate the expression pattern of IP₃R isoforms in our experimental system, where IP₃R2 shows the highest expression level, followed by IP₃R3 and IP₃R1 (IP₃R2 > IP₃R3 > IP₃R1). Further, in the revised manuscript, we additionally analyzed publicly available datasets for IP₃R expression. “The Human Protein Atlas” reports a higher expression of IP₃R2 in melanocytes compared to the other IP₃R isoforms (Supplementary Fig 1A). Therefore, we agree with the Reviewer’s proposed concept that the relatively higher expression of IP₃R2 can be one of the important factors that regulate pigmentation levels. Indeed, our analysis of a microarray dataset from African vs Caucasian skin revealed a greater IP₃R2 expression in African skin compared to Caucasian skin (Figure 1L).

      With respect to subcellular localization, all three IP₃R isoforms are predominantly localized to the endoplasmic reticulum, consistent with their established role as ER-resident Ca²⁺ release channels. However, their expression levels are known to be highly cell and tissue specific (Bartok et al., Nature Communications 2019), supporting the idea that higher IP₃R2 levels play a functionally specialized role in melanogenesis.

      Comment 3. It would be helpful to label Figs 3F-I with the conditions used. The description in the text is of increased LC3II levels, however, the ratio of LC3I to LC3II might be more meaningful. Irrespective, although the graph shows an increase in LC3II, the Western really doesn't show much. As a standalone finding, I don't find this figure to be very convincing; there are better options to demonstrate this proposed relationship between IP3R2 and autophagy than what is shown.

      Response: We sincerely thank the Reviewer for this thoughtful and critical evaluation, which has helped us improve the clarity and precision of this analysis. To address this concern, in the revised manuscript, we have now labeled ‘LD’ in the Supplementary Fig 2A-B (Previously, Fig 4F-I) with the corresponding experimental conditions for clarity. In addition, we reanalyzed the data by calculating the LC3II/LC3I ratio in all the figures of the revised manuscript that include LC3II expression, which provides a more meaningful and robust assessment of autophagic flux. This revised analysis yields a clearer representation of LC3 dynamics and strengthens the interpretation of the western blotting data in support of the relationship between IP₃R2 and autophagy. Further, we have shown by confocal imaging that IP3R2 silencing significantly reduced GFP/RFP ratio of the pMRX-IP-GFP-LC3-RFP reporter system in comparison to control condition in Fig 4M-N to demonstrate the relationship between IP3R2 and autophagy. Collectively, these autophagy flux assays and biochemical experiments clearly demonstrate a direct relationship between IP3R2 and autophagy.

      Comment 4. The following statement at the beginning of page 22 "We observed an impaired proteasomal degradation of critical melanogenic proteins localized on melanosomes in the IP3R2 knockdown condition" is insufficiently supported by data to be made. Even if I was convinced that autophagy was enhanced, there is no data of any kind about the proteasome in this manuscript.

      Response: We appreciate the Reviewer’s careful scrutiny of this statement and the opportunity to clarify and strengthen our interpretation. To directly address the concern regarding proteasomal involvement, in the revised manuscript, we performed additional experiments using MG132, a well-established inhibitor of proteasomal degradation. These experiments were designed to assess whether the altered stability of melanogenic proteins observed upon IP₃R2 knockdown could be attributed to changes in proteasome-mediated turnover.

      In the revised manuscript, our new data show that treatment with MG132 leads to a marked reduction in the levels of melanosome-associated melanogenic proteins, including GP100 and DCT, compared to the DMSO control (Fig. 4A–D). This response contrasts with that of non-melanosomal proteins, such as IP₃R2 and Calnexin, which are localized to the endoplasmic reticulum and exhibit increased accumulation upon MG132 treatment (Fig. 4E–H), consistent with canonical proteasomal inhibition. These differential outcomes suggest that melanosome-resident proteins respond distinctly to proteasomal blockade, likely due to their compartmentalized localization on melanosomes.

      Previous studies have shown that impairment of proteasomal function can activate autophagy as a compensatory, cytoprotective mechanism (Williams et al, 2013; Li et al, 2019; Su & Wang, 2020; Pan et al, 2020). Indeed, we observed a significant increase in LC3II/LC3I levels in IP3R2 knockdown plus MG132 treatment condition in comparison to IP3R2 knockdown plus the DMSO control (Fig. 4I–J).

      To investigate whether impairment of proteasomal degradation upon IP3R2 silencing, alone or together with MG132, selectively triggers melanophagy, we assessed melanophagy using the melanophagy reporter mCherry-Tyrosinase-eGFP following IP3R2 silencing along with MG132 treatment. We observed an increase in melanophagy flux with IP3R2 silencing and MG132 treatment compared to siNT with DMSO control (Fig 5K-L). This suggests that the inhibition of proteasomal degradation induced by IP3R2 silencing activates melanophagy. Taken together, these findings indicate that compromised proteasomal degradation engages the autophagy machinery, providing a mechanistic link between proteasome dysfunction, enhanced autophagy, and altered melanogenic protein turnover.

      Comment 5. In figure 5, the authors create a new ratiometric dye to detect melanosome stability based on the principle that tyrosinase is exclusively found in melanosomes. Unfortunately, there is no validation that this new construct is found exclusively in melanosomes upon expression. In addition, there is discussion about the pH of lysosomes, but not of melanosomes. Ultimately, this data cannot be considered at face value without any type of validation; I also note that the pictures lack sufficient detail to support identification of these structures as melanosomes. While I maintain the above concerns, I note that, the data in supplemental figure 3 is MUCH more convincing than what is in the figure. Both the writing and the figure design should be rethought.

      Response: We appreciate the Reviewer’s thorough evaluation and constructive critique of Figure 5, which has helped us to better clarify and validate this aspect of the study. In the revised manuscript, to directly address the concern regarding the subcellular specificity of the ratiometric probes, we performed detailed colocalization analysis using established melanosome markers. Specifically, we assessed the localization of the melanophagy detection probes mCherry–Tyr–eGFP and tyrosinase–mKeimaN1 with the melanosome-resident protein GP100 detected by anti-HMB45 (Supplementary Fig 2E-F and 2K-L). These analyses revealed a very high degree of colocalization, reflected by strong Pearson’s correlation and overlap coefficients, thereby validating that the expressed probes are predominantly localized to melanosomes.
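
      For readers unfamiliar with the two colocalization metrics named above, the following is a minimal sketch of how Pearson's correlation coefficient and the overlap coefficient are commonly computed from paired pixel intensities in two channels. It is an illustration only: the function names and the idea of passing flat lists of pixel values are assumptions made here, and it does not represent the software or settings used by the authors.

```python
from math import sqrt


def pearson_coefficient(channel_a, channel_b):
    """Pearson's correlation of paired pixel intensities (range -1 to 1)."""
    n = len(channel_a)
    if n == 0 or n != len(channel_b):
        raise ValueError("channels must be non-empty and of equal length")
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    # Covariance-like numerator and the two sums of squared deviations.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a channel with constant intensity has no defined correlation")
    return cov / sqrt(var_a * var_b)


def overlap_coefficient(channel_a, channel_b):
    """Overlap coefficient: like Pearson's, but without subtracting the means."""
    if len(channel_a) == 0 or len(channel_a) != len(channel_b):
        raise ValueError("channels must be non-empty and of equal length")
    numerator = sum(a * b for a, b in zip(channel_a, channel_b))
    denominator = sqrt(sum(a * a for a in channel_a) * sum(b * b for b in channel_b))
    if denominator == 0:
        raise ValueError("both channels must contain non-zero signal")
    return numerator / denominator
```

      Because the overlap coefficient does not subtract the channel means, it stays non-negative for non-negative intensities and is less sensitive to intensity offsets than Pearson's coefficient, which is one reason the two values are often reported together.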

      Regarding Lysosome/Melanosomal pH considerations, our melanophagy detection ratiometric probes: mCherry–Tyrosinase–eGFP (sensitive to acidic pH via eGFP) and tyrosinase mKeimaN1 (sensitive to acidic pH via Keima) are specifically designed to identify melanosome degradation, which happens upon melanosome fusion with lysosome. Consequently, the observed signal shifts indicate melanosome turnover rather than merely reflecting the lysosomal pH.

      To further corroborate the microscopic observations, we performed biochemical assays to study melanophagy flux upon IP3R2 silencing. We employed Bafilomycin A1, an inhibitor of autophagosome-lysosome fusion, to examine melanosomal protein accumulation. Upon Bafilomycin A1 treatment, IP3R2 silenced cells showed enhanced accumulation of melanosomes, as indicated by elevated tyrosinase levels compared with siNT controls (Supplementary Fig 3C-D), indicating elevated melanophagy flux upon IP3R2 knockdown. In the revised manuscript, we employed additional melanophagy detection strategies to further strengthen our findings. Specifically, we used Retagliptin phosphate (RTG), a well-established selective inducer of melanophagy, and observed a marked increase in melanophagy using the mCherry–Tyrosinase–eGFP melanophagy probe (Supplementary Fig 2G-H). Additionally, we performed independent validation by assessing colocalization of the melanosome (recognized by the anti-HMB45 antibody, which identifies the melanosomal structural protein GP100) with LC3 (Supplementary Fig 3A-B). This analysis revealed a significant increase in melanosome colocalization with LC3 upon IP₃R2 silencing compared to control conditions.

      Collectively, these independent approaches clearly demonstrate that the melanophagy probes localize to melanosomes and detect melanophagy (by responding to melanosome fusion to lysosomes).

      Comment 6. Given the increase in ER Ca2+ content after IP3R2 knockdown, ER calcium content should be emptied before attempting to estimate lysosomal Ca2+ content with GPN or Bafilomycin. Otherwise, the source of calcium is less than clear.

      Response: We appreciate the Reviewer’s careful consideration of Ca²⁺ source, which is critical for accurate interpretation of these experiments. Therefore, as suggested, in the revised manuscript, we conducted experiments involving Thapsigargin (Tg) pre-treatment to deplete ER Ca²⁺ reserves before examining lysosomal Ca²⁺ release using GPN or Bafilomycin (Supplementary Fig 6I-N). Even under these conditions, we noted increased lysosomal Ca²⁺ release in IP₃R2 knockdown cells, thus confirming that the observed Ca²⁺ signals originate from lysosomes rather than any remaining ER Ca²⁺. Importantly, this approach allowed us to minimize ER-derived Ca²⁺ contributions to changes in the lysosomal Ca²⁺ release.


      Reviewer #1 (Significance (Required)):

      The manuscript entitled, "IP3R2 mediated inter-organelle Ca2+ signaling orchestrates melanophagy" is a rather diffuse study of the relationship between IP3R2 and melanin production. While this is an interesting and understudied area, the study lacks a clear focus. The model seems to be that IP3R2 is essential for mitochondrial calcium loading. And that its absence increases lysosomal calcium loading. There are also a number of incomplete and/or unconvincing links to autophagy/melanophagy, TMEM165, TRPML1 and even gene transcription. In this kind of diffuse study, each step needs to be convincing to get to the next one, which is not the case here. There are also references to altered proteasome function, despite the total absence of any direct data on the proteasome. Finally, I felt it was sometimes unclear whether the authors were referring to melanosomes or lysosomes at various points throughout the study.

      Response: We thank the Reviewer for finding our work interesting and appreciating that this is an understudied field. Further, we thank him/her for the constructive feedback on our study. We have performed several additional experiments and significantly revised the manuscript to address all the comments of the Reviewer.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      In the present manuscript, Saurav et al. identify IP3R2-mediated ER calcium release as a key suppressor of melanophagy, thereby sustaining pigmentation in melanocytes. Using in vitro (B16 murine melanoma cells, primary human melanocytes) and in vivo (zebrafish) models, the authors report that IP3R2 expression is positively correlated with pigmentation. They then investigate the impact of IP3R2 knockdown and find that IP3R2 silencing enhances the stability of melanogenic proteins, while also inducing autophagic degradation of melanosomes (i.e., melanophagy). Concomitantly, they find that IP3R2 silencing decreases mitochondrial calcium uptake, increases lysosomal calcium loading, and lowers lysosomal pH. They propose a pathway wherein in IP3R2 knockdown cells impaired mitochondrial calcium uptake induces the activation of AMPK-ULK1, and increased lysosomal calcium activates TRPML1 via TMEM165 and closer proximity interactions between ER and lysosomes, TFEB nuclear translocation, and upregulation of melanophagy-related genes, namely OPTN and RCHY1. The work is placed within the context of emerging roles of organelle calcium signaling in pigmentation biology, where extracellular calcium influx pathways are known regulators, but the contribution of ER-mitochondria-lysosome crosstalk to melanosome turnover remains largely unknown.

      Response: We thank the Reviewer for appreciating our work and highlighting that the contribution of ER-mitochondria-lysosome crosstalk to melanosome turnover remains largely unappreciated.

      Major comments:

      Comment 1- The central finding is that IP3R2 knockdown induces melanophagy and reduces pigmentation. However, the manuscript does not identify any physiological or pathological context in which IP3R2 expression or activity is naturally downregulated in melanocytes. Without such context, the knockdown may represent an artificial perturbation that broadly alters ER calcium handling and triggers melanophagy as part of a general stress-induced autophagy response. This raises uncertainty about whether the pathway operates in vivo under normal or disease conditions. It would strengthen the study to identify upstream cues that reduce IP3R2 function and to test whether these also trigger melanophagy through the proposed mechanism.


      Response: We thank the Reviewer for asking such an important question. The Reviewer asked us to identify any physiological or pathological context in which IP3R2 expression is naturally downregulated in melanocytes. To address this question, in the revised manuscript, we analyzed publicly available microarray datasets comparing skin samples from Caucasian and African populations (Yin et al., Experimental Dermatology 2014). This unbiased analysis revealed considerably lower IP₃R2 expression in Caucasian skin as compared to African skin (Fig. 1L). These data support a physiological correlation between IP₃R2 expression and pigmentation level, reinforcing the physiological relevance of the proposed pathway.


      Comment 2- While the data link IP3R2 knockdown to decreased pigmentation and increased melanophagy, the causality between altered organelle calcium dynamics and the melanophagy induction is inferred from correlation and partial rescue experiments. More direct interventions in the proposed downstream pathways (e.g., acute mitochondrial calcium uptake restoration, lysosomal calcium buffering) would strengthen mechanistic claims.

      Response: We appreciate the Reviewer’s recommendation on strengthening the mechanistic causality between organelle Ca²⁺ dynamics and melanophagy. As suggested, in the revised manuscript, we restored acute mitochondrial Ca²⁺ uptake by MCU over-expression in the IP₃R2 knockdown background, which resulted in a marked reduction in melanophagy along with increased mitochondrial Ca²⁺ uptake in comparison to control (Fig 6I-L). These data clearly demonstrate that, downstream of IP₃R2 silencing, mitochondrial Ca²⁺ restoration rescues the melanophagy phenotype, thereby revealing a mechanistic causality between mitochondrial Ca²⁺ dynamics and melanophagy.

      Similarly, to assess the causality between lysosomal Ca²⁺ dynamics and melanophagy, we silenced TMEM165 in the IP₃R2 knockdown background. Excitingly, upon TMEM165 knockdown we observed reduction in melanophagy, concomitant with decrease in lysosomal Ca²⁺ levels under IP₃R2 silencing conditions (Supplementary Fig 7I-L). Together, these direct manipulations support a causal role for altered organelle Ca²⁺ dynamics in driving melanophagy.


      We believe that these experiments address the Reviewer’s concern. However, if there are any other specific experiments that the Reviewer would like us to perform, we would be happy to carry them out as well.

      Comment 3- Zebrafish assays convincingly show altered pigmentation with altered IP3R2 levels, but do not connect this to in vivo melanophagy measurements or TRPML1/TFEB activity, which would link the cell biology to organismal phenotype more directly.

      Response: We thank the Reviewer for appreciating our in vivo zebrafish experiments. Further, we acknowledge the Reviewer’s point about linking the cellular mechanisms to organismal phenotypes in vivo. Therefore, as suggested, we activated TRPML1 in the zebrafish model system. In the revised manuscript, we investigated the role of the TRPML1–TFEB axis in pigmentation in vivo by pharmacological activation of TRPML channels with MLSA1. The MLSA1 treatment resulted in a marked reduction in zebrafish pigmentation compared to vehicle-treated controls (Fig. 8M). This phenotypic change was further substantiated by quantitative melanin content assays, which confirmed a significant decrease in melanin levels following MLSA1 treatment (Fig. 8M–N). These in vivo findings support the involvement of TRPML1-mediated lysosomal signaling in pigmentation regulation.

      Comment 4- The work suggests therapeutic potential for pigmentary disorders, but no disease models are tested. It is unclear whether the observed mechanisms operate under physiological stressors.

      Response: We appreciate the Reviewer’s comment regarding physiological relevance and disease context. As addressed in Comment 1, we examined publicly available human skin microarray datasets for IP₃R2 expression in Caucasian and African populations. This analysis revealed a positive correlation between IP₃R2 expression and human skin pigmentation, supporting the view that modulation of IP₃R2 occurs under physiological conditions rather than representing an artificial perturbation.

      While formal pigmentary disease models were not examined in this study, the observed correlation between IP₃R2 expression and physiological pigmentation differences along with our robust in vivo zebrafish data suggests that IP₃R2 plays an important role in physiological pigmentation. As highlighted by Reviewer 1 and Reviewer 3, the manuscript is already too long. Therefore, we plan to delineate the precise role of IP₃R2 in pigmentary disorders as an independent study.

      Comment 5- The paradox between the observed enhanced stability of melanogenic proteins and increased melanophagy is insufficiently addressed. DCT, Tyrosinase and GP100 are all melanosome-associated and their stability or degradation is in prior literature often interpreted as reflecting melanosome biogenesis and turnover. This discrepancy needs to be resolved, as it complicates interpretation of melanophagy assays.

      Response: We appreciate the Reviewer’s careful consideration of this apparent paradox. This point was also raised by Reviewer 1. We have addressed the query in detail in response to Comment 4 of Reviewer 1. Briefly, the enhanced stability of melanosome-associated proteins reflects impaired proteasomal degradation and prolonged protein half-life, while the concurrent increase in melanophagy represents a compensatory turnover mechanism for degrading such dysfunctional melanosomes.

      Thus, increased melanophagy and apparent stabilization of melanogenic proteins are not contradictory but instead represent parallel outcomes of disrupted proteostasis. This interpretation is supported by our proteasomal inhibition experiments (Fig 4A-H) and autophagy analyses (Fig 4I-P), which collectively reconcile the observed protein stability with enhanced melanosome turnover.


      Comment 6- The authors propose that mitophagy and ER-phagy are reduced in IP3R2 knockdown cells, suggesting specific induction of melanophagy, but the rationale for why increased autophagic flux only targets melanosomes is insufficiently addressed. Also, these conclusions are solely based on Keima assays, and positive controls for mitophagy and ER-phagy are lacking.

      Response: We appreciate the Reviewer’s critical assessment of the specificity of autophagic targeting in the IP₃R2 knockdown condition and the need for appropriate validation controls. In the revised manuscript, we have repeated both the mitophagy and ER-phagy assays with well-established positive controls. Carbonyl cyanide-p-trifluoromethoxyphenylhydrazone (FCCP) was employed as a positive control to robustly induce mitophagy (Supplementary Fig 4E-F), while 4-phenylbutyric acid (4PBA) was used as a positive control for ER-phagy/reticulophagy (Supplementary Fig 4G-H). Secondly, we have validated the microscopy data with biochemical assays by examining levels of ER (Fig 4E-H) and mitochondria resident protein MCU.

      To provide a mechanistic rationale for the specific induction of melanophagy, we examined recently identified regulators of melanophagy, RCHY1 and OPTN (Lee et al., PNAS 2024). Bioinformatic analysis identified multiple TFEB binding sites on the promoters of both genes, which was supported by increased RCHY1 and OPTN expression following IP₃R2 knockdown. Further, in the revised manuscript, we performed additional loss-of-function experiments to demonstrate that co-silencing IP3R2 along with RCHY1 or OPTN significantly reduced melanophagy flux compared to IP₃R2 knockdown alone (Fig. 9H–K). Taken together, these data explain why enhanced autophagic flux downstream of IP₃R2 silencing is preferentially directed toward melanosomes.

      Comment 7- The melanophagy probes are novel and validated with rapamycin/bafilomycin, but quantitative calibration of GFP/mCherry or Keima signal to actual lysosomal delivery rates is missing; photobleaching, pH heterogeneity (incl., observed decrease in lysosomal pH), and melanin autofluorescence (see below) could confound ratios. Also, side-by-side comparison with other melanophagy detection approaches (e.g., colocalization of melanosomes with LC3) is lacking.

      Response: We appreciate the Reviewer’s careful evaluation of the melanophagy probes and the potential technical confounders. In the revised manuscript, we have performed a variety of experiments to further characterize and validate the probes. First of all, the melanophagy detection ratiometric probes (mCherry–Tyrosinase–eGFP and tyrosinase mKeimaN1) are built on well-established and extensively validated backbones. Further, we used appropriate controls (empty vectors/non-targeting siRNAs/vehicle controls) in all experiments to analyze the relative fluorescence changes in the test condition vs. control. Any confounding factors should be present in both test and control conditions. Therefore, we initially did not perform a side-by-side comparison with other melanophagy detection approaches.

      In the revised manuscript, as suggested by the Reviewer, we employed additional melanophagy detection strategies to further strengthen our findings. Specifically, we used Retagliptin phosphate (RTG), a well-established selective inducer of melanophagy, and observed a marked increase in melanophagy using the mCherry–Tyrosinase–eGFP melanophagy probe (Supplementary Fig 2G-H). Additionally, we performed independent validation by assessing colocalization of melanosomes (recognized by the anti-HMB45 antibody, which identifies the melanosomal structural protein GP100) with LC3 (Supplementary Fig 3A-B). This analysis revealed a significant increase in melanosome colocalization with LC3 upon IP₃R2 silencing compared to control conditions. Further, to minimize the contribution of melanin autofluorescence, non-transfected cells were imaged under identical settings, and the background signals obtained from these cells were subtracted from all acquired images during fluorescence quantitation. Potential effects of photobleaching and pH heterogeneity were minimized by uniform acquisition parameters and ratiometric analysis. Taken together, we believe these complementary approaches address the Reviewer’s concerns and reinforce the robustness of our melanophagy measurements.

      Comment 8- Melanosomes exhibit broad autofluorescence, particularly upon excitation at 405-488 nm and extending into the red channel. This signal can overlap with the detection ranges for GFP, mCherry, and mKeima reporters, potentially confounding quantitative readouts unless appropriate controls (e.g., untransfected cells, spectral unmixing) are used. Throughout this manuscript, it is not addressed how melanosome autofluorescence was controlled for or excluded in the reported fluorescence measurements.

      Response: We apologize to the Reviewer for not clearly stating how melanosome autofluorescence was controlled. Autofluorescence was systematically evaluated using non-transfected control cells imaged under the same excitation and emission settings used for the GFP, mCherry, and mKeima reporters. These controls allowed us to define the baseline autofluorescence profile arising from melanosomes across the relevant spectral ranges, and these background signals were subtracted from the acquired images during quantitation. These details are included in the methods section.
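
      The background-subtraction step described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' quantitation pipeline: the function name, the use of per-cell mean intensities, and the choice of a red-to-green ratio as the readout are all hypothetical.

```python
from statistics import mean


def background_corrected_ratio(red, green, red_bg, green_bg):
    """Mean red/green ratio after subtracting an autofluorescence baseline.

    red, green: per-cell mean intensities from transfected cells.
    red_bg, green_bg: per-cell mean intensities from non-transfected cells
    imaged under identical settings (hypothetical inputs for illustration).
    """
    red_baseline = mean(red_bg)
    green_baseline = mean(green_bg)
    ratios = []
    for r, g in zip(red, green):
        r_corr = max(r - red_baseline, 0.0)  # clip negative values to zero
        g_corr = g - green_baseline
        if g_corr <= 0:
            continue  # skip cells at or below background in the green channel
        ratios.append(r_corr / g_corr)
    if not ratios:
        raise ValueError("no cells above background in the green channel")
    return mean(ratios)
```

      Because the same baseline is subtracted from test and control images, a residual autofluorescence offset would bias both conditions in the same direction rather than create a difference between them.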

      Comment 9- While OPTN and RCHY1 expression is elevated upon IP3R2 knockdown, functional engagement (e.g., OPTN localization to melanosomes, melanosome ubiquitination by RCHY1), or necessity (e.g., siRNA knockdown of these in the IP3R2-deficient background), are not tested.

      Response: We appreciate the Reviewer’s point on establishing the necessity of OPTN and RCHY1 in IP₃R2 knockdown–induced melanophagy. In the revised manuscript, we performed targeted loss-of-function analyses for both OPTN and RCHY1 in the IP₃R2-deficient background. We assessed melanophagy using the mCherry–Tyrosinase–eGFP melanophagy probe following co-silencing of IP₃R2 with either OPTN or RCHY1. Quantitative analysis revealed a significant reduction in melanophagy flux upon co-silencing of either gene compared to IP₃R2 silencing alone (Fig. 9H–K). These findings establish the functional requirement of OPTN and RCHY1 downstream of IP₃R2 loss to drive melanophagy. Since functional engagement of OPTN and RCHY1 on melanosomes is already well established (Lee et al. PNAS 2024 and Park et al. Autophagy 2024), we have not repeated these experiments. Taken together, our data demonstrate that OPTN and RCHY1 are not only overexpressed but also act as critical mediators of melanophagy downstream of IP₃R2 silencing.

      Comment 10- While siRNA/shRNA efficacy is shown, functional rescue with pore-dead mutants sometimes fails to return to control values. The possibility of partial off-target or compensatory effects is not fully excluded.

      Response: We thank the Reviewer for raising this point. In this study, we employed pore-dead mutants of IP₃R2 (IP₃R2-M) and TRPML1 (TRPML1-M), both of which are well characterized, widely validated, and extensively used by a number of leading groups in the field. In the literature, we found multiple studies that reported a partial rescue effect with these pore-dead mutants. Therefore, we believe it is not surprising that we also observe partial rescue in some of our assays.

      It is important to note that we observe rescue of the function and phenotype in every experiment carried out with the mutants. We agree with the Reviewer that the extent of rescue does not reach control levels in a few experiments. This can be attributed to differences in the extent of mutant expression across experiments. However, we have validated the results with multiple independent approaches. Collectively, the use of multiple independent approaches, along with genetic silencing and pharmacological inhibition/activation, supports the specificity of the observed phenotypes.

      Comment 11- The mitochondrial and lysosomal calcium measurements are largely endpoint peak quantifications; kinetic analyses and buffering capacity measurements would provide more mechanistic depth, especially for the TMEM165 contribution. Also, TMEM165 necessity for melanophagy induction upon IP3R2 knockdown has not been directly addressed.

      Response: We appreciate the Reviewer’s request for greater mechanistic depth regarding organelle Ca²⁺ dynamics and the specific contribution of TMEM165. Consistent with this, we had previously demonstrated that TMEM165 silencing decreases lysosomal Ca²⁺ levels using Oregon BAPTA–dextran–based measurements (Supplementary Fig 7C-D), establishing its role in regulating lysosomal Ca²⁺ buffering. Building on this, in the revised manuscript, we performed kinetic analyses of lysosomal Ca²⁺ levels following IP₃R2 and TMEM165 silencing. These kinetic analyses validated our endpoint measurements: IP₃R2 knockdown leads to an increase in lysosomal Ca²⁺ levels, whereas TMEM165 silencing results in a decrease in lysosomal Ca²⁺ content in comparison to control. These results highlight the distinct and opposing effects of IP₃R2 and TMEM165 on lysosomal Ca²⁺ kinetics.

      Further, we directly evaluated the necessity of TMEM165 for melanophagy induction in the IP₃R2-deficient background. TMEM165 knockdown alone resulted in a significant reduction in melanophagy (Supplementary Fig 7G-H). Further, co-silencing of TMEM165 with IP₃R2 also attenuated melanophagy compared to IP₃R2 knockdown alone (Supplementary Fig 7K-L). Collectively, these kinetic Ca²⁺ assays and genetic loss-of-function analyses provide mechanistic depth to the organelle Ca²⁺ measurements and establish TMEM165 as a critical regulator of melanophagy downstream of IP₃R2 silencing.

      Comment 12- The proximity ligation assay between VAP-A and LAMP1 is interpreted as showing increased ER-lysosome contacts in IP3R2 knockdown cells. However, additional controls are needed and quantitative TEM should be included to substantiate changes in organelle contact frequency and distance.

      Response: We thank the Reviewer for his/her emphasis on strengthening the validation of the proximity ligation assay (PLA) findings and on providing ultrastructural evidence to support altered organelle interactions. The PLA data revealed a significant increase in VAP-A–LAMP1 interaction signals in IP₃R2-silenced cells compared to control conditions (Fig. 7L–M). In the revised manuscript, this increase was not observed upon treatment with bafilomycin A1, a specific inhibitor of lysosomal acidification, or when one of the primary antibodies was omitted, confirming the specificity of the PLA signal (Fig. 7L–M). These controls support the interpretation that IP₃R2 downregulation enhances ER–lysosome interactions.

      To further substantiate the changes in organelle contact frequency and distance, we performed ultrastructural analyses using transmission electron microscopy (TEM). The quantitative TEM measurements revealed no significant change in the frequency of ER–mitochondria or ER–lysosome contacts upon IP₃R2 silencing (Fig. 7N–P). Similarly, ER–mitochondria distances remained unchanged. However, we observed a significant reduction in the distance between the ER and lysosomes in IP₃R2 knockdown cells compared to control (Fig. 7N, 7Q–R). Together, these complementary approaches demonstrate that IP₃R2 silencing specifically increases ER–lysosome proximity without altering overall contact frequency, thereby strengthening the conclusion that IP₃R2 regulates ER–lysosome coupling.

      Comment 13- Some assays report small biological n (e.g., three independent experiments with relatively small per-condition cell counts).

      Response: We appreciate the Reviewer’s comment regarding sample size. All experiments were performed with a minimum of three independent biological replicates, which is consistent with standard practice in the field. For imaging-based assays, multiple fields of view and cells were analyzed per condition in each independent experiment, and quantitative analyses were performed on pooled data across replicates. As suggested by the Reviewer, we have increased the cell numbers in some experiments. The detailed information on biological replicates and cell numbers analyzed is provided in the respective figure legends.

      Minor comments:

      Comment 1- The title "IP3R2-mediated inter-organelle Ca2+ signaling orchestrates melanophagy" could be misread as indicating IP3R2 'promotes' melanophagy; consider rewording to make clear that IP3R2 suppresses melanophagy to maintain pigmentation. Similarly, the running title "IP3R2 negatively regulates melanophagy" would be clearer as "IP3R2 suppresses melanophagy".

      Response: As suggested by the Reviewer, we have modified the title and running title in the revised manuscript.

      Comment 2- Unify the framing of "positively regulates pigmentation" vs. "negatively regulates melanophagy" in the Introduction/Discussion.

      Response: As recommended, we have unified the framing in the suggested sections.

      Comment 3- Adding schematic flow diagrams summarizing each pathway at the end of relevant results (figure) sections could help accessibility.

      Response: We appreciate the Reviewer’s suggestion to improve accessibility of the presented pathways. Accordingly, we have included schematic diagrams at the end of the relevant figures. These schematics summarize: (i) ER–mitochondria interactions in the context of melanophagy (Fig. 6P); (ii) differences in Ca²⁺ and pH regulation between wild-type and IP₃R2-silenced cells (Fig. 7S); and (iii) TRPML1-mediated Ca²⁺ release driving melanophagy via TFEB translocation (Fig. 9L). Together, these diagrams provide a concise visual overview of the key mechanistic pathways described in the study.

      Comment 4- While the introduction summarizes extracellular calcium signaling in pigmentation, there is less coverage of recent work on selective autophagy of other lysosome-related organelles (e.g., platelet dense granules, lytic granules), which could provide broader mechanistic context.

      Response: As suggested by the Reviewer, we have discussed selective autophagy of other lysosome-related organelles in the introduction.

      Reviewer #2 (Significance (Required)):

      This study addresses an important gap in pigmentation biology by identifying IP3R2-mediated ER calcium release as a suppressor of melanophagy and a positive regulator of pigmentation. The strongest aspects are the integration of in vitro and in vivo models, the multi-faceted mechanistic exploration linking altered organelle calcium dynamics to selective melanosome turnover, and the development of novel ratiometric fluorescent probes for live-cell melanophagy measurement. Conceptually, the work extends prior literature that has focused on extracellular calcium influx and melanosome biogenesis, revealing a new inter-organelle calcium signaling module that controls melanosome degradation via AMPK-ULK1 and TMEM165-TRPML1-TFEB pathways.

      However, several limitations reduce the strength of the mechanistic claims. Some key pathway steps are inferred from correlation and partial rescue rather than direct necessity/sufficiency tests (e.g., mitochondrial calcium uptake restoration, lysosomal calcium buffering). The paradoxical observation that IP3R2 knockdown both increases melanophagy and stabilizes melanosome-resident protein (DCT, Tyrosinase, GP100) is not resolved, complicating interpretation of the melanophagy assays. The specificity for melanophagy over other selective autophagy pathways is asserted but not fully explained mechanistically, and positive controls for mitophagy/ER-phagy are missing. Potential technical confounds, such as melanin autofluorescence in the detection ranges of GFP, mCherry, and mKeima, are not explicitly addressed and alternative assays for these key data were insufficiently employed. In vivo results do not yet connect altered pigmentation to melanophagy readouts or downstream TRPML1/TFEB activation. Importantly, the study does not identify any physiological or pathological scenario in which IP3R2 expression or activity is naturally reduced in melanocytes. In the absence of such upstream cues, IP3R2 knockdown may represent an artificial perturbation that triggers melanophagy as part of a broader stress-induced autophagy response, raising questions about the in vivo relevance of the proposed pathway.

      The work's primary audience is specialized (cell biologists, autophagy researchers, and pigmentation/skin biology specialists), but the mechanistic framework on organelle crosstalk and selective autophagy will interest a broader basic research readership, including those studying lysosome-related organelles in other systems. The ratiometric probes could be adapted for future melanophagy research, and the pathway insights may guide translational studies in pigmentary disorders or melanoma. My expertise is in mitochondrial and lysosomal calcium signaling, autophagy, and microscopy-based functional assays; I do not have detailed expertise in zebrafish developmental genetics, though the phenotypic analysis appears sound.

      Response: We thank the Reviewer for appreciating our work and stating that our study “addresses an important gap in pigmentation biology”. Further, we thank him/her for believing that this work will be of interest to a broad basic research readership. Moreover, we thank him/her for valuing the importance and potential significance of the ratiometric melanophagy probes generated in this study. Finally, we acknowledge the Reviewer’s constructive feedback on our study, which has helped us enhance the quality of our manuscript. We have performed a variety of additional in vitro experiments and in vivo zebrafish studies, and have significantly revised the manuscript to address all the comments of the Reviewer.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      This is a robust and extensive study showing that IP3R2 selectively initiates a calcium signalling pathway leading to melanophagy, that is the degradation of melanosomes. This reduces pigmentation and UV light protection. A strength of the paper is that it combines detailed cellular studies with in vivo studies in the zebrafish model. They show that knockdown of IP3R2 reverses this process perhaps leading to a strategy to enhance melanosome number and hence to afford protection from UV irradiation. The authors use a battery of fluorescent probes (mainly genetically encoded reporters) to investigate the signalling cascade leading to melanophagy or its reduction. This involves reporters for a number of different organelles involved in this process. The experiments are generally well performed with clear controls for the probes in many cases. My main issue is the panels contain too much data which may obscure the message, and a good deal could be moved to supplementary data. The manuscript investigates many mechanisms in distinct organelles which is remarkable for a two author paper. Particularly interesting was the design of novel fluorescent protein reporters for melanophagy itself. One area not explored is ion fluxes across melanosomes themselves which are lysosome-related organelles and may exhibit similar properties and signalsomes of lysosomes.

      Specifically, the authors show that a REDUCTION of IP3R2-mediated calcium release leads to a calcium flux from the ER by a different mechanism (possibly via TMBIM6). This increases calcium loading of the lysosome via TMEM165, at the expense of calcium transfer to mitochondria, and an acidification.

      This leads to TRPML1 activation, and the lysosomal calcium release activates TFEB translocation to the nucleus, which increases the transcription of autophagy/melanophagy genes and activation of the AMPK-ULK1 pathway (rather than mTOR). This is a complex pathway and evidence is presented for many of the steps involved.

      This is a tour de force investigating organelle communication during the process of melanophagy, that is little understood. It highlights many important organelle ion transport events that are important findings in their own right. For example, the importance of TMEM165 in calcium filling of lysosomes.

      Response: We thank the Reviewer for appreciating our study and for considering it a robust and extensive study in a highly understudied area. We appreciate the Reviewer’s acknowledgement that our manuscript combines detailed cellular studies with in vivo studies in the zebrafish model. Further, we thank the Reviewer for his/her constructive feedback on our work.

      Major points:

      Comment 1- The authors state that TPC activation does not activate TFEB translocation to the nucleus. This is now not the case and should be at least looked at. What is the role of endolysosomal channels on the melanosomes themselves in melanophagy?

      Response: We appreciate the Reviewer’s comment regarding the potential contribution of TPC channels to TFEB activation and melanophagy. In the revised manuscript, we assessed Ca²⁺ release from TPC2 under IP₃R2 knockdown conditions using the selective TPC2 agonist TPC2-A1-N (Supplementary Fig 9G-H). Additionally, we evaluated TFEB nuclear translocation following TPC2-mediated Ca²⁺ release using TPC2-A1-N (Supplementary Fig 9I-J). Our analyses revealed no significant differences in TPC2 activity or TFEB nuclear translocation upon IP₃R2 silencing compared to control conditions. These findings suggest that, in our system, TPC2-mediated Ca²⁺ signaling does not contribute significantly to TFEB activation or melanophagy downstream of IP₃R2 silencing, indicating a more prominent role for TRPML1-dependent Ca²⁺ signaling in this context.

      Comment 2- How does reduction in IP3R2 mediated calcium fluxes enhance lysosomal acidity?

      Response: We thank the Reviewer for the question regarding the mechanistic link between reduced IP₃R2-mediated Ca²⁺ flux and enhanced lysosomal acidity. In the revised manuscript, we show that IP₃R2 silencing results in a significant upregulation of the lysosomal proton pump H⁺-ATPase subunits ATP6V0D1 and ATP6V1H (Supplementary Fig 6E-F). Increased H⁺-ATPase expression is expected to promote proton influx into the lysosomal lumen, thereby enhancing lysosomal acidification. These findings provide a mechanistic basis for how IP₃R2 silencing can drive increased lysosomal acidity.

      Comment 3- What mediates the ER source for calcium filling of lysosomes?

      Response: We appreciate the Reviewer’s interest in the mechanism underlying ER-to-lysosome Ca²⁺ transfer. Recently, an independent study also reported that IP₃R2 silencing enhances lysosomal Ca²⁺ levels and lysosomal Ca²⁺ release (Zheng et al. Cell 2022). The literature suggests that lysosomal Ca²⁺ refilling depends on Ca²⁺ fluxes originating from the endoplasmic reticulum, particularly through ER Ca²⁺ leak pathways at ER–lysosome contact sites. In this context, ER-resident Ca²⁺ leak channels such as TMBIM6 (also known as Bax inhibitor-1) play an important role in maintaining basal cytosolic Ca²⁺ levels that can be subsequently taken up by lysosomes (Kim et al. Autophagy 2020). TMBIM6-mediated Ca²⁺ leak from the ER provides a continuous, low-level Ca²⁺ source that supports lysosomal Ca²⁺ loading (Kim et al. Autophagy 2020). This mechanism allows lysosomes to replenish their Ca²⁺ stores via Ca²⁺ uptake systems operating at ER–lysosome contact sites. Thus, ER Ca²⁺ leak channels represent a key conduit linking ER Ca²⁺ homeostasis to lysosomal Ca²⁺ filling and function.

      Recently, lysosome-localized TMEM165 was shown to play an important role in Ca²⁺ filling of lysosomes (Zajac et al. Science Advances 2024). In our study, we observe that TMEM165 drives lysosomal Ca²⁺ influx in melanocytes.

      Comment 4- Oregon-green-dextran is not a great probe for lysosomal calcium. Its Kd is 170nM and even in the acidic environment this may be lowered to low micromolar which may not be great for measuring changes around luminal concentrations of around 500uM. Additionally, it is usual to correct for pH effects simultaneously since the dye is also a pH reporter and has been used as such. However, I take the point that they still see an increase in fluorescence whilst pH falls probably indicating an increase in luminal lysosomal calcium confirmed by increased perilysosomal calcium.

      Response: We thank the Reviewer for the careful and balanced assessment of the Oregon Green–dextran measurements. We appreciate the acknowledgment that, despite the known limitations of this probe and its pH sensitivity, the observed increase in fluorescence concurrent with reduced lysosomal pH is consistent with elevated luminal lysosomal Ca²⁺ levels. We are grateful for this positive interpretation, which strengthens our conclusions when considered alongside the large amount of supporting data.
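
      The saturation concern raised in Comment 4 can be illustrated with the standard single-site binding relation, occupancy = [Ca²⁺] / ([Ca²⁺] + Kd). The sketch below is illustrative only: the 170 nM Kd and the roughly 500 µM luminal concentration come from the Reviewer's comment, whereas the 5 µM value is a hypothetical stand-in for a "low micromolar" Kd in an acidic lumen, and the function name is an assumption.

```python
def fractional_occupancy(ca_um, kd_um):
    """Fraction of dye bound to Ca2+ for simple 1:1 binding.

    ca_um: free Ca2+ concentration in micromolar.
    kd_um: dissociation constant in micromolar.
    """
    return ca_um / (ca_um + kd_um)


# Kd of 170 nM (0.17 uM), as quoted in the comment, at ~500 uM luminal Ca2+
occupancy_neutral = fractional_occupancy(500.0, 0.17)

# Hypothetical "low micromolar" Kd of 5 uM in the acidic lysosomal lumen
occupancy_acidic = fractional_occupancy(500.0, 5.0)
```

      In both cases the dye would be more than 99% occupied under this simple model, which is why changes around 500 µM would be expected to produce only small changes in fluorescence.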

      Comment 5- The major point is to reduce the number of main data panels with consignment of some controls perhaps to supplementary. This would increase the comprehensibility of the paper.

      Response: We thank the Reviewer for this constructive and positive suggestion. We appreciate the emphasis on reducing the data in the main figures. As suggested, we have moved a considerable amount of data to the supplementary figures. However, due to the additional experiments performed to address the concerns of the other Reviewers, the main data panels may still look a little busy. We sincerely hope that the Reviewer will understand our situation.

      Minor points

      Comment 1- Fig 10 needs a clear legend with symbols in the diagram explained. eg ER calcium release proteins.

      Response: We thank the Reviewer for this helpful and constructive comment. Accordingly, we have revised the Figure 10 legend to clearly explain all symbols used in the schematic illustration.

      Reviewer #3 (Significance (Required)):

      This is a tour de force investigating organelle communication during the process of melanophagy, that is little understood. It highlights many important organelle ion transport events that are important findings in their own right. For example, the importance of TMEM165 in calcium filling of lysosomes.

      Response: We sincerely thank the Reviewer for considering our work as “a tour de force investigation” and appreciating that our study presents several important organelle ion transport events.

    1. Author response:

      eLife Assessment 

      This study presents a valuable finding on maternal SETDB1 as a key chromatin repressor that shuts down the 2C gene program and enables normal mouse embryonic development. The evidence supporting the claims of the authors is solid, although the inclusion of a causality test, a mechanistic understanding of SETDB1 targeting, and phenotypic quantification would have greatly strengthened the study. The work will be of broad interest to biologists working on embryonic development, stem cells and gene regulation.

      Thank you for this positive evaluation of our work. Please find the point-by-point responses to the Reviewers’ comments below.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary: 

      During the earliest stages of mouse development, the zygote and 2-cell (2C) embryo are totipotent, capable of generating all embryonic and extra-embryonic lineages, and they transiently express a distinctive set of "2C-stage" genes, many driven by MERVL long terminal repeat (LTR) promoters. Although activation of these transcripts is a normal feature of totipotency, they must be rapidly silenced as development proceeds to the 4-cell and 8-cell stages; failure to shut down the 2C program results in developmental arrest. This study examines the role of maternal SETDB1, a histone H3K9 methyltransferase, in suppressing the 2C transcriptional network. Using an oocyte-specific conditional knockout that removes maternal Setdb1 while leaving the paternal allele intact, the authors demonstrate that embryos lacking maternal SETDB1 arrest during cleavage, with very few progressing beyond the 8-cell stage and no morphologically normal blastocysts forming. Transcriptomic analyses reveal persistent expression of MERVL-LTR-driven transcripts and other totipotency markers, indicating a failure to terminate the totipotent state. Together, the data demonstrate that maternally deposited SETDB1 is required to silence the MERVL-driven 2C program and enable the transition from totipotency to pluripotency. More broadly, the work identifies maternal SETDB1 as a key chromatin repressor that deposits repressive H3K9 methylation to shut down the transient 2C gene network and to permit normal preimplantation development. 

      Strengths: 

      (1) Closes a key knowledge gap. 

      The study tackles a central open question - how embryos exit the totipotent 2-cell (2C) state - and provides direct in vivo evidence that epigenetic repression is required to terminate the 2C program for development to proceed. By identifying maternal SETDB1 as the responsible factor, the work substantially advances our understanding of the maternal-to-zygotic transition and early lineage specification. 

      (2) Clean genetics paired with rigorous genomics. 

      An oocyte-specific Setdb1 knockout cleanly isolates a maternal-effect requirement, ensuring that early phenotypes arise from loss of maternal protein. The resulting cleavage-stage arrest is unambiguous (most embryos stall before or around the 8-cell stage). State-of-the-art single-embryo RNA-seq across stages - well-matched to low-cell-number constraints - captures genome-wide mis-expression, including persistent 2C transcripts in mutants, strongly supporting the conclusions. 

      (3) Compelling molecular linkage to phenotype. 

      Transcriptome data show that without maternal SETDB1, embryos fail to repress a suite of 1-cell/2C-specific genes by the 8-cell stage. The tight correlation between continued activation of the MERVL-driven totipotency network and developmental arrest provides a specific molecular explanation for the observed failure to progress. 

      (4) Mechanistic insight grounded in chromatin biology. 

      SETDB1, a H3K9 methyltransferase classically linked to heterochromatin and transposon repression, targets MERVL LTRs and MERVL-driven chimeric transcripts in early embryos. Bioinformatic evidence indicates that these loci normally acquire H3K9me3 during the 2C→4C transition. The data articulate a coherent mechanism: maternal SETDB1 deposits repressive H3K9me3 at 2C gene loci to shut down the totipotency network, extending observations from ESC systems to bona fide embryos. 

      (5) Broad implications for development and stem-cell biology. 

      By pinpointing a maternal gatekeeper of the totipotent-to-pluripotent transition, the work suggests that some cases of cleavage-stage arrest (e.g., in IVF) may reflect faulty epigenetic silencing of transposon-driven genes. It also informs stem-cell efforts to control totipotent-like states in vitro (e.g., 2C-like cells), linking epigenetic reprogramming, transposable-element regulation, and developmental potency.

      We thank Reviewer 1 for recognizing the strengths in our work and for the suggestions below.

      Weaknesses: 

      (1) Causality not directly demonstrated. 

      The link among loss of SETDB1, persistence of 2C transcripts, and developmental arrest is compelling but remains correlative. No rescue experiments test whether dampening the 2C/MERVL program restores development. Targeted interventions-e.g., knocking down key 2C drivers (such as Dux) or pharmacologically curbing MERVL-linked transcription in maternal Setdb1 mutants-would strengthen the claim that unchecked 2C activity is causal rather than a by-product of other SETDB1 functions.

      We agree that rescue experiments might strengthen causality. Those experiments, however, would be extremely challenging technically because the knockdowns would need to be precisely timed to follow (and not prevent) the wave of 2c-specific activation. Knocking down 2c drivers in the zygote, for example, may prevent switching on the totipotency program. In addition, while sustained MERVL expression—such as that induced by forced DUX expression—disrupts totipotency exit and embryo development (1, 2), derepression of transcription is very broad in Setdb1<sup>mat-/+</sup> embryos and knocking down individual 2C drivers may not be sufficient to rescue development or restore the exit from totipotency.

      (2) Limited mechanistic resolution of SETDB1 targeting. 

      The study establishes a requirement for maternal SETDB1 but does not define how it is recruited to MERVL loci. Given SETDB1's canonical cooperation with TRIM28/KAP1 and KRAB-ZNFs, upstream sequence-specific factors and/or pre-existing chromatin features likely guide targeting. Direct occupancy and mark-placement evidence (e.g., SETDB1/TRIM28 CUT&RUN or ChIP, and H3K9me3 profiling at MERVL LTRs during the 2C→4C window) would convert inferred mechanisms into demonstrated ones.

      We do show H3K9me3 patterns at MERVL LTRs during the early2c-late2c-2c-4c-8c-morula window from a published dataset. Please see the genome browser images in Figures 4C, 4D, 4E, 6D, 6E and Figure S6. We agree that mapping of SETDB1/TRIM28 to those locations would strengthen the mechanistic insight. However, ChIP-seq or CUT&RUN for those proteins in preimplantation embryos is not technically feasible. We do provide genetic evidence for the collaboration between SETDB1 and DUXBL, a DNA-binding factor, by showing that DUXBL cannot switch off its top targets without SETDB1 (Figure 6). Future studies will characterize the molecular mechanisms underlying this (likely indirect) collaboration. We do not think that DUXBL and SETDB1 directly interact, because such an interaction was not detected by DUXBL IP-MS (3).

      (3) Narrow scope on MERVL; broader epigenomic consequences underexplored. 

      Maternal SETDB1 may restrain additional repeat classes or genes beyond the 2C network. A systematic repeatome analysis (LINEs/SINEs/ERV subfamilies) would clarify specificity versus a general loss of heterochromatin control. Moreover, potential effects on imprinting or DNA methylation balance are not examined; perturbations there could also contribute to arrest. Bisulfite-based DNA methylation maps at imprinted loci and allele-specific expression analyses would help rule in/out these mechanisms.

      We did examine genes and repeat elements beyond the 2c network. We evaluated gene and TE expression changes using four-way comparisons. Please find the results regarding gene expression in Figure 1C-J, Figure S2, Figure S3, Figure S4, Table S2, Table S3, and Table S4. Please find the results on TE expression in Figure S5, Table S6, Table S7, and Table S8 and in the text. We agree that DNA methylation may be altered in Setdb1<sup>mat-/+</sup> embryos. In our hands, evaluating this possibility using bisulfite sequencing requires a larger number of embryos than we can feasibly obtain (the number of obtained mutant embryos is very small). Regarding imprinted gene expression, one cannot fully assess and interpret imprinted gene expression in preimplantation stage embryos before the maternally deposited transcripts are gone. We reported earlier that clear somatic parental-specific patterns of imprinted gene expression may only start later in development, around 8.5 dpc (4).

      (4) Phenotype quantitation and transcriptomic breadth could be clearer. 

      The developmental phenotype is described qualitatively ("very few beyond 8-cell") without precise stage-wise arrest rates or representative morphology. Tabulated counts (2C/4C/8C/blastocyst), images, and statistics would increase clarity. On the RNA-seq side, the narrative emphasizes known 2C markers; reporting novel/unannotated misregulated transcripts, as well as downregulated pathways (e.g., failure to activate normal 8-cell programs, metabolism, or early lineage markers), would present a fuller portrait of the mutant state.

      Tabulated counts are displayed in Figure 1A, and morphology is shown in Figure S1A. We do say that 4% of Setdb1<sup>mat-/+</sup> embryos reached the 8-cell stage by 2.5 dpc. We recovered zero Setdb1<sup>mat-/+</sup> blastocysts at 4.5 dpc (not shown). On the RNA-seq side, we do report a more global assessment of transcription of genes and TEs (please see point 3 above), including novel chimeric transcripts (Table S6). Developmental pathways are shown in Figure S3 and Figure S4. Metabolic pathways are displayed in Figure S2.

      Reviewer #2 (Public review): 

      Zeng et al. report that Setdb1-/- embryos fail to extinguish the 1- and 2-cell embryo transcriptional program and have permanent expression of MERVL transposable elements. The manuscript is technically sound and well performed, but, in my opinion, the results lack conceptual novelty.

      (1) The manuscript builds on previous observations that: 1, Setdb1 is necessary for early mouse development, with knockout embryos rarely reaching the 8-cell stage; 2, SETDB1 mediates H3K9me3 deposition at transposable elements in mouse ESCs; 3, SETDB1 silences MERVLs to prevent 2CLC-state acquisition in mouse ESCs. The strength of the current work is the demonstration that this is not due to a general transcriptional collapse; but otherwise, the findings are not surprising. The well-known (several Nature papers of years ago) crosstalk between m6A RNA modification and H3K9me3 in preventing 2CLC generation also partly compromises the novelty of this work.

      We thank the Reviewer for appreciating the technical quality of our work. Regarding novelty, please consider that prior work in ES cells included contradictory findings (please see our Introduction). Prior embryology work (please see our Introduction) did not explain the preimplantation-stage phenotype. We highly appreciate those earlier works. Our work here answers the expectations drawn from prior studies and unequivocally shows that SETDB1 carries out the developmentally essential function of suppressing MERVLs and the 2-cell program in the mouse embryo.

      (2) The conclusions regarding H3K9me3 deposition are inferred based on previously reported datasets, but there is no direct demonstration.

      Dynamic H3K9me3 deposition is displayed at MERVL LTRs during the early2c-late2c-2c-4c-8c-morula window (Figures 4C, 4D, 4E, 6D, 6E and Figure S6) from a published work that has very high-quality data. We agree that demonstrating loss of H3K9me3 in Setdb1<sup>mat-/+</sup> embryos would confirm that the H3K9me3 histone methyltransferase function of SETDB1 (as opposed to any as yet unidentified, non-HMT-specific activity of SETDB1) is responsible for shutting down MERVL LTRs. However, ChIP-seq, CUT&RUN, or similar assays are not feasible due to the rarity of Setdb1<sup>mat-/+</sup> embryos.

      (3) The detection of chimeric transcripts is somewhat unreliable using short-read sequencing.

      We used single-embryo total RNA-seq to detect chimeric transcripts (Table S6). Total RNA-seq is considered more reliable than mRNA-seq for this purpose, because many chimeric transcripts are not polyadenylated. We acknowledge, however, that long-read sequencing, which has recently become available but remains very expensive, is currently the most powerful method for detecting chimeric transcripts. This limitation does not affect the major conclusions or the significance of our work.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      We are grateful to the Review Commons reviewers for their constructive feedback, which has significantly strengthened the manuscript. In response, we have performed additional experiments, revised and expanded multiple figures, incorporated new statistical and functional analyses, and carefully edited the text to improve clarity and precision. A detailed point-by-point response to all reviewer comments, together with a summary of revised figures, is provided.

      To address the reviewers' suggestions, we have conducted additional experiments that are now incorporated into new figures, or we have added new images to several existing figures where appropriate.

      For this reason, please note that all figures have been renumbered to improve clarity and facilitate cross-referencing throughout the text. As recommended by Referee #3, all figure legends have been thoroughly revised to reflect these updates and are now labeled following the standard A-Z panel format, enhancing readability and ensuring easier identification. In addition, all figure legends now include the sample size for each statistical analysis.

      For clarity and ease of reference, we provide below a comprehensive list of all figures included in the revised version. Figures that have undergone modifications are underlined.

      Figure 1. The first spermatogenesis wave in prepuberal mice.

      This figure now includes amplified images of representative spermatocytes and a summary schematic illustrating the timeline of spermatogenesis. In addition, it now presents the statistical analysis of spermatocyte quantification to support the visual data.

      Figure 2. Cilia emerge across all stages of prophase I in spermatocytes during the first spermatogenesis wave.

      The images in this figure remain unchanged from the original submission, but all graphs now present the statistical analysis of spermatocyte quantification.

      Figure 3. Ultrastructure and markers of prepuberal meiotic cilia.

      This figure remains unchanged from the original submission; however, we have replaced the ARL3-labelled spermatocyte image (A) with one displaying a clearer and more representative signal.

      Figure 4. Testicular tissue presents spermatocyte cysts in prepuberal mice and adult humans.

      This figure remains unchanged from the original submission.

      Figure 5. Cilia and flagella dynamics are correlated during prepuberal meiosis.

      This figure remains unchanged from the original submission.

      Figure 6. Comparative proteomics identifies potential regulators of ciliogenesis and flagellogenesis.

      This figure remains unchanged from the original submission.

      Figure 7. Deciliation induces persistence of DNA damage in meiosis.

      This figure has been substantially revised and now includes additional experiments analyzing chloral hydrate treatment, aimed at more accurately assessing DNA damage under both control and treated conditions. Images F-I and graph J are new.

      Figure 8. Aurora kinase A is a regulator of cilia disassembly in meiosis.

      This figure has been remodelled because the original version contained a mistake in the previous panel II; the graph in the new Fig. 8 I has been corrected accordingly. In addition, the figure now contains additional data on α-Tubulin staining in arrested ciliated metaphases I after AURKA inhibition (new panel L1´).

      Figure 9. Schematic representation of the prepuberal versus adult seminiferous epithelium.

      This figure remains unchanged from the original submission.

      Supplementary Figure 1. Meiotic stages during the first meiotic wave.

      This figure remains unchanged from the original submission.

      Supplementary Figure 2 (new).

      This is a new figure that includes additional data requested by the reviewers. It includes additional markers of cilia in spermatocytes (glutamylated Tubulin/GT335) and the control data for cilia markers in non-ciliated spermatocytes. It now also includes the separate quantification of ciliated spermatocytes for each stage, as requested by the reviewers, complementing the graphs included in Figure 2.

      Please note that with the inclusion of this new Supplementary Figure 2, the numbering of subsequent supplementary figures has been updated accordingly.

      Supplementary Figure 3 (previously Suppl. Fig. 2). Ultrastructure of prophase I spermatocytes.

      This figure is identical in content to the original submission, but some annotations have been added.

      Supplementary Figure 4 (previously Suppl. Fig. 3). Meiotic centrosome under the electron microscope.

      This figure remains unchanged from the original submission, but additional annotations have been included.

      Supplementary Figure 5 (previously Suppl. Fig. 4). Human testis contains ciliated spermatocytes.

      This figure has been revised and now includes additional H2AX staining to better determine the stage of ciliated spermatocytes and improve their identification.

      Supplementary Figure 6 (previously Suppl. Fig. 5). GLI1 and GLI3 readouts of Hedgehog signalling are not visibly affected in prepuberal mouse testes.

      This figure has been remodeled and now includes the quantification of GLI1 and GLI3 and the corresponding statistical analysis. It also includes the control data for Tubulin instead of GAPDH.

      Supplementary Figure 7 (previously Suppl. Fig. 6). CH and MLN8237 optimization protocol.

      This figure has been remodeled to incorporate control experiments using 1-hour organotypic culture treatment.

      Supplementary Figure 8 (previously Suppl. Fig. 7). Tracking first meiosis wave with EdU pulse injection during prepubertal meiosis.

      This figure remains unchanged from the original submission.

      Supplementary Figure 9 (previously Suppl. Fig. 8). PLK1 and AURKA inhibition in cultured spermatocytes.

      This figure has been remodeled and now includes additional data on spindle detection in control and AURKA-inhibited spermatocytes (both ciliated and non-ciliated).

      DETAILED POINT-BY-POINT RESPONSE TO THE REVIEWERS

      We will submit both the PDF version of the revised manuscript and the Word file with tracked changes relative to the original submission. Each modification made in response to reviewers' suggestions is annotated in the Word document within the corresponding section of the text. All new figures have also been uploaded to the system.

      Response to the Referee #1

      In this manuscript by Perez-Moreno et al., titled "The dynamics of ciliogenesis in prepubertal mouse meiosis reveal new clues about testicular maturation during puberty", the authors characterize the development of primary cilia during meiosis in juvenile male mice. The authors catalog a variety of testicular changes that occur as juvenile mice age, such as changes in testis weight and germ cell-type composition. They next show that meiotic prophase cells initially lack cilia, and ciliated meiotic prophase cells are detected after 20 days postpartum, coinciding with the time when post-meiotic spermatids within the developing testes acquire flagella. They describe that germ cells in juvenile mice harbor cilia at all substages of meiotic prophase, in contrast to adults where only zygotene stage meiotic cells harbor cilia. The authors also document that cilia in juvenile mice are longer than those in adults. They characterize cilia composition and structure by immunofluorescence and EM, highlighting that cilia polymerization may initially begin inside the cell, followed by extension beyond the cell membrane. Additionally, they demonstrate ciliated cells can be detected in adult human testes. The authors next perform proteomic analyses of whole testes from juvenile mice at multiple ages, which may not provide direct information about the extremely small numbers of ciliated meiotic cells in the testis, and is lacking follow up experiments, but does serve as a valuable resource for the community. Finally, the authors use a seminiferous tubule culturing system to show that chemical inhibition of Aurora kinase A likely inhibits cilia depolymerization upon meiotic prophase I exit and leads to an accumulation of metaphase-like cells harboring cilia. They also assess meiotic recombination progression using their culturing system, but this is less convincing.

      Author response: We sincerely thank Ref #1 for the thorough and thoughtful evaluation of our manuscript. We are particularly grateful for the reviewer's careful reading and constructive feedback, which have helped us refine several sections of the text and strengthen our discussion. All comments and suggestions have been carefully considered and addressed, as detailed below.

      Major comments:

      1. There are a few issues with the experimental set up for assessing the effects of cilia depolymerization on DNA repair (Figure 7-II). First, how were mid pachytene cells identified and differentiated from early pachytene cells (which would have higher levels of gH2AX) in this experiment? I suggest either using H1t staining (to differentiate early/mid vs late pachytene) or the extent of sex chromosome synapsis. This would ensure that the authors are comparing similarly staged cells in control and treated samples. Second, what were the gH2AX levels at the starting point of this experiment? A more convincing set up would be if the authors measure gH2AX immediately after culturing in early and late cells (early would have higher gH2AX, late would have lower gH2AX), and then again after 24hrs in late cells (upon repair disruption the sampled late cells would have high gH2AX). This would allow them to compare the decline in gH2AX (i.e., repair progression) in control vs treated samples. Also, it would be informative to know the starting gH2AX levels in ciliated vs non-ciliated cells as they may vary.

      Response:

      We thank Ref #1 for this valuable comment, which significantly contributed to improving both the design and interpretation of the cilia depolymerization assay.

      Following this suggestion, we repeated the experiment, including 1-hour (immediately after culturing) and 24-hour cultures for both control and chloral hydrate (CH)-treated samples (n = 3 biological replicates). To ensure accurate staging, we now employ triple immunolabelling for γH2AX, SYCP3, and H1T, allowing clear distinction of zygotene (H1T−), early pachytene (H1T−), and late pachytene (H1T+) cells. The revised data, with new images and graphs in Figure 7, now provide a more complete and statistically robust analysis of DNA damage dynamics. These results confirm that CH-induced deciliation leads to persistence of the γH2AX signal at 24 hours, indicating impaired DNA repair progression in pachytene spermatocytes.

      Regarding the reviewer's final point, we regret that a direct comparison of γH2AX levels between ciliated and non-ciliated cells is not technically feasible. To preserve cilia integrity, all cilia-related imaging is performed using the squash technique, which maintains the three-dimensional structure of the cilia but does not allow reliable quantification of DNA damage markers due to nuclear distortion. Conversely, the nuclear spreading technique, used for DNA damage assessment, provides optimal visualization of repair foci but results in the loss of cilia due to cytoplasmic disruption during the hypotonic step. Given that spermatocytes in juvenile testes form developmentally synchronized cytoplasmic cysts, we consider that analyzing a statistically representative number of spermatocytes offers a valid and biologically meaningful measure of tissue-level effects.

      In conclusion, we believe that the additional experiments and clarifications included in revised Figure 7 strengthen our conclusion that cilia depolymerization compromises DNA repair during meiosis. Further functional confirmation will be pursued in future work, since we are currently generating a conditional genetic model for a ciliopathy in our laboratory.

      The authors analyze meiotic progression in cells cultured with/without AURKA inhibition in Figure 8-III and conclude that the distribution of prophase I cells does not change upon treatment. Is Figure 8-III A and B the same data? The legend text is incorrect, so it's hard to follow. Figure 8-III A shows a depletion of EdU-labelled pachytene cells upon treatment. Moreover, the conclusion that a higher proportion of ciliated zygotene cells upon treatment (Figure 8-II C) suggests that AURKA inhibition delays cilia depolymerization (page 13 line 444) does not make sense to me.

      Response:

      We thank Ref#1 for identifying this issue and for the careful examination of Figure 8. We discovered that the submitted version of Figure 8 contained a mismatch between the figure legend and the figure panels. The legend text was correct; however, the figure inadvertently included a non-corresponding graph (previously panel II-A), which actually belonged to Supplementary Figure 7 in the original submission. We apologize for this mistake.

      This error has been corrected in the revised version. The updated Figure 8 now accurately presents the distribution of EdU-labelled spermatocytes across prophase I substages in control and AURKA-inhibited cultures (previously Figure 8-II B, now Figure 8-A). The corrected data show no significant differences in the proportions of EdU-labelled spermatocytes among prophase I substages after 24 hours of AURKA inhibition, confirming that meiotic progression is not delayed and that no accumulation of zygotene cells occurs under this treatment. Therefore, the observed increase in ciliated zygotene spermatocytes upon AURKA inhibition (new Figure 8 H-I) is best explained by a delay in cilia disassembly, rather than by an arrest or slowdown in meiotic progression. The figure legend and main text have been revised accordingly.

      How do the authors know that there is a monopolar spindle in Figure 8-IV treated samples? Perhaps the authors can use a different Tubulin antibody (that does not detect only acetylated Tubulin) to show that there is a monopolar spindle.

      Response:

      We thank Ref#1 for this excellent suggestion. In the original submission (lines 446-447), we described that ciliated metaphase I spermatocytes in AURKA-inhibited samples exhibited monopolar spindle phenotypes. This description was based on previous reports showing that AURKA or PLK1 inhibition produces metaphases with monopolar spindles characterized by aberrant yet characteristic SYCP3 patterns, abnormal chromatin compaction, and circular bivalent alignment around non-migrated centrosomes (1). In our study, we observed SYCP3 staining consistent with these characteristic features of monopolar metaphases I.

      However, we agree with Ref #1 that this could be better supported with data. Following the reviewer's suggestion, we performed additional immunostaining using α-Tubulin, which labels total microtubules rather than only the acetylated fraction. The revised Figure 8 now includes α-Tubulin staining in the same ciliated metaphase I cells shown in the original submission, confirming the presence of defective microtubule polymerization and defective spindle organization. For clarity, we now refer to these ciliated metaphases I as "arrested MI". These new data further support our conclusion that AURKA inhibition disrupts spindle bipolarization and prevents cilia depolymerization, indicating that cilia maintenance and bipolar spindle organization are mechanistically incompatible events during male meiosis. The Abstract, Results, and Discussion sections have been revised accordingly. In particular, the Discussion has been expanded to emphasize that persistence of cilia may interfere with centrosome separation and microtubule polymerization under AURKA inhibition, in contrast with invertebrate systems (e.g., Drosophila (2) and P. brassicae (3)) in which meiotic cilia persist through metaphase I without impairing bipolar spindle assembly.

      1. Alfaro et al. EMBO Rep 22 (2021). DOI: 10.15252/embr.202051030 (PMID: 33615693)
      2. Riparbelli et al. Dev Cell (2012). DOI: 10.1016/j.devcel.2012.05.024 (PMID: 22898783)
      3. Gottardo et al. Cytoskeleton (Hoboken) (2023). DOI: 10.1002/cm.21755 (PMID: 37036073)

      The authors state in the abstract that they provide evidence suggesting that centrosome migration and cilia depolymerization are mutually exclusive events during meiosis. This is not convincing with the data present in the current manuscript. I suggest amending this statement in the abstract.

      Response:

      We thank Ref#1 for this valuable observation, with which we fully agree. To avoid overstatement, the original statement has been removed from the Abstract, Results, and Discussion, and replaced with a more accurate formulation indicating that cilia maintenance and bipolar spindle formation are mutually exclusive events during mouse meiosis.

      This revised statement is now directly supported by the new data presented in Figure 8, which demonstrate that AURKA inhibition prevents both spindle bipolarization and cilia depolymerization. We are grateful to the reviewer for highlighting this important clarification.

      Minor comments:

      The presence of cilia in all stages of meiotic prophase I in juvenile mice is intriguing. Why is the cellular distribution and length of cilia different in prepubertal mice compared to adults (where shorter cilia are present only in zygotene cells)? What is the relevance of these developmental differences? Do cilia serve prophase I functions in juvenile mice (in leptotene, pachytene etc.) that are perhaps absent in adults?

      Related to the above point, what is the relevance of the absence of cilia during the first meiotic wave? If cilia serve a critical function during prophase I (for instance, facilitating DSB repair), does the lack of cilia during the first wave imply differing cilia (and repair) requirements during the first vs latter spermatogenesis waves?

      In my opinion, these would be interesting points to discuss in the discussion section.

      Response:

      We thank the reviewer for these thoughtful observations, which we agree are indeed intriguing.

      We believe that our findings likely reflect a developmental role for primary cilia during testicular maturation. We hypothesize that primary cilia at this stage might act as signaling organelles, receiving cues from Sertoli cells or neighboring spermatocytes and transmitting them through the cytoplasmic cysts shared by spermatocytes. Such intercellular communication could be essential for coordinating tissue maturation and meiotic entry during puberty. Although speculative, this hypothesis aligns with the established role of primary cilia as sensory and signaling hubs for GPCR and RTK pathways regulating cell differentiation and developmental patterning in multiple tissues (e.g., 1, 2). The Discussion section has been expanded to include these considerations.

      1. Goetz et al. Nat Rev Genet (2010). DOI: 10.1038/nrg2774 (PMID: 20395968)
      2. Naturky et al. Cell (2019). DOI: 10.1038/s41580-019-0116-4 (PMID: 30948801)

      Our study focuses on the first spermatogenic wave, which represents the transition from the juvenile to the reproductive phase. It is therefore plausible that the transient presence of longer cilia during this period reflects a developmental requirement for external signaling that becomes dispensable in the mature testis. Given that this is only the second study to date examining mammalian meiotic cilia, there remains a vast area of research to explore. We plan to address potential signaling cascades involved in these processes in future studies.

      On the other hand, while we cannot confirm that the cilia observed in zygotene spermatocytes persist until pachytene within the same cell, it is reasonable to speculate that they do, serving as longer-lasting signaling structures that facilitate testicular development during the critical pubertal window. In addition, the observation of ciliated spermatocytes at all prophase I substages at 20 dpp, together with our proteomic data, supports the idea that the emergence of meiotic cilia exerts a significant developmental impact on testicular maturation.

      In summary, although we cannot yet define specific prophase I functions for meiotic cilia in juvenile spermatocytes, our data demonstrate that the first meiotic wave differs from later waves in cilia dynamics, suggesting distinct regulatory requirements between puberty and adulthood. These findings underscore the importance of considering developmental context when using the first meiotic wave as a model for studying spermatogenesis.

      The authors state on page 9 lines 286-288 that the presence of cytoplasmic continuity via intercellular bridges (between developmentally synchronous spermatocytes) hints towards a mechanism that links cilia and flagella formation. Please clarify this statement. While the correlation between the timing of appearance of cilia and flagella in cells that are located within the same segment of the seminiferous tubule may be hinting towards some shared regulation, how would cytoplasmic continuity participate in this regulation? Especially since the cytoplasmic continuity is not between the developmentally distinct cells acquiring the cilia and flagella?

      Response:

      We thank Ref#1 for this excellent question and for the opportunity to clarify our statement.

      The presence of intercellular bridges between spermatocytes is well known and has long been proposed to support germ cell communication and synchronization (1,2) as well as sharing mRNA (3) and organelles (4). A classic example is the Akap gene, located on the X chromosome and essential for the formation of the sperm fibrous sheath; cytoplasmic continuity through intercellular bridges allows Akap-derived products to be shared between X- and Y-bearing spermatids, thereby maintaining phenotypic balance despite transcriptional asymmetry (5). In addition, more recent work has further demonstrated that these bridges are critical for synchronizing meiotic progression and for processes such as synapsis, double-strand break repair, and transposon repression (6).

      In this context, and considering our proteomic data (Figure 6), our statement did not intend to imply direct cytoplasmic exchange between ciliated and flagellated cells. Although our current methods do not allow comprehensive tracing of cytoplasmic continuity from the basal to the luminal compartment of the seminiferous epithelium, we plan to address this limitation using high-resolution 3D and ultrastructural imaging approaches in future studies.

      Based on our current data, we propose that cytoplasmic continuity within developmentally synchronized spermatocyte cysts could facilitate the coordinated regulation of ciliogenesis, and similarly enable the sharing of regulatory factors controlling flagellogenesis within spermatid cysts. This coordination may occur through the diffusion of centrosomal or ciliary proteins, mRNAs, or signaling intermediates involved in the regulation of microtubule dynamics. However, we cannot exclude the possibility that such cytoplasmic continuity extends across all spermatocytes derived from the same spermatogonial clone, potentially providing a larger regulatory network. This mechanism could help explain the temporal correlation we observe between the appearance of meiotic cilia and the onset of flagella formation in adjacent spermatids within the same seminiferous segment.

      We have revised the Discussion to explicitly clarify this interpretation and to note that, although hypothetical, it is consistent with established literature on cytoplasmic continuity and germ cell coordination.

      1. Dym et al. Biol. Reprod. (1971). DOI: 10.1093/biolreprod/4.2.195 (PMID: 4107186)
      2. Braun et al. Nature (1989). DOI: 10.1038/337373a0 (PMID: 2911388)
      3. Greenbaum et al. Proc. Natl. Acad. Sci. USA (2006). DOI: 10.1073/pnas.0505123103 (PMID: 16549803)
      4. Ventelä et al. Mol Biol Cell (2003). DOI: 10.1091/mbc.e02-10-0647 (PMID: 12857863)
      5. Turner et al. Journal of Biological Chemistry (1998). DOI: 10.1074/jbc.273.48.32135 (PMID: 9822690)
      6. Sorkin et al. Nat Commun (2025). DOI: 10.1038/s41467-025-56742-9 (PMID: 39929837)

      *Note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      Individual germ cells in H&E-stained testis sections in Figure 1-II are difficult to see. I suggest adding zoomed-in images where spermatocytes/round spermatids/elongated spermatids are clearly distinguishable.

      Response:

      We agree with Ref#1's suggestion. We have revised Figure 1 to improve the quality of the H&E-stained testis sections and have added zoomed-in panels where spermatocytes, round spermatids, and elongated spermatids are clearly distinguishable. These additions significantly enhance the clarity and interpretability of the figure.

      In Figure 2-II B, the authors document that most ciliated spermatocytes in juvenile mice are pachytene. Is this because most meiotic cells are pachytene? Please clarify. If the data are available (perhaps could be adapted from Figure 1-III), it would be informative to see a graph representing what proportions of each meiotic prophase substages have cilia.

      Response:

      We thank the reviewer for this valuable observation. Indeed, the predominance of ciliated pachytene spermatocytes reflects the fact that most meiotic cells in juvenile testes are at the pachytene stage (Figure 1). We have clarified this point in the text and have added a new supplementary figure (Supplementary Figure 2, new figure) presenting a graph showing the proportion of spermatocytes at each prophase I substage that possess primary cilia. This visualization provides a clearer quantitative overview of ciliation dynamics across meiotic substages.

      I suggest annotating the EM images in Sup Figure 2 and 3 to make it easier to interpret.

      Response:

      We thank the reviewer for this helpful suggestion. We have now added annotations to the EM images in Supplementary Figures 3 and 4 to facilitate their interpretation. These visual guides help readers more easily identify the relevant ultrastructural features described in the text.

      The authors claim that the ratio between GLI3-FL and GLI3-R is stable across their analyzed developmental window in whole testis immunoblots shown in Sup Figure 5. Quantifying the bands and normalizing to the loading control would help strengthen this claim, as it is hard to interpret the immunoblot in its current form.

      Response:

      We thank the reviewer for this valuable suggestion. Following this recommendation, Supplementary Figure 5 has been revised to include quantification of GLI1 and GLI3 protein levels, normalized to the loading control.

      After quantification, we observed statistically significant differences across developmental stages. Specifically, GLI1 expression is slightly higher at 21 dpp compared to 8 dpp. For GLI3, we performed two complementary analyses:

      • Total GLI3 protein (sum of full-length and repressor forms normalized to loading control) shows a progressive decrease during development, with the lowest levels at 60 dpp (Supplementary Figure 5D).
      • GLI3 activation status, assessed as the GLI3-FL/GLI3-R ratio, is highest during the 19-21 dpp window, compared to 8 dpp and 60 dpp. Although these results suggest a possible transient activation of GLI3 during testicular maturation, we caution that this cannot automatically be attributed to increased Hedgehog signaling, as GLI3 processing can also be affected by other processes, such as changes in ciliogenesis. Furthermore, because the analysis was performed on whole-testis protein extracts, these changes cannot be specifically assigned to ciliated spermatocytes.
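
      To make the two GLI3 readouts above concrete, the following is a minimal Python sketch of how band intensities could be converted into (i) total GLI3 normalized to the loading control and (ii) the GLI3-FL/GLI3-R ratio. The `Lane` class, the function names, and all intensity values are hypothetical placeholders chosen for illustration; they are not the measurements or the analysis code behind Supplementary Figure 5.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    """Raw densitometry values for one immunoblot lane (arbitrary units)."""
    label: str
    gli3_fl: float   # full-length GLI3 band
    gli3_r: float    # GLI3 repressor band
    loading: float   # loading-control band from the same lane


def total_gli3(lane: Lane) -> float:
    """Sum of both GLI3 forms, normalized to the loading control."""
    return (lane.gli3_fl + lane.gli3_r) / lane.loading


def fl_r_ratio(lane: Lane) -> float:
    """GLI3-FL/GLI3-R ratio; the loading control cancels out within a lane."""
    return lane.gli3_fl / lane.gli3_r


# Placeholder values only, NOT experimental data.
lanes = [
    Lane("lane_a", gli3_fl=40.0, gli3_r=80.0, loading=100.0),
    Lane("lane_b", gli3_fl=45.0, gli3_r=45.0, loading=90.0),
]

for lane in lanes:
    print(lane.label, round(total_gli3(lane), 3), round(fl_r_ratio(lane), 3))
```

      Because both GLI3 bands come from the same lane, the loading control cancels out of the FL/R ratio; only the total-GLI3 readout depends on the normalization. This is why total protein levels and processing status can change independently of each other.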

      We have expanded the Discussion to address these findings and to highlight the potential involvement of the Desert Hedgehog (DHH) pathway, which plays key roles in testicular development, Sertoli-germ cell communication, and spermatogenesis (1, 2, 3). We plan to investigate these pathways further in future studies.

      1. Bitgood et al. Curr Biol. (1996). DOI: 10.1016/s0960-9822(02)00480-3 (PMID: 8805249)
      2. Clark et al. Biol Reprod. (2000). DOI: 10.1095/biolreprod63.6.1825 (PMID: 11090455)
      3. O'Hara et al. BMC Dev Biol. (2011). DOI: 10.1186/1471-213X-11-72 (PMID: 22132805)

      *Note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      There are a few typos throughout the manuscript. Some examples: page 5 line 172, Figure 3-I legend text, Sup Figure 5-II callouts, Figure 8-III legend, page 15 line 508, page 17 line 580, page 18 line 611.

      Response:

      We thank the reviewer for detecting this. All typographical errors have been corrected, and figure callouts have been reviewed for consistency.

      Response to the Referee #2

      This study focuses on the dynamic changes of ciliogenesis during meiosis in prepubertal mice. It was found that primary cilia are not an intrinsic feature of the first wave of meiosis (initiating at 8 dpp); instead, they begin to polymerize at 20 dpp (after the completion of the first wave of meiosis) and are present in all stages of prophase I. Moreover, prepubertal cilia (with an average length of 21.96 μm) are significantly longer than adult cilia (10 μm). The emergence of cilia coincides temporally with flagellogenesis, suggesting a regulatory association in the formation of axonemes between the two. Functional experiments showed that disruption of cilia by chloral hydrate (CH) delays DNA repair, while the AURKA inhibitor (MLN8237) delays cilia disassembly, and centrosome migration and cilia depolymerization are mutually exclusive events. These findings represent the first detailed description of the spatiotemporal regulation and potential roles of cilia during early testicular maturation in mice. The discovery of this phenomenon is interesting; however, there are certain limitations in functional research.

      We thank Referee #2 for their careful reading of the manuscript and for highlighting important limitations regarding functional interpretation.

      Our primary objective in this study was to provide a rigorous structural, temporal, and developmental characterization of meiotic ciliogenesis in the mammalian testis, a process for which almost no prior data exist. Given this lack of foundational information, we focused on establishing when, where, and in which meiotic stages primary cilia form during prepubertal development, and on identifying candidate regulatory pathways using complementary imaging, proteomic, and pharmacological approaches.

      We agree that genetic ablation models would provide the most direct means to test ciliary function during spermatogenesis. However, we believe that such functional analyses must be preceded by a detailed developmental and phenotypic framework, which was previously unavailable. The present study therefore represents a necessary first step, defining the dynamics, ultrastructure, and molecular context of meiotic cilia during the transition from juvenile to adult spermatogenesis. We are currently generating conditional genetic models to directly address functional mechanisms in future work.

      Regarding the temporal coincidence between the emergence of meiotic cilia and the onset of flagellogenesis, we do not interpret this observation as evidence of stochastic or non-functional protein expression. Rather, we present it as a developmental correlation that may reflect shared regulatory constraints on axonemal assembly during testicular maturation. We have clarified in the revised manuscript that this relationship is descriptive and hypothesis-generating, and we avoid assigning direct causal roles.

      With respect to the proteomic analysis, we agree that proteomics alone cannot establish function. Our intent was not to assign causality, but to provide a developmental, hypothesis-generating dataset identifying candidate regulators that are enriched at the precise developmental window when both meiotic cilia and spermatid flagella first emerge. We have revised the text to explicitly frame these data as a resource for future mechanistic studies, rather than as direct functional evidence.

      Taken together, we believe that the revised manuscript now more accurately reflects the scope and limitations of the study, while providing a robust and much-needed developmental framework for future genetic and functional analyses of meiotic ciliogenesis in mammals. We would be happy to further clarify any aspect of these interpretations if the reviewer or editor considers it helpful.

      Major points:

      1. The prepubertal cilia in spermatocytes discovered by the authors lack specific genetic ablation to block their formation, making it impossible to evaluate whether such cilia truly have functions. Because neither in the first wave of spermatogenesis nor in adult spermatogenesis does this type of cilium seem to be essential. In addition, the authors also imply that the formation of such cilia appears to be synchronized with the formation of sperm flagella. This suggests that the production of such cilia may merely be transient protein expression noise rather than a functionally meaningful cellular structure.

      Response:

      We agree that a genetic ablation model would represent the ideal approach to directly test cilia function in spermatogenesis. However, given the complete absence of prior data describing the dynamics of ciliogenesis during testis development, our priority in this study was to establish a rigorous structural and temporal characterization of this process in the main mammalian model organism, the mouse. This systematic and rigorous phenotypic characterization is a necessary first step before any functional genetics could be meaningfully interpreted.

      To our knowledge, this study represents the first comprehensive analysis of ciliogenesis during prepubertal mouse meiosis, extending our previous work on adult spermatogenesis (1). Beyond these two contributions, only four additional studies have addressed meiotic cilia: two in zebrafish (2, 3), with Mytlis et al. also providing preliminary observations relevant to prepubertal male meiosis that we discuss in the present work, one in Drosophila (4), and a recent one in butterfly (5). No additional information exists for mammalian gametogenesis to date.

      1. López-Jiménez et al. Cells (2022) DOI: 10.3390/cells12010142 (PMID: 36611937)
      2. Mytlis et al. Science (2022) DOI: 10.1126/science.abh3104 (PMID: 35549308)
      3. Xie et al. J Mol Cell Biol (2022) DOI: 10.1093/jmcb/mjac049 (PMID: 35981808)
      4. Riparbelli et al. Dev Cell (2012) DOI: 10.1016/j.devcel.2012.05.024 (PMID: 22898783)
      5. Gottardo et al. Cytoskeleton (Hoboken) (2023) DOI: 10.1002/cm.21755 (PMID: 37036073)

      We therefore consider this descriptive and analytical foundation to be essential before the development of functional genetic models. Indeed, we are currently generating a conditional genetic model for a ciliopathy in our laboratory. These studies are ongoing and will directly address the type of mechanistic questions raised here, but they extend well beyond the scope and feasible timeframe of the present manuscript.

      We thus maintain that the present work constitutes a necessary and timely contribution, providing a robust reference dataset that will facilitate and guide future functional studies in the field of cilia and meiosis.

      Taking this into account, we would be very pleased to address any additional, concrete suggestions from Ref#2 that could further strengthen the current version of the manuscript.

      The high expression of axoneme assembly regulators such as TRiC complex and IFT proteins identified by proteomic analysis is not particularly significant. This time point is precisely the critical period for spermatids to assemble flagella, and TRiC, as a newly discovered component of flagellar axonemes, is reasonably highly expressed at this time. No intrinsic connection with the argument of this paper is observed. In fact, this testicular proteomics has little significance.

      Response:

      We appreciate this comment but respectfully disagree with the reviewer's interpretation of our proteomic data. To our knowledge, this is the first proteomic study explicitly focused on identifying ciliary regulators during testicular development at the precise window (19-21 dpp) when both meiotic cilia and spermatid flagella first emerge.

      While Piprek et al. (1) analyzed the expression of primary cilia in developing gonads, proteomic data specifically covering the developmental transition at 19-21 dpp were not previously available. Furthermore, a recent cell-sorting study (2) detected expression of cilia proteins in pachytene spermatocytes compared to round spermatids, but did not explore their functional relevance or integrate these data with developmental timing or histological context.

      In contrast, our dataset integrates histological staging, high-resolution microscopy, and quantitative proteomics, revealing a set of candidate regulators (including DCAF7, DYRK1A, TUBB3, TUBB4B, and TRiC) potentially involved in cilia-flagella coordination. We view this as a hypothesis-generating resource that outlines specific proteins and pathways for future mechanistic studies on both ciliogenesis and flagellogenesis in the testis.

      Although we fully agree that proteomics alone cannot establish causal function, we believe that dismissing these data as having little significance overlooks their value as the first molecular map of the testis at the developmental window when axonemal structures arise. Our dataset provides, for the first time, an integrated view of proteins associated with ciliary and flagellar structures at the developmental stage when both axonemal organelles first appear. We thus believe that our proteomic dataset represents an important and novel contribution to the understanding of testicular development and ciliary biology.

      Considering this, we would again welcome any specific suggestions from Ref#2 on additional analyses or clarifications that could make the relevance of this dataset even clearer to readers.

      1. Piprek et al. Int J Dev Biol. (2019) DOI: 10.1387/ijdb.190049rp (PMID: 32149371)
      2. Fang et al. Chromosoma. (1981) DOI: 10.1007/BF00285768 (PMID: 7227045)

      Response to the Referee #3

      In "The dynamics of ciliogenesis in prepubertal mouse meiosis reveals new clues about testicular development" Pérez-Moreno, et al. explore primary cilia in prepubertal mouse spermatocytes. Using a combination of microscopy, proteomics, and pharmacological perturbations, the authors carefully characterize prepubertal spermatocyte cilia, providing foundational work regarding meiotic cilia in the developing mammalian testis.

      Response: We sincerely thank Ref#3 for their positive assessment of our work and for the thoughtful suggestions that have helped us strengthen the manuscript. We are pleased that the reviewer recognizes both the novelty and the relevance of our study in providing foundational insights into meiotic ciliogenesis during prepubertal testicular development. All specific comments have been carefully considered and addressed as detailed below.

      Major concerns:

      1. The authors provide evidence consistent with cilia not being present in a larger percentage of spermatocytes or in other cells in the testis. The combination of electron microscopy and acetylated tubulin antibody staining establishes the presence of cilia; however, proving a negative is challenging. While acetylated tubulin is certainly a common marker of cilia, it is not in some cilia such as those in neurons. The authors should use at least one additional cilia marker to better support their claim of cilia being absent.

      Response:

      We thank the reviewer for this helpful suggestion. In the revised version, we have strengthened the evidence for cilia identification by including an additional ciliary marker, glutamylated tubulin (GT335), in combination with acetylated tubulin and ARL13B (which were included in the original submission). These data are now presented in the new Supplementary Figure 2, which also includes an example of a non-ciliated spermatocyte showing absence of both ARL13B and AcTub signals.

      Taken together, these markers provide a more comprehensive validation of cilia detection and confirm the absence of ciliary labelling in non-ciliated spermatocytes.

      The conclusion that IFT88 localizes to centrosomes is premature as key controls for the IFT88 antibody staining are lacking. Centrosomes are notoriously "sticky", often showing non-specific antibody staining. The authors must include controls to demonstrate the specificity of the staining they observe such as staining in a genetic mutant or an antigen competition assay.

      Response:

      We appreciate the reviewer's concern and fully agree that antibody specificity is critical when interpreting centrosomal localization. The IFT88 antibody used in our study is commercially available and has been extensively validated in the literature as both a cilia marker (1, 2) and a centrosome marker in somatic cells (3). Labelling of IFT88 in centrosomes has also been previously described using other antibodies (4, 5). In our material, the IFT88 signal consistently appears at one of the duplicated centrosomes and at both spindle poles, patterns identical to those reported in somatic cells. We therefore consider the reported meiotic IFT88 staining to be specific and biologically reliable.

      That said, we agree that genetic validation would provide the most definitive confirmation. We are currently generating a conditional genetic model for a ciliopathy in our laboratory that will directly assess both antibody specificity and the functional consequences of cilia loss during meiosis. These experiments are in progress and will be reported in a follow-up study.

      1. Wong et al. Science (2015). DOI: 10.1126/science.aaa5111 (PMID: 25931445)
      2. Ocbina et al. Nat Genet (2011). DOI: 10.1038/ng.832 (PMID: 21552265)
      3. Vitre et al. EMBO Rep (2020). DOI: 10.15252/embr.201949234 (PMID: 32270908)
      4. Robert et al. J Cell Sci (2007). DOI: 10.1242/jcs.03366 (PMID: 17264151)
      5. Singla et al. Dev Cell (2010). DOI: 10.1016/j.devcel.2009.12.022 (PMID: 20230748)

      *Note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      There are many inconsistent statements throughout the paper regarding the timing of the first wave of spermatogenesis. For example, the authors state that round spermatids can be detected at 21 dpp on line 161, but on line 180, say round spermatids can be detected at 19 dpp. Not only does this lead to confusion, but such discrepancies undermine the validity of the rest of the paper. A summary graphic displaying key events and their timing in the first wave of spermatogenesis would be instrumental for reader comprehension and could be used by the authors to ensure consistent claims throughout the paper.

      Response:

      We thank the reviewer for identifying this inconsistency and apologize for the confusion. We confirm that early round spermatids first appear at 19 dpp, as shown in the quantitative data (Figure 1J). This can be detected in squashed spermatocyte preparations, where individual spermatocytes and spermatids can be accurately quantified. The original text contained an imprecise reference to the histological image of 21 dpp (previous line 161), since certain H&E sections did not clearly show all cell types simultaneously. We have now revised Figure 1, improving the image quality and adding a zoomed-in panel highlighting early round spermatids. The image for 19 dpp mice in Figure 1D shows early, still aflagellated spermatids. The first ciliated spermatocytes and the earliest flagellated spermatids are observed at 20 dpp. This has been clarified in the text.

      In addition, we also thank the reviewer for the suggestion of adding a summary graphic, which we agree greatly facilitates reader comprehension. We have added a new schematic summary (Figure 1K) illustrating the key stages and timing of the first spermatogenic wave.

      In the proteomics experiments, it is unclear why the authors assume that changes in protein expression are predominantly due to changes within the germ cells in the developing testis. The analysis is on whole testes including both the somatic and germ cells, which makes it possible that protein expression changes in somatic cells drive the results. The authors need to justify why and how the conclusions drawn from this analysis warrant such an assumption.

      Response:

      We agree with the reviewer that our proteomic analysis was performed on whole testis samples, which contain both germ and somatic cells. Although isolation of pure spermatocyte populations by FACS would provide higher resolution, obtaining sufficient prepubertal material for such analysis would require an extremely large number of animals. To remain compliant with the 3Rs principle for animal experimentation, we therefore used whole-testis samples from three biological replicates per age.

      We acknowledge that our assumption (that the main differences arise from germ cells) is a simplification. However, germ cells constitute the vast majority of testicular cells during this developmental window and are the population undergoing major compositional changes between 15 dpp and adulthood. It is therefore reasonable to expect that a substantial fraction of the observed proteomic changes reflects alterations in germ cells. We have clarified this point in the revised text and have added a statement noting that changes in somatic cells could also contribute to the proteomic profiles.

      The authors should provide details on how proteins were categorized as being involved in ciliogenesis or flagellogenesis, specifically in the distinction criteria. It is not clear how the categorizations were determined or whether they are valid. Thus, no one can repeat this analysis or perform this analysis on other datasets they might want to compare.

      Response:

      We thank the reviewer for this opportunity to clarify our approach. The categorization of proteins as being involved in ciliogenesis or flagellogenesis was based on their Gene Ontology (GO) cellular component annotations obtained from the PANTHER database (Version 19.0), using the gene IDs of the Differentially Expressed Proteins (DEPs). Specifically, we used the GO terms cilium (GO:0005929) and motile cilium (GO:0031514). Since motile cilium is a subcategory of cilium, proteins annotated only with the general cilium term, but not included under motile cilium, were considered to be associated with primary cilia or with shared structural components common to different types of cilia. These GO terms are represented in the bottom panel of Figure 6.

      This information has been added to the Methods section and referenced in the Results for transparency and reproducibility.
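
      The categorization rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `categorize`, the category labels, and the example proteins (`PROT_A` to `PROT_C`) are hypothetical, and the annotations are typed in by hand rather than retrieved from PANTHER. Only the two GO identifiers are taken from our response.

```python
CILIUM = "GO:0005929"         # GO cellular component: cilium
MOTILE_CILIUM = "GO:0031514"  # GO cellular component: motile cilium (child of cilium)


def categorize(go_terms):
    """Assign a protein to a category from its GO cellular-component terms.

    Motile cilium is checked first because it is a subcategory of cilium;
    proteins carrying only the general cilium term fall into the
    primary/shared group.
    """
    terms = set(go_terms)
    if MOTILE_CILIUM in terms:
        return "motile cilium"
    if CILIUM in terms:
        return "cilium only (primary or shared component)"
    return "not ciliary"


# Hypothetical annotations for illustration; not entries from the DEP list.
annotations = {
    "PROT_A": {"GO:0005929", "GO:0031514"},
    "PROT_B": {"GO:0005929"},
    "PROT_C": {"GO:0005737"},  # cytoplasm only
}

for protein, terms in annotations.items():
    print(protein, "->", categorize(terms))
```

      Applying the same rule to any other list of differentially expressed proteins only requires their GO cellular component annotations as input, which should make the comparison with other datasets straightforward.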

      In the pharmacological studies, the authors conclude that the phenotypes they observe (DNA damage and reduced pachytene spermatocytes) are due to loss of or persistence of cilia. This overinterprets the experiment. Chloral hydrate and MLN8237 certainly impact ciliation as claimed, but have additional cellular effects. Thus, it is possible that the observed phenotypes were not a direct result of cilia manipulation. Either additional controls must address this or the conclusions need to be more specific and toned down.

      Response:

      We thank the reviewer for this fair observation and have taken steps to strengthen and refine our interpretation. In the revised version, we now include data from 1-hour and 24-hour cultures for both control and chloral hydrate (CH)-treated samples (n = 3 biological replicates). The triple immunolabelling with γH2AX, SYCP3, and H1T allows accurate staging of zygotene (H1T⁻), early pachytene (H1T⁻), and late pachytene (H1T⁺) spermatocytes.

      The revised Figure 7 now provides a more complete and statistically supported analysis of DNA damage dynamics, confirming that CH-induced deciliation leads to persistent γH2AX signal at 24 hours, indicative of delayed or defective DNA repair progression. We have also toned down our interpretation in the Discussion, acknowledging that CH could affect other cellular pathways.

      As mentioned before, the conditional genetic model that we are currently generating will allow us to evaluate the role of cilia in meiotic DNA repair in a more direct and specific way.

      Assuming the conclusions of the pharmacological studies hold true with the proper controls, the authors still conflate their findings with meiotic defects. Meiosis is not directly assayed, which makes this conclusion an overstatement of the data. The conclusions need to be rephrased to accurately reflect the data.

      Response:

      We agree that this aspect required clarification. As noted above, we have refined both the Results and Discussion sections to make clear that our assays specifically targeted meiotic spermatocytes.

      We now present data for meiotic stages at zygotene, early pachytene and late pachytene. These stages are identified by labelling for SYCP3 and H1T, both specific markers of meiosis that are not detectable in non-meiotic cells. We believe that this is indeed a valid way to assay meiotic cells. Nevertheless, we now specify in the text that we are analysing potential defects in meiotic progression. We apologize if this was not properly explained in the original manuscript; it has been rephrased in both the Results and Discussion sections of the new version.

      It is not clear why the authors chose not to use widely accepted assays of Hedgehog signaling. Traditionally, pathway activation is measured by transcriptional output, not GLI protein expression because transcription factor expression does not necessarily reflect transcription levels of target genes.

      Response:

      We agree with the reviewer that measuring mRNA levels of Hedgehog pathway target genes, typically GLI1 and PTCH1, is the most common method for measuring pathway activation, and is widely accepted by researchers in the field. However, the methods we use in this manuscript (GLI1 and GLI3 immunoblots) are also quite common and widely accepted:

      Regarding GLI1 immunoblot, many articles have used this method to monitor Hedgehog signaling, since GLI1 protein levels have repeatedly been shown to also go up upon pathway activation, and down upon pathway inhibition, mirroring the behavior of GLI1 mRNA. Here are a few publications that exemplify this point:

      • Banday et al. 2025 Nat Commun. DOI: 10.1038/s41467-025-56632-0 (PMID: 39894896)
      • Shi et al 2022 JCI Insight DOI: 10.1172/jci.insight.149626 (PMID: 35041619)
      • Deng et al. 2019 eLife, DOI: 10.7554/eLife.50208 (PMID: 31482846)
      • Zhu et al. 2019 Nat Commun, DOI: 10.1038/s41467-019-10739-3 (PMID: 31253779)
      • Caparros-Martin et al 2013 Hum Mol Genet, DOI: 10.1093/hmg/dds409 (PMID: 23026747)

      *Note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      As for GLI3 immunoblot, Hedgehog pathway activation is well known to inhibit GLI3 proteolytic processing from its full length form (GLI3-FL) to its transcriptional repressor (GLI3-R), and such processing is also commonly used to monitor Hedgehog signal transduction, of which the following are but a few examples:

      • Pedraza et al 2025 eLife, DOI: 10.7554/eLife.100328 (PMID: 40956303)
      • Somatilaka et al 2020 Dev Cell, DOI: 10.1016/j.devcel.2020.06.034 (PMID: 32702291)
      • Infante et al 2018, Nat Commun, DOI: 10.1038/s41467-018-03339-0 (PMID: 29515120)
      • Wang et al 2017 Dev Biol DOI: 10.1016/j.ydbio.2017.08.003 (PMID: 28800946)
      • Singh et al 2015 J Biol Chem DOI: 10.1074/jbc.M115.665810 (PMID: 26451044)

      *Note: due to manuscript-length limitations, not all cited references can be included in the text; they are listed here to substantiate our response.

      In summary, we think that we have used two well established markers to look at Hedgehog signaling (three, if we include the immunofluorescence analysis of SMO, which we could not detect in meiotic cilia).

      These Hh pathway analyses did not provide any convincing evidence that the prepubertal cilia we describe here are actively involved in this pathway, even though Hh signaling is cilia-dependent and is known to be active in the male germline (Sahin et al 2014 Andrology PMID: 24574096; Mäkelä et al 2011 Reproduction PMID: 21893610; Bitgood et al 1996 Curr Biol. PMID: 8805249).

      That said, we fully agree that our current analyses do not allow us to draw definitive conclusions regarding Hedgehog pathway activity in meiotic cilia, and we now state this explicitly in the revised Discussion.

      Also in the Hedgehog pathway experiment, it is confusing that the authors report no detection of SMO yet detect little to no expression of GLIR in their western blot. Undetectable SMO indicates Hedgehog signaling is inactive, which results in high levels of GLIR. The impact of this is that it is not clear what is going on with Hh signaling in this system.

      Response:

      It is true that, when Hh signaling is inactive (and hence SMO not ciliary), the GLI3FL/GLI3R ratio tends to be low.

      Although our data in prepubertal mouse testes show a strong reduction in total GLI3 protein levels (GLI3FL+GLI3R) as these mice grow older, this downregulation of total GLI3 occurs without any major changes in the GLI3FL/GLI3R ratio, which is only modestly affected (Supplementary Figure 6).

      Hence, since it is the ratio that correlates with Hh signaling rather than total levels, we do not think that the GLI3R reduction we see is incompatible with our non-detection of SMO in cilia: it seems more likely that overall GLI3 expression is being downregulated in developing testes via a Hh-independent mechanism.

      Also potentially relevant here is the fact that some cell types depend more on GLI2 than on GLI3 for Hh signaling. For instance, in mouse embryos, Hh-mediated neural tube patterning relies more heavily on GLI2 processing into a transcriptional activator than on the inhibition of GLI3 processing into a repressor. In contrast, the opposite is true during Hh-mediated limb bud patterning (Nieuwenhuis and Hui 2005 Clin Genet. PMID: 15691355). We have not looked at GLI2, but it is conceivable that it could play a bigger role than GLI3 in our model.

      Moreover, several forms of GLI-independent non-canonical Hh signaling have been described, and they could potentially play a role in our model, too (Robbins et al 2012 Sci Signal. PMID: 23074268).

      We have revised the discussion to clarify some of these points.

      All in all, we agree that our findings regarding Hh signaling are not conclusive, but we still think they add important pieces to the puzzle that will help guide future studies.

      There are multiple instances where it is not clear whether the authors performed statistical analysis on their data, specifically when comparing the percent composition of a population. The authors need to include appropriate statistical tests to make claims regarding this data. While the authors state some impressive sample sizes, once evaluated in individual categories (eg specific cell type and age) the sample sizes of evaluated cilia are as low as 15, which is likely underpowered. The authors need to state the n for each analysis in the figures or legends.

      We thank the reviewer for highlighting this important issue. We have now included the sample size (n) for every analysis directly in the figure legends. Although this adds length, it improves transparency and reproducibility.

      Regarding the concerns of Ref#3 about the different sample sizes, the number of spermatocytes quantified at each stage is in agreement with the distribution of stages in meiosis. For example, pachytene lasts for 10 days, so this stage is widely represented in the preparations, whereas metaphases I are much more difficult to quantify because they are less abundant, the stage itself lasting less than 24 hours. Taking this into account, we ensured that all analyses remain statistically valid and representative, applying the appropriate statistical tests for each dataset. These details are now clearly indicated in the revised figures and legends.

      Minor concerns:

      1. The phrase "lactating male" is used throughout the paper and is not correct. We assume this term to mean male pups that have yet to be weaned from their lactating mother, but "lactating male" suggests a rare disorder requiring medical intervention. Perhaps "pre-weaning males" is what the authors meant.

      Response:

      We thank the reviewer for noticing this terminology error. The expression has been corrected to "pre-weaning males" throughout the manuscript.

      The convention used to label the figures in this paper is confusing and difficult to read as there are multiple panels with the same letter in the same figure (albeit distinct sections). Labeling panels in the standard A-Z format is preferred. "Panel Z" is easier to identify than "panel III-E".

      Response:

      We thank the reviewer for this suggestion. All figures have been relabelled using the standard A-Z panel format, ensuring consistency and easier readability across the manuscript.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (R1)

      R1 General statement: Here, Escalera-Maurer and colleagues, present an up-to-date distribution of homologues of Hok toxic proteins belonging to the well-annotated, but otherwise functionally obscure, hok/Sok type I toxin-antitoxin system, across the RefSeq database. Although such computational analyses have been done in the past, the authors here find many more hok homologs than described before, and they categorise their distribution based on whether they are encoded on chromosomes, plasmids, or (pro)phages. These computational analyses are in general tricky with T1TAs, as their toxins are quite short (~50 amino acids, as is the case for Hok), which is why the authors here used three separate approaches to expand their search (nucleotide-level BLAST, protein-homology, or both combined with Infernal). The authors cluster the Hok homologues they find based on a 60% sequence identity cut-off (expanding the known clusters in the process), and proceeded to test 31 candidates belonging to 15 sequence-clusters for their toxicity in Salmonella Typhimurium LT2, showing that 30/31 were toxic upon induction. An interesting finding from their endeavours is that hok/Sok homologues are enriched within prophages and large plasmids, but are not enriched near bacterial anti-phage defense systems (in contrast to the SymE/SymR T1TA). The findings suggest that hok/Sok are indeed sometimes linked to phage and plasmid biology, although they might not be antiphage defenses per se (they have been clearly shown in the past to be addiction modules, and this is still clearly true).

      Authors' answer to R1 General statement: We do not state here that hok/Sok are not anti-phage defense systems; we simply observe that they do not cluster with anti-phage defense systems. We have also observed (unpublished data) that known defense systems do not systematically cluster together with other defense systems. Therefore, a strong association with other defense systems would have been a strong indication of a function in phage defense, but the absence of such an association does not exclude that they are involved in phage defense.

      R1_C1: My expertise lies towards the experimental side of the authors' work, I thus cannot comment on the accuracy/robustness of the computational analyses performed here. The authors do a fine job in clearly stating their findings overall; I could follow most of the conclusions, and I deemed that most of them were supported by their work. Additionally, I find that this paper is a missed opportunity to uncover even more novel biology connected to the interesting hok/Sok T1TAs. The paper does not provide a new framework to think about what is the function of the chromosomal/prophage hok/Sok T1TA systems, although I realize that this is very difficult to accomplish, especially when considering that hok/Sok systems have been around in the literature for almost 40 years.

      Authors' answer to R1_C1: We agree with the reviewer; we indeed performed this analysis with the aim of clarifying the role of hok/Sok systems. However, we still believe that our thorough survey of Hok loci highlights their enrichment in various mobile genetic elements, such as prophages and large conjugative plasmids, which is undoubtedly linked to their function. In addition, our study will guide future experimental efforts to uncover the function of these systems, for example by helping researchers select relevant homologs to test for a specific function.

      R1_C2: My major comment is in regard to the Hok toxicity assays (Fig. 2). The authors state in the discussion that "Hok peptides originating from chromosomes are as toxic as those from plasmids", but I believe that the way that they tested their constructs might not have allowed them to see toxicity differences between the two groups. Specifically, using the multi-copy plasmid pAZ3 (pBR322 origin of replication; ~15-20 plasmid copies per chromosome) to induce the different Hok toxin homologues in Salmonella Typhimurium LT2 with arabinose might have masked toxicity differences that would otherwise be apparent on the chromosomal expression-level.

      Some of the authors themselves have previously used the FASTBAC-Seq method to study the Hok homologue from plasmid R1, a useful technique during which a toxin is integrated in the chromosome, in order to study their toxicity under natural levels of expression. I believe that an ideal scenario would be to apply FASTBAC-seq to some of the 31 Hok homologues described here (e.g., a subset of plasmidic vs chromosomal Hok homologues) to shed light on potential toxicity differences between the Hok clusters. This would increase the value of the presented study.

      Alternatively, the authors could employ an L-arabinose concentration gradient to titrate the expression levels of the Hok toxins in order to potentially see different toxicity levels from the different homologues. However, this is not going to work in the system as they are using it now for two reasons:

      a) The S. Typhimurium LT2 (STm) used here has its arabinose utilization operon intact (araBAD), which means that Salmonella can catabolize arabinose to use it as a carbon source. This catabolization process interferes with the arabinose induction (i.e., Salmonella eats arabinose instead of using it as the Hok inducer). To ameliorate this, the authors could delete the araBAD operon in STm, rendering STm incapable of catabolizing arabinose, and repeat the experiments in that strain. Or use E. coli BW25113 as the expression host, which already has the araBAD operon deleted (it is not clear to me why the different Hok homologues would not be toxic in E. coli, as the different Hok homologues are widely diverse in sequence, as the authors found here).
      b) Even with the araBAD operon deleted, the arabinose induction would be bimodally on or off in the population, due to the bimodal expression of the arabinose transporter (AraE; see Khlebnikov et al., 2002). This would again not allow for titratable arabinose-inducible expression from different concentrations of arabinose. The solution for this would be to co-express a separate plasmid with araE, which would render every cell the same in regards to arabinose permeability, and thus the system would be titratable (as explained in Khlebnikov et al., 2002). Therefore, if the authors would be interested to go towards this route, they would have to first delete the araBAD from STm, then transform STm with an araE plasmid, and redo the experiments. In addition, I would propose to the authors to use the drop plate method (agar plate-based), which is more sensitive compared to the liquid assays employed here.

      Having said all that, I understand that all this experimental work would be strenuous and time-consuming, and although I would like to see it happen, this is not my paper. I would be content therefore if the authors toned down the claim that plasmidic vs chromosomal Hok homologues have the same toxicity, and discuss that chromosomal levels of toxicity are an important caveat that has not been explored here.

      Authors' answer to R1_C2: We thank the reviewer for the detailed suggestion on how to better assess toxicity differences by using an araBAD deletion mutant overexpressing araE. We repeated the arabinose induction assays using drop assays and strain BW25223 with plasmid pJAT13araE and our pAZ3-based plasmids carrying Hok CDS homologs. However, we obtained similar data and could not distinguish between the toxicity of chromosomal and plasmidic CDSs, even using different concentrations of arabinose. This is probably because low concentrations of the Hok protein are sufficient for activity; moreover, by expressing the protein directly we bypass all post-transcriptional silencing by the native Hok mRNAs, and we are using a multicopy plasmid. We have now included the 0.01% arabinose induction drop assays in the manuscript, as the data obtained with other arabinose concentrations did not provide new information. In any case, we are still not accessing the native expression levels, for the following reasons: 1/ chromosomal levels of toxicity were not explored here and 2/ only the toxicity of the coding sequence, not the full mRNA, was tested. Indeed, we do not know the exact sequences of the hok homolog mRNAs, and determining them is beyond the scope of the study. These remarks have been added to the discussion.

      We agree that the sentence "Hok peptides originating from chromosomes are as toxic as those from plasmids" was too strong and we have added the caveats of our experimental design in the discussion. While we indeed did not compare the toxicity of the peptides, we still showed that chromosomal Hok can be toxic upon overexpression, which would not be the case if the sequences were degenerated.

      The reviewer also suggests the use of the FASTBAC-Seq method, which we previously used to study Hok from the R1 plasmid and which allows type I toxins to be studied at the native expression level. While FASTBAC-Seq identifies loss-of-function mutants of the systems, it does not by itself allow a difference in toxicity between systems to be determined. In addition, FASTBAC-Seq was always done in the context of the full mRNA, not only the coding sequence, and these sequences are presently unknown for most homologs.

      Other comments:

      R1_C3: a) There is barely any discussion of the Sok component (RNA antitoxin) of the homologues; why is that? Could you please discuss Sok differences across the homologues, or at least explain why this is not discussed at all in the paper (e.g., in the discussion)?

      Authors' answer to R1_C3: Identifying the Sok RNA sequence is not trivial, which is why it was not done in this study. A paragraph explaining this was added to the discussion.

      R1_C4: b) In the results section, the Hok clusters are referred to as 62 in number ("Because Hok sequences were too short and variable to construct a meaningful phylogenetic tree, we clustered the Hok sequences with a 60% identity threshold and obtained 62 clusters"), but then in the discussion section, the cluster number becomes 74 ("We highlighted the high sequence variability within Hok peptides by obtaining a total of 74 clusters with 60% identity (Fig. S7)."). Which one is the right number, and why is there a discrepancy?

      Authors' answer to R1_C4: We apologize for the discrepancy between the numbers. The first number corresponded to the Hok hits from RefSeq; we then added the Hok hits from the plasmid and virus databases (an analysis performed later in the manuscript). We clarified this information in both the results and the discussion (61 clusters from RefSeq and 79 in total; 74 was a typo).

      R1 Significance: The most well-clarified aspect of the paper presented here is the distribution of Hok homologues, with the novel aspect of the location in which the hok/Sok T1TAs reside (i.e., chromosome, plasmid, or phage). There is room for the molecular genetics part to be developed further, as I discussed earlier, however this study is the most up-to-date characterization of the diversity of Hok homologues, and will be of interest to the T1TA and the general toxin-antitoxin field.

      Reviewer #2 (R2)

      R2 General statement: The authors examined how the Hok toxins are spread across bacterial genomes. The manuscript including its figures is hard to read and understand. I commented on Figure 1 in detail, but similar comments apply to the other figures. Overall, the data lack clarity and precision. Finding information about sequences and clusters in the supplementary materials was not easy. The manuscript should be thoroughly revised. In addition, I believe that other aspects should be developed to expand the interest of the study, such as the co-occurrence of multiple systems in chromosomes, on plasmids and whether they are able to crosstalk. This might provide some evolutionary insights into the biology of these toxins.

      Authors' answer to R2 General statement: We designed all figures according to established standards for scientific data visualization, although we recognize that different presentations may work better for different audiences. In our detailed response to Figure 1A, we explain how UpSet plots are constructed and interpreted, which we hope clarifies the visualization approach for the full dataset. We are open to discussing specific improvements if the reviewer has suggestions for enhanced clarity. To address concerns about accessibility, we want to clarify that all sequences are compiled in Table S1 with their clus100 identifiers, making them easy to locate. We are open to reorganizing supplementary materials if a different structure would be more user-friendly. Finally, we agree that an extensive analysis of co-occurrences and crosstalk would be valuable. However, predicting crosstalk bioinformatically for all genomes presents challenges, as it would require predicting RNA:RNA interactions between hok mRNA and Sok sequences, which are currently unknown. Given these limitations, this analysis was beyond the scope of the current study.

      R2_C1: The introduction lacks information regarding the Hok protein (size, structure prediction, localization) as well as a bit of explanation about the reason of looking at these toxins. The description of the potential roles should be a bit expanded.

      Authors' answer to R2_C1: Following the comment from the reviewer, we have provided additional information about Hok in the introduction.

      R2_C2: When the authors talk about 'loci', they mean genes encoding Hok homologs if I understand correctly. They did not look for the Sok sequences (hok-sok loci).

      Authors' answer to R2_C2: Indeed, we did not look for the Sok sequences and we are only describing Hok homolog loci, which could either encode or lack a Sok homolog.

      R2_C3: It is not clear what the authors did with the sequences for which they could not detect a start codon and a SD (although it is unusual to refer to SD in the context of protein sequence)

      Authors' answer to R2_C3: The peptides were annotated by extending the initial hit until the first start codon. Therefore, all annotated peptides have a start codon. Shine-Dalgarno sequences were annotated when confidently predicted, to provide additional information. Sequences were not excluded based on the presence or absence of the SD.
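
      As an illustration only, the extension step described in this answer could be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes the hit is extended upstream in frame, that ATG, GTG and TTG count as start codons, and that an intervening stop codon aborts the extension. The function name `extend_to_start` and the toy sequences are hypothetical.

```python
# Assumed codon sets; the source does not state which start codons were accepted.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_start(dna, hit_start):
    """Return the index of the nearest in-frame start codon at or upstream
    of hit_start, or None if a stop codon or the sequence start is reached first."""
    for pos in range(hit_start, -1, -3):  # walk upstream one codon at a time
        codon = dna[pos:pos + 3]
        if codon in START_CODONS:
            return pos
        if codon in STOP_CODONS:
            return None
    return None


# Toy sequence: CC | ATG | AAA | CTG | GGC | TAA, with the hit starting at CTG (index 8)
toy_dna = "CC" + "ATG" + "AAA" + "CTG" + "GGC" + "TAA"
annotated_start = extend_to_start(toy_dna, 8)

# Toy sequence with an in-frame stop between the upstream ATG and the hit
blocked_dna = "ATG" + "TAA" + "AAA" + "CCC"
blocked_start = extend_to_start(blocked_dna, 6)
```

      In this sketch a Shine-Dalgarno prediction plays no role in accepting or rejecting a peptide, in line with the statement that sequences were not excluded on that basis.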

      R2_C4: Figure 1A is not clear. The total of the bars equals 32,532 which is the number of 'loci' detected by the combination of the different methods. However, it is not clear to me how many are redundant. For instance, I suppose that all the 8483 sequences that were retrieved using blastn and Infernal were retrieved using MMseqs2, blastn and Infernal. So, what is the actual number of sequences that were found? When the authors talk about 1264 distinct peptides, what do they mean? What are the numbers on the X axis (18209, 2260, 27728)?

      Authors' answer to R2_C4: Figure 1A is a typical "UpSet" plot, as indicated in the legend (A. Lex, N. Gehlenborg, H. Strobelt, R. Vuillemot and H. Pfister, "UpSet: Visualization of Intersecting Sets," in IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1983-1992, 31 Dec. 2014, doi: 10.1109/TVCG.2014.2346248). UpSet plots are a data visualization method for showing more than two intersecting sets. The Hok sequence hits were obtained by three different methods, shown on the rows (MMseqs2, blastn and Infernal); 18209 is therefore the number of hits by MMseqs2, 22680 the number of hits by blastn and 27728 the number of hits by Infernal. The columns show the intersections between these three sets. For example, the 8483 sequences mentioned (second column) were found only by blastn and Infernal, not by MMseqs2. The actual total number of sequences found is indeed 32532. The 1264 distinct peptides are peptides with different sequences: after removing false positives, degenerated sequences and small peptides, we obtained 1264 unique Hok sequences, which are found in the 32532 bacterial loci.
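
      To make the reading of the plot concrete, the short sketch below computes the quantity that each column of an UpSet plot represents: the items found by exactly one combination of methods and by no other. It is an illustrative sketch on toy data; the locus identifiers and the function name `exclusive_intersections` are hypothetical and are not taken from the manuscript.

```python
from itertools import combinations


def exclusive_intersections(sets):
    """Map each combination of set names to the items found in exactly
    those sets and in none of the others (one UpSet column each)."""
    names = sorted(sets)
    columns = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(sets[name] for name in combo))
            others = [sets[name] for name in names if name not in combo]
            outside = set().union(*others)
            exclusive = inside - outside
            if exclusive:
                columns[frozenset(combo)] = exclusive
    return columns


# Toy data: hypothetical locus identifiers returned by each search method
hits = {
    "MMseqs2": {"a", "b", "c"},
    "blastn": {"b", "c", "d", "e"},
    "Infernal": {"c", "d", "e", "f"},
}

columns = exclusive_intersections(hits)
row_totals = {name: len(found) for name, found in hits.items()}
total_loci = sum(len(items) for items in columns.values())
```

      Because every item falls into exactly one column, the column heights sum to the number of distinct loci, whereas the row totals count each locus once per method that found it and therefore add up to more than the total.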

      R2_C5: About Infernal: first the authors are stating that only 8% of the sequences are lost when not considering the mRNA structure - which they seem to consider as negligible. Then in the next section, they state that Infernal is the best tool at identifying clusters that are not detected otherwise. Seems a bit contradictory.

      Authors' answer to R2_C5: We appreciate the reviewer pointing out this apparent contradiction; we have clarified this part in the revised manuscript. Infernal uses both sequence and structure information simultaneously for homology detection. While only 8% of Infernal's hits are detected uniquely when structural information is considered, these sequences account for 9 additional clusters with notably high sequence diversity, which would otherwise have gone undetected. Therefore, we believe that Infernal is the best tool to capture novel cluster diversity.

      R2_C6: Cluster determination. The threshold was put at 60% identity. What is the rationale for the 60% identity? Given that the Hok sequences (like toxins and antitoxins from TA systems in general) are highly variable, this leads to a high number of clusters. I'm not sure of the relevance of these clusters. Are there any other criteria to define clusters?

      Authors' answer to R2_C6: We selected 60% identity as a balance between capturing sequence diversity and generating interpretable results. We also tested 70, 80 and 90% and obtained 128, 221, 377 clusters, respectively, which would be too many for a meaningful visualization and interpretation. The best clustering method would be constructing a phylogenetic tree. However, as explained in the discussion, because the high sequence diversity prevented the construction of a reliable phylogenetic tree, clustering was used as an alternative strategy to identify and interpret patterns of sequence variability.

      R2_C7: The authors claim that most of the Hok diversity is found on chromosomes. However, the number of chromosomal Hok is higher than that located on plasmids, which might be related to the different sizes of the different replicons, i.e., chromosomes being larger than plasmids. Is there a way to normalize by determining the density per size?

      Authors' answer to R2_C7: We do not claim that chromosomes contain most of Hok diversity, as this would indeed be influenced by biases in the databases. We only report that we found most of the diversity in chromosomes; we cannot conclude whether this is a true representation of the frequencies in nature.

      R2_C8: '46 of the 62 clusters contained 10 or less distinct sequences and might be in the process of degenerating'. The authors also linked this with SD detection. Please explain. From what was indicated earlier, I understand that sequences with premature stop codons or short sequences (

      Authors' answer to R2_C8: We did not remove sequences for which we could not predict the SD. Indeed, lacking an SD is a sign that the hok mRNA might not be able to play its biological role and would indicate that the sequences have degenerated. To evaluate this hypothesis, we experimentally tested 5 sequences without a predicted SD, and two of those were not toxic (see Table S2). To assess whether the low-abundance clusters contained degenerated sequences, we experimentally tested representatives from some of the clusters with only one Hok CDS and found most of them to be toxic.

      R2_C9: 'Only 7.3% of the unique sequences were found on both plasmids and chromosomes'. From this observation, the authors conclude that 'there is little stable transfer from chromosomes to plasmids or vice-versa'. I don't understand what this means. Do they mean identical sequences? The fact that sequences differ from chromosomes to plasmids does not rule out 'stable transfer'. What do they actually mean by stable transfer? Once the gene is horizontally transferred, it is fixed and vertically transmitted? Same comments apply to the inter-genera horizontal transfer by plasmids.

      Authors' answer to R2_C9: Because a reliable phylogenetic tree could not be constructed, we used identity of sequences across different localizations or genera as our marker for recent, stable transfer events. We define stable transfer as the persistence of sequences in an unchanged form following horizontal transfer, long enough to be detected in current databases. Our approach likely underestimates total transfer events, as sequences accumulating mutations after transfer would not be captured. We would expect to observe numerous identical sequences across plasmids and chromosomes if frequent exchange were occurring, unless rapid mutation after the transfer prevented their detection as identical sequences. We have added a sentence to clarify this in the manuscript and removed the term stable transfer.
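
      The identity-based marker described in this answer amounts to counting unique peptide sequences that occur on both replicon types. The sketch below illustrates that calculation on toy data; it is not the authors' code, and the peptide strings, the record layout and the function name `fraction_shared` are hypothetical.

```python
from collections import defaultdict


def fraction_shared(records, type_a="chromosome", type_b="plasmid"):
    """Fraction of unique sequences observed on both replicon types.

    records: iterable of (peptide_sequence, replicon_type) pairs, one per locus.
    """
    locations = defaultdict(set)
    for sequence, replicon_type in records:
        locations[sequence].add(replicon_type)
    if not locations:
        return 0.0
    shared = sum(
        1 for types in locations.values() if type_a in types and type_b in types
    )
    return shared / len(locations)


# Toy loci: the last peptide differs from the first by a single residue,
# so exact matching no longer counts it as shared.
toy_records = [
    ("MKLPRSSLVW", "chromosome"),
    ("MKLPRSSLVW", "plasmid"),
    ("MKQQKAMLIA", "chromosome"),
    ("MKQQKAMLIA", "chromosome"),
    ("MTKYALVGLL", "plasmid"),
    ("MKLPRNSLVW", "plasmid"),
]

shared_fraction = fraction_shared(toy_records)
```

      The single-residue variant in the toy data shows why exact identity is a conservative marker: a sequence that mutated after transfer is counted as a separate unique sequence and does not contribute to the shared fraction.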

      R2_C10: I don't understand the next section about 'family'. What do the authors mean about 'family'? Genera? The same applies to the next section about the Y to C recoding. Did the authors do point mutations in the conserved amino acids/codons to test whether they are important for toxicity? Some Hok variants lack some of the conserved amino acids and are toxic (under overexpression conditions in Salmonella). What about T18, C31 and E42?

      Authors' answer to R2_C10: Families (Enterobacteriaceae, Vibrionaceae, etc.) and genera (Escherichia, Salmonella, etc.) refer to the taxonomic categories. Following the reviewer's comment, we experimentally assessed the toxicity of Hok from the R1 plasmid after mutating the conserved amino acids to alanine residues. All the mutants were found to be toxic under our expression conditions.

      R2_C11: The prevalence of Hok in chromosomes or on plasmids might depend on various confounding parameters, such as the size, number of sequences available among others. The authors should find methods to correct for all that.

      Authors' answer to R2_C11: Normalization would indeed be needed if we were comparing the prevalence on chromosomes with the prevalence on plasmids. Here, we do not claim that Hok homologs are more prevalent on plasmids or chromosomes and only describe where we found them.

      R2_C12: Link with defense systems. The threshold was set at 20 kb. Why this threshold?

      Authors' answer to R2_C12: The size of defense islands in a previous report was approximately 40 kb (https://doi.org/10.1126/science.aar4120); by setting a 20 kb threshold, we searched for defense systems in a region of 40 kb around each of the homologs. If a specific homolog were part of a defense island, we would expect it to be less than 20 kb away from a defense system.
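
      The distance criterion in this answer can be illustrated with a short sketch. It is a simplified example rather than the authors' implementation: it assumes linear coordinates (no wrap-around on circular replicons), measures the gap between feature boundaries, and treats "less than 20 kb" as a strict inequality. The dictionary layout, coordinates and function names are hypothetical.

```python
def gap_between(a_start, a_end, b_start, b_end):
    """Distance in bp between two intervals on the same replicon (0 if they overlap)."""
    return max(0, b_start - a_end, a_start - b_end)


def near_defense_system(hok, defense_systems, max_gap=20_000):
    """True if any defense system on the same replicon lies less than max_gap bp away."""
    return any(
        system["replicon"] == hok["replicon"]
        and gap_between(hok["start"], hok["end"], system["start"], system["end"]) < max_gap
        for system in defense_systems
    )


# Toy coordinates, not taken from the manuscript
hok_locus = {"replicon": "chr1", "start": 100_000, "end": 100_150}
close_system = {"replicon": "chr1", "start": 115_000, "end": 118_000}
distant_system = {"replicon": "chr1", "start": 130_000, "end": 135_000}
other_replicon = {"replicon": "plasmid1", "start": 100_200, "end": 103_000}

is_near_close = near_defense_system(hok_locus, [close_system])
is_near_distant = near_defense_system(hok_locus, [distant_system, other_replicon])
```

      Applying the 20 kb limit on both sides of a homolog gives the roughly 40 kb search region mentioned in the answer.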

      R2 Significance: The paper in its current state appears to serve the role of a data repository rather than a thorough and original analysis. It requires extensive revisions before it can be of interest to experts in the toxin-antitoxin field.

      Reviewer #3 (R3):

      R3 General statement: In the manuscript "The Hok bacterial toxin: diversity, toxicity, distribution and genomic localization," Escalera-Maurer et al. investigate the distribution of Hok type I toxin proteins across bacterial species. The Hok-Sok type I toxin-antitoxin system was first described on plasmids, where it serves to maintain the plasmid in a population of bacterial cells: translation of the hok mRNA is prevented via the small antitoxin RNA Sok. Upon plasmid loss, with no new transcription of sok, the highly stable hok mRNA is translated into a small protein, killing the plasmid-less cell. Homologues to the system were identified in the chromosome of E. coli in the 1990s, and subsequent analyses have identified identical systems in other bacterial chromosomes, though they are close relatives to E. coli. Given the increased number of bacterial genomes sequenced, the group examined how widespread Hok may be across bacteria. They used a combination of BLASTn, MMseqs2 (protein) and Infernal (RNA) to identify, as best possible, all possible homologs. They then used sequence identity cut-offs to form Hok "clusters," identified key features of the clusters, and tested the toxicity of overproduction of 31 homologs in a strain of Salmonella. Overall, through a variety of bioinformatic predictions and analyses, the manuscript identifies an expanded number of Hok members not previously identified, broadens the range of species in which Hok is found, supports that Hok is not associated with defense systems, and provides additional support that horizontal transfer of hok genes is likely via plasmids (where hok is presumed to have originated).

      Major comments: There are some areas of the text that are a bit too definitive (these can be fixed or better explained in the text) and a few questions raised about the analyses and interpretations.

      Authors' answer to R3 Major Comment: As suggested by the reviewer, we rephrased parts of the manuscript.

      These are the specific comments:

      Introduction R3_C1: First paragraph: "Toxin production leads to the death of the cell encoding it" For many chromosomally encoded systems, toxicity has only been observed via artificial overexpression. This is an important point, as for many systems, a true biological function remains unknown. Further, add caveats regarding toxin function (for systems with validated function, they are involved in...). Again, there are still many questions for many t-at systems, in particular the Type I systems.

      Authors' answer to R3_C1: Indeed, the function of type I TA systems, in particular chromosomal ones, is still a matter of debate. While for hok/Sok R1 we previously showed death by expression at the chromosomal level, this was not shown for all TA systems (Le Rhun et al., NAR, 2023). We added that toxin production could instead lead to the death or growth arrest of the cell, and incorporated the reviewer's suggested changes regarding function.

      R3_C2: Introduction: type I's are more narrow in distribution, but much of this is due to their size and lack of biochemical domains. Again, please clarify more here.

      Authors' answer to R3_C2: We added the reviewer's suggestion to the text.

      R3_C3: Introduction: while Hok's have been found on chromosomes, in E. coli strains, there is clear evidence that many are inactive. This comes up in the discussion, but it is worth including briefly in the introduction.

      Authors' answer to R3_C3: We have now added in the introduction that in the K12 laboratory strain, most chromosomal hok/Sok were found to be inactive.

      R3_C4: For the predicted transmembrane domain: it would be worth to include a box/indication as to where that is within the peptide (with the understanding it may not be exact). Is there more/less variation here? I'm assuming all clusters/family have a predicted TM domain?

      Authors' answer to R3_C4: Using DeepTMHMM 1.0 (https://services.healthtech.dtu.dk/services/DeepTMHMM-1.0/), 227 of the 1264 unique Hok sequences are predicted to have a TM (transmembrane) domain, 7 an SP (signal peptide) and a TM domain, and 1025 an SP. When predicting the TM domain of the consensus sequence (most abundant amino acid) shown in Fig. 1D, the region A8 to L25 is predicted to be inserted in the membrane, with the N-terminus inside and the C-terminus outside.

      R3_C5: What is the cutoff for being a Hok? Did they take the "last hit" and use that in additional searches to see if more appeared? If that was done, and the search was exhaustive, this is really important to add for the reader.

      Authors' answer to R3_C5: The MMseqs2 search was performed using 5 iterations, as indicated in the M&M, meaning that the hits of one search were used to search the database again, five times in a row. Importantly, increasing the number of iterations to 10 did not significantly increase the number of hits. Therefore, at least for the MMseqs2 search in the RefSeq database, we are close to being exhaustive.

      R3_C6: Figure S4: the authors state that there was no difference in the degree of toxicity between the clusters. There do appear to be some peptides tested that at the arabinose concentration used did not repress growth as immediately as others. If higher arabinose concentration is used, does that eliminate these differences? OR are many of these suppressors-if diluted back again, do they grow as if they are non-toxic in arabinose?

      Authors' answer to R3_C6: As suggested by Reviewer 1 (R1_C2), we performed an arabinose titration in a ΔaraBAD strain overexpressing araE but were not able to find a difference in toxicity under our conditions; see also our answer to R1_C2.

      R3_C7: Discussion: "because non-functional homologs are expected to quickly accumulate mutations..." is a bit problematic. Hok is highly regulated-as are some of the other well-described type I toxins. In MG1655, while the coding sequence may be intact, there are other mutations and/or insertion elements that prevent expression (and by extension, function). Given the lack of consensus data for type Is, it is best to provide more context for this. If the authors wish to argue that they should quickly accumulate mutations, it would be good to provide additional rates/evidence (even for other loci) from the Enterobacteriaceae.

      Authors' answer to R3_C7: We agree this statement might need to be supported further. We have removed this sentence to address this concern.

      Minor comments:

      R3_C8: For the sequences used in the search: please provide the sequence used in addition to the reference to the T1TAdb. Was the full-length hok mRNA, including mok, used? Please provide the nucleic acid sequence (and include description of whether full-length, etc.) in Materials and Methods or in Supplemental.

      Authors' answer to R3_C8: Sequences and code were deposited at https://gitub.u-bordeaux.fr/alerhun/Escalera-Maurer_2025. The files named curated_Hok.fasta and hok.fa, corresponding to the Hok protein and mRNA sequences respectively, are available in the file "T1TAdb input".

      R3_C9: 60% identity was used for clustering. Did this become a problem-meaning separation of same property amino acid?

      Authors' answer to R3_C9: We checked amino acid signatures for each cluster (Fig S2), but could not find anything relevant.

      R3_C10: Fig. S2: for the clusters shown, please add in HokB, HokE, etc., to better correspond to Figure 1 in the main text.

      Authors' answer to R3_C10: The clusters were annotated according to the suggestion.

      R3_C11: Fig S1: this figure is challenging to orient-what are the numbers (8_10_85)?

      Authors' answer to R3_C11: The figure was generated using the CLANS tool, with each unique sequence retrieved by our analysis shown as a dot. Hok homologous sequences are in red and cluster together; the outlier clusters are annotated with the numbers corresponding to their 60% identity cluster. We understand that separating the numbers with an underscore could lead to confusion, so we have now separated them with a comma.

      R3_C12: Please make a separate table or sheet for the experimentally tested peptides. Table S1 is quite large and a separate table/sheet would make this easier to find. If possible, please give the files names a more descriptive title (Table S1 in the name for example). This may be an issue with Review Commons but the individual file names were non-descript and the descriptions on the webpage did not indicate what the file contained.

      Authors' answer to R3_C12: We named the files Table S1 and File_S1 to S7. We added a Table S2 with the experimentally tested peptides. Note that identical peptides can sometimes be found in several bacterial loci.

      R3_C13: Figure S9: the black arrow for Hok is hard to see-it appears that the long grey bar going through multiple loci is indicative of Hok. Perhaps label this differently to make it easier on the reader (the line initially seemed to be a formatting issue and not indicative of the position of Hok).

      Authors' answer to R3_C13: We have now added a new label to indicate where Hok is, and clarified it in the figure legend.

      R3_C14: While the authors focused on Hok for this approach, which is fine and appropriate, can they comment at all about where mok is there in these new clusters/sub-families? Sok potential?

      Authors' answer to R3_C14: We added a paragraph about Mok in the discussion.

      R3 Significance: Overall the paper is a sound bioinformatic exercise and is improved with the testing of numerous "new" Hok proteins. Most of the comments can be done with some clarifications and maybe some additional analyses and/or verification which should take minimal time. The authors are over-emphatic at points as indicated and need to be more careful and precise with their language.

      In terms of advancement, it advances the distribution of these systems and adds to the depth of sub-classes. The audience will be more specialized to those who study these systems.

      Expertise: I have been studying type I toxin-antitoxin systems since the mid-2000s. We published a study examining (and mentioned well by this article!) the distribution in chromosomes of type I toxin-antitoxin systems, identified brand-new systems (that were chromosomally-limited at the time). My lab has continued to study regulation of type I toxins and distribution of chromosomally-only-encoded systems (so not Hok).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) The authors devote significant effort to characterizing the physical interaction between Bicc1 and Pkd2. However, the study does not examine or discuss how this interaction relates to Bicc1's well-established role in posttranscriptional regulation of Pkd2 mRNA stability and translation efficiency.

      The reviewer is correct that the present study has not addressed the downstream consequences of this interaction, considering that Bicc1 is a posttranscriptional regulator of Pkd2 (and potentially Pkd1). We think that the Bicc1/Pkd1/Pkd2 complex retains Bicc1 in the cytoplasm and thus restricts its participation in posttranscriptional regulation (see Author response image 1). We do not yet have data to support this, however, and thus have not included this model in the manuscript. We have nonetheless updated the discussion of the manuscript to elaborate further on the potential mechanism of the Bicc1/Pkd1/Pkd2 complex.

      We have updated the discussion to include a discussion on the potential consequences on posttranscriptional regulation by Bicc1.

      Author response image 1.

      Model of BICC1, PC1 and PC2 self-regulation. In this model Bicc1 acts as a positive regulator of PKD gene expression. In the presence of ‘sufficient’ amounts of PC1/PC2 complex, it is tethered to the complex and remains biologically inactive (Fig. 1A). However, once the levels of the PC1/PC2 complex are reduced, Bicc1 is now present in the cytoplasm to promote expression of the PKD proteins, thereby raising their levels (Fig. 4B), which then in turn will ‘shutdown’ Bicc1 activity by again tethering it to the plasma membrane.

      (2) Bicc1 inactivation appears to downregulate Pkd1 expression, yet it remains unclear whether Bicc1 regulates Pkd1 through direct interaction or by antagonizing miR-17, as observed in Pkd2 regulation. This should be further examined or discussed.

      This is a very interesting comment. Vishal Patel published that PKD1 is regulated by a miR-17 binding site in its 3’UTR (PMID: 35965273). We have not, however, evaluated whether BICC1 participates in this regulation. A definitive answer would require the mice described in the above reference, which is beyond the scope of this manuscript. We have nonetheless revised the discussion to elaborate on this potential mechanism.

      We have updated the discussion to include a statement on the potential direct regulation of Pkd1 mRNA by Bicc1.

      (3) The evidence supporting Bicc1 and ADPKD gene cooperativity, particularly with Pkd1, in mouse models is not entirely convincing, likely due to substantial variability and the aggressive nature of Bpk/Bpk mice. Increasing the number of animals or using a milder Bicc1 strain, such as jcpk heterozygotes, could help substantiate the genetic interaction.

      We initially performed the analysis using the complete Bicc1 knockout that we previously reported (PMID 20215348), focusing on compound heterozygotes. Yet, similar to the Pkd1/Pkd2 compound heterozygotes (PMID 12140187), no cyst development was observed when we sacrificed the mice as late as P21. Our strain is similar to the above-mentioned jcpk, which is characterized by a short, abnormal transcript thought to result in a null allele (PMID: 12682776). We thank the reviewer for pointing us to the reference showing that heterozygous mice exhibit glomerular cysts as adults (PMID: 7723240). This is an interesting idea that we will investigate. In general, we agree with the reviewer that a better understanding of the contribution of Bicc1 to the adult PKD phenotype will be critical. To this end, we are currently generating a floxed allele of Bicc1 that will allow us to address the cooperativity in the adult kidney, e.g. when crossed to the Pkd1<sup>RC/RC</sup> mice. Yet, these experiments are beyond the timeframe for this revision.

      No changes were made in the revised manuscript. 

      Reviewer #2 (Public review):

      (1) These results are potentially interesting, despite the limitation, also recognized by the authors, that BICC1 mutations seem exceedingly rare in PKD patients and may not "significantly contribute to the mutational load in ADPKD or ARPKD". The manuscript has several intrinsic limitations that must be addressed. 

      As mentioned above, the study was designed to explore whether there is an interaction between BICC1 and PKD1/PKD2 and whether this interaction is functionally important. How this translates into clinical relevance will require additional studies (and we have addressed this in the discussion of the manuscript).

      (2) The manuscript contains factual errors, imprecisions, and language ambiguities. This has the effect of making this reviewer wonder how thorough the research reported and analyses have been. 

      We respectfully disagree with the reviewer on the latter interpretation. The study was performed with rigor. We have carefully assessed the critiques raised by the reviewer. As presented below, most of the criticisms raised by the reviewer have been easily addressed in the revised version of the manuscript. Yet, none of the critiques seems to directly impact the overall interpretation of the data. 

      Reviewer #1 (Recommendations for the authors):

      (1) The manuscript requires further editing. For example, figure panels and legends are mismatched in Figure 1

      We have corrected the labeling of Figure 1. 

      (2) Y-axis units and values are inconsistent in Figures 4b-4g, Supplementary Figures S2e and S2f are not referenced in the text, genotypes are missing in Supplementary Figure S3f, and numerous typographical errors are present.

      With respect to the y-axes in Figure 4b-g, the scale differs between panels. This is intentional, as the differences would be lost if all panels were scaled identically. We have now mentioned this in the figure legend to make the reader aware of it. With respect to Supplementary Figure S2e,f, we included the panels in the description of the mutant BICC1 lines, but unfortunately forgot to reference them. This has now been done.

      We have updated the labeling of the Y-axis for the cystic indices adding “[%]” as the unit and updated the figure legend of Figure 4. We have included the genotypes in Supplementary Figure S3f. The Supplementary Figure S2e,f is now mentioned in the supplemental material (page 9, 2<sup>nd</sup> paragraph). 

      Reviewer #2 (Recommendations for the authors):

      (1) "Previous data from mouse, Xenopus, and zebrafish suggest a crucial role for the RNA-binding protein Bicc1 in the pathogenesis of PKD, although BICC1 mutations in human PKD have not been previously reported." The cited sources (and others that were not cited) link Bicc1 mutations to renal cysts, similar to a report by Kraus (PMID: 21922595) that the authors cite later. However, a more direct link to PKD was reported by Lian and colleagues using whole Pkd1 mice (PMID: 20219263) and by Gamberi and colleagues using Pkd1 kidneys and human microarrays (PMID: 28406902). Although relevant, neither is cited here, and only the former is cited later in the manuscript.

      Thanks for pointing this out. We have added these three citations.

      We have added these three citations (PMID: 21922595, PMID: 20219263 and PMID: 28406902) in the indicated sentence.

      (2) In Figure 1B, the lanes do not seem to correspond among panels, particularly evident in the panel with myc-mBicc1. Hence, it is difficult to agree with the presented conclusions.

      We have corrected the labeling of the lanes in Figure 1b.

      (3) In the Figure 1 legend: "(g) Western blot analysis following co-IP experiments, using an anti-mouse Bicc1 or anti-goat PC2 antibody as bait, identified protein interactions between endogenous PC2 and BICC1 in UCL93 cells. Non-immune goat and mouse IgG were included as a negative control." There is no mention of panel H, although this reviewer can imagine what the authors meant. The capitalization differs in the figure and legend. More troublingly, in panel G, a non-defined star indicates a strong band present in both immune and non-immune control.

      We have corrected the figure legend of Figure 1 and clarified the non-specific band in the figure legend.

      (4) In Figure 4, the authors do not show the matched control for the Bicc1 Pkd1 interaction in panel d, nor do they show a scale bar in either a) or d). Thus, the phenotypic severity cannot be properly assessed.

      Thanks for pointing out the missing scale bars, which have now been added. The two kidneys shown in Figure 4d are from littermates and illustrate the kidney size, in agreement with the cumulative data shown in Figure 4e. Unfortunately, this litter did not include a wildtype control. As the data analysis in Figure 4e is based on littermates, mixing and matching kidneys from different litters does not seem appropriate. Thus, we have not shown a wildtype control in this panel. However, the size of a wildtype kidney can be seen in Figure 4a.

      We have added the scale bar to both panels and have updated the figure legend to emphasize that the kidneys shown are from littermates and that no wildtype littermate was present in this litter.

      (5) "Surprisingly, an 8-fold stronger interaction was observed between full-length PC1 and myc-mBicc1-ΔKH compared to myc-mBicc1 or myc-mBicc1-ΔSAM." Assuming all the controls for protein folding and expression levels have been carried out and not shown/mentioned, this sentence seems to contradict the previous statement that Bicc1deltaSAM reduced the interaction with PC1 by 55%. Because the full length and SAM deletion have different interaction strengths, the latter sentence makes no sense.

      The reduction in PC1 binding of myc-mBicc1-ΔSAM compared to wildtype myc-mBicc1 was not significant. We have clarified this in the text.

      We have corrected the sentence and modified the Figure accordingly. 

      (6) Imprecise statements make a reader wonder how to interpret the data: "More than three independent experiments were analyzed." Stating the sample size or including it in the figure would save space and improve confidence in the data presented.

      We have stated the exact number of animals per condition above each of the bars.

      (7) "Next, we performed a similar mouse study for Pkd1 by reducing the gene dose of Pkd1 postnatally in the collecting ducts using a Pkhd1-Cre as previously described40" What did the authors mean?

      The reference was included to cite the mouse strain, but we realize that it could be misinterpreted as implying that the exact experiment had been performed previously. We have clarified this in the text.

      We have reworded the sentence to avoid misinterpretation. 

      (8) The authors examined the additive effects of knocking down Bicc1, Pkd1, and Pkd2 with morpholinos in Xenopus and, genetically, in mice. While the Bicc1[+/-] Pkd1 or 2[+/-] double heterozygote mice did not show phenotypes, the authors report that the Bicc1[-/-] Pkd1 or 2 [+/-] did instead show enlarged kidneys. What is the phenotype of a Bicc1[+/-] Pkd1 or 2 [-/-]? What we learn from the author's findings among the PKD population suggests that the latter situation would be potentially translationally relevant.

      The mouse experiments were designed to address a cooperativity between Bicc1 and either Pkd1 or Pkd2 and whether removal of one copy of Pkd1 or Pkd2 would further worsen the Bicc1 cystic kidney phenotype. Thus, the parental crosses were chosen to maximize the number of animals obtained for these genotypes. Unfortunately, these crosses did not yield the genotypes requested by the reviewer. To address the contribution of Bicc1 towards the PKD population, we will need to perform a different cross, where we eliminate Pkd1 or Pkd2 in a floxed background of Bicc1 postnatally in adult mice. While we are gearing up to perform such an experiment, this is timewise beyond the scope of the manuscript. In addition, please note that we have addressed the question about the translation towards the PKD population already in the discussion of the original submission (page 13/14, last/first paragraph).

      No changes have been made to the revised version of the manuscript.

      (9) How do the authors interpret the milder effects of the Bicc1[-/-] Pkd1[+/-] compared to Bicc1[-/-] Pkd2[+/-] relative to the respective protein-protein interactions?

      The milder effects are due to the nature of the crosses. While the Pkd2 mutant is a germline mutation, the Pkd1 mutant is a conditional allele eliminating Pkd1 only in the collecting ducts of the kidney. As such, we spare other nephron segments such as the proximal tubules, which also significantly contribute to the cyst load. As such these mouse data support the interaction between Pkd1 and Pkd2 with Bicc1, but do not allow us to directly compare the outcomes. While this was mentioned in the previous version of the manuscript, we have expanded on this in the revised version of the manuscript.

      We have expanded the results section in the revised version of the manuscript highlighting that the two different approaches cannot be directly compared.

      (10) How do the authors interpret that the strong Bicc1[Bpk] Pkd1 or Pkd2 double heterozygote mice did not have defects and "kidneys from Bicc1+/-:Pkd2+/- did not exhibit cysts (data not shown)", when the VEO PKD patients and - although not a genetic reduction - also the morpholino-treated Xenopus did?

      VEO PKD patients are characterized by a loss of function of PKD1 or PKD2, and, as we propose in this manuscript, BICC1 further aggravates the phenotype. Yet, neither the mouse nor the Xenopus experiments address whether BICC1 is a genetic modifier. We are simply addressing whether the two genes show a genetic interaction. In the mouse studies, we eliminate one copy of Pkd1 or Pkd2 in the background of a hypomorphic allele of Bicc1. Similarly, in the Xenopus experiments, we employ suboptimal doses of the morpholino oligomers, i.e., concentrations that did not yield a phenotypic change, and then ask whether removing both together shows cooperativity. It is important to state that this is based on a biological readout and is not defined by the amount of protein. While we described this already in the original manuscript (page 7, first paragraph), we have amended our description of the Xenopus experiment to make this even clearer.

      Finally, we agree with the reviewer that if we were to address whether Bicc1 is a modifier of the PKD phenotype in mouse, we would need to reduce Bicc1 function in Pkd1 or Pkd2 mutants. We recognized this already in the discussion of the initial version of the manuscript (page 14, first paragraph).

      We have expanded the results section when discussing the suboptimal amounts of the morpholino oligos (Page 6, 1<sup>st</sup> paragraph).

      (11) Unclear: "While variants in BICC1 are very rare, we could identify two patients with BICC1 variants harboring an additional PKD2 or PKD1 variant in trans, respectively." Shortly after, the authors state in apparent contradiction that "the patients had no other variants in any of other PKD genes or genes which phenocopy PKD including PKD1, PKD2, PKHD1, HNF1s, GANAB, IFT140, DZIP1L, CYS1, DNAJB11, ALG5, ALG8, ALG9, LRP5, NEK8, OFD1, or PMM2."

      The reviewer is correct. This should have been phrased differently. We have now added “Besides the variants reported below” to clarify this more adequately.

      The sentence was changed to start with “Besides the variants reported below, […].”

      (12) "The demonstrated interaction of BICC1, PC1, and PC2 now provides a molecular mechanism that can explain some of the phenotypic variability in these families." How do the authors reconcile this statement with their reported ultra-rare occurrence of the BICC1 mutations?

      As mentioned in the manuscript and also in response to the other two reviewers, Bicc1 has been shown to regulate Pkd2 gene expression in mice and frogs via an interaction with the miR-17 family of microRNAs. Moreover, the miR-17 family has been demonstrated to be critical in PKD (PMID: 30760828, PMID: 35965273, PMID: 31515477). In fact, both other reviewers have pointed out that we should stress this more since Bicc1 is part of this regulatory pathway. Future experiments are needed to address whether Bicc1 contributes to the variability in ADPKD onset/severity. Yet, this is beyond the scope of this study.

      Based on the comments of the two other reviewers we have further addressed the Bicc1/miR-17 interaction.

      (13) The manuscript should use correct genetic conventions of italicization and capitalization. This is an issue affecting the entire manuscript. Some exemplary instances are listed below.

      (a) "We also demonstrate that Pkd1 and Pkd2 modifies the cystic phenotype in Bicc1 mice in a dose-dependent manner and that Bicc1 functionally interacts with Pkd1, Pkd2 and Pkhd1 in the pronephros of Xenopus embryos." Genes? Proteins?

      The data presented in this section show that a hypomorphic allele of Bicc1 in mouse and a knockdown in Xenopus yields this. As both affect the proteins, the spelling should reflect the proteins.

      No changes have been made in the revised manuscript.

      (b) The sentence seems to use both the human and mouse genetic capitalization, although it refers to experiments in the mouse system “to define the Bicc1 interacting domains for PC2 (Fig. 2d,e). Full-length PC2 (PC2-HA) interacted with full-length myc-mBICC1.”

      We agree with the reviewer that stating the species of the molecules used is critical. We have adopted a spelling convention for Bicc1 in which BICC1 is the human homologue, mBicc1 is the mouse homologue and xBicc1 is the Xenopus one.

      We have highlighted the species spelling in the methods section and labeled the species accordingly throughout the manuscript and figures. 

      (14) “Together these data supported our biochemical interaction data and demonstrated that BICC1 cooperated with PKD1 and PKD2.” Are the authors implying that these results in mice will translate to the human protein?

      We agree that we have not formally shown that the same applies to the human proteins. Thus, we have changed the spelling accordingly.

      We have revised the capitalization of the proteins. 

      (15) The text is often unclear, terse, or inconsistent.

      (a) “These results suggested that the interaction between PC1 and Bicc1 involves the SAM but not the KH/KHL domains (or the first 132 amino acids of Bicc1). It also suggests that the N-terminus could have an inhibitory effect on PC1-BICC1 association.” How do the authors define the N-terminus? The first 132 aa? KH/KHL domains?

      This was illustrated in the original Figure 2A. The ΔKH constructs lack the first 351 amino acids.

      To make this more evident, we have specified this in the text as well.

      (b) Similarly, the authors state below, "Unlike PC1, PC2 interacted with myc-mBICC1-ΔSAM, but not myc-mBICC1-ΔKH suggesting that PC2 binding is dependent on the N-terminal domains but not the SAM domain." It is unclear if the authors refer to the KH/KHL domains or others. Whatever the reference to the N-terminal region, it should also be consistent with the section above.

      This is now specified in the text.

      (c) Unclear: "We have previously demonstrated that Pkd2 levels are reduced in a complete Bicc1 null mice,22 performing qRT-PCR of P4 kidneys (i.e. before the onset of a strong cystic phenotype), revealed that Bicc1, Pkd1 and Pkd2 were statistically significantly downregulated (Fig. 4h-j)".

      We have changed the text to clarify this. 

      (d) “Utilizing recombinant GST domains of PC1 and PC2, we demonstrated that BICC1 binds to both proteins in GST-pulldown assays (Fig. 1a, b)." GST-tagged domains? Fusions?

      We have changed the text to clarify this. 

      (e) "To study the interaction between BICC1, PKD1 and PKD2 we combined biochemical approaches, knockout studies in mice and Xenopus, genetic engineered human kidney cells" > genetically engineered.

      We have changed the text to clarify this.

      (f) Capitalization (e.g., see Figure S3, ref. the Bpk allele) and annotation (e.g., Gly821Glu and G821E) are inconsistent.

      We have homogenized the labeling of the capitalization and annotations throughout the manuscript. 

      (g) What do the authors mean by "homozygous evolutionarily well-conserved missense variant"?

      We have changed this in the revised version of the manuscript.

      Reviewer #3 (Public review/Recommendations to the authors):

      (1) A further study in HUREC cells investigating the critical regulatory role of BICC1 and potential interaction with mir-17 may yet lead to a modifiable therapeutic target.

      (2) This study should ideally include experiments in HUREC material obtained from patients/families with BICC1 mutations and studying its effects on the PKD1/2 complex in primary cell lines.

      This is an excellent suggestion. We agree with the reviewer that it would have been interesting to analyze HUREC material from the affected patients. Unfortunately, besides DNA and the phenotypic analysis described in the manuscript, no human tissue or primary patient-derived cells were collected after the two patients with the BICC1 p.Ser240Pro variant passed away.

      No changes to the revised manuscript have been made to address this point.

      (3) Please remove repeated words in the following sentence in paragraph 2 of the introduction: "BICC1 encodes an evolutionarily conserved protein that is characterized by 3 K-homology (KH) and 2 KH-like (KHL) RNA-binding domains at the N-terminus and a SAM domain at the C-terminus, which are separated by a by a disordered intervening sequence (IVS).23-28".

      This has been changed.

    1. Author response:

      Reviewer #1 (Public review):

      The authors analysed large-scale brain-state dynamics while humans watched a short video. They sought to identify the role of thalamocortical interactions.

      Major concerns

      (1) Rationale for using the naturalistic stimulus

      In terms of brain state dynamics, previous studies have already reported large-scale neural dynamics by applying some data-driven analyses, like energy landscape analysis and Hidden Markov Model, to human fMRI/EEG data recorded during resting/task states. Considering such prior work, it'd be critical to provide sufficient biological rationales to perform a conceptually similar study in a naturalistic condition, i.e., not just "because no previous work has been done". The authors would have to clarify what type of neural mechanisms could be missed in conventional resting-state studies using, say, energy landscape analysis, but could be revealed in the naturalistic condition.

      We appreciate your insightful comments regarding the need for a biological rationale in our study. As you mentioned, there are similar studies: Meer et al. used Hidden Markov Models to identify various activation modes of brain networks that included subcortical regions[1], and Song et al. linked brain states to narrative understanding and attentional dynamics[2, 3]. These studies motivate our use of naturalistic stimulus datasets. Moreover, there is evidence that the thalamus plays a crucial role in processing information in more naturalistic contexts, which points to the importance of thalamocortical communication[4, 5]. We therefore aimed to link thalamic activity to cortical state transitions using the energy landscape description.

      To address these gaps in conventional resting-state studies, we explored an alternative method—maximum entropy modeling based on the energy landscape. This allowed us to validate how the thalamus responds to cortical state transitions. To enhance clarity, we will update our introduction to emphasize the motivations behind our research and the significance of examining these neural mechanisms in a naturalistic setting.

      (2) Effects of the uniqueness of the visual stimulus and reproducibility

      One of the main drawbacks of the naturalistic condition is the unexpected effects of the stimuli. That is, this study looked into the data recorded from participants who were watching Sherlock, but what would happen to the results if we analyzed the brain activity data obtained from individuals who were watching different movies? To ensure the generalizability of the current findings, it would be necessary to demonstrate qualitative reproducibility of the current observations by analysing different datasets that employed different movie stimuli. In fact, it'd be possible to find such open datasets, like www.nature.com/articles/s41597-023-02458-8.

      We appreciate your concern regarding the reproducibility of our findings. The dataset from the "Sherlock" study is of high quality and has shown good generalizability in various research contexts. We acknowledge the importance of validating our results with different datasets to enhance the robustness of our conclusions. While we are open to exploring additional datasets, we intend to pursue this validation once we identify a suitable alternative. Currently, we are considering a comparison with the dataset from "Forrest Gump" as part of our initial plan.

      (3) Spatial accuracy of the "Thalamic circuit" definition

      One of the main claims of this study heavily relies on the accuracy of the localization of two different thalamic architectures: matrix and core. Given the conventional or relatively low spatial resolution of the fMRI data acquisition (3x3x3 mm^3), it appears to be critically essential to demonstrate that the current analysis accurately distinguished fMRI signals between the matrix and core parts of the thalamus for each individual.

      We acknowledge the importance of accurately localizing the different thalamic architectures, specifically the matrix and core regions. To address this, we downsampled the atlas of matrix and core cell populations from the previous study from a resolution of 2x2x2 mm<sup>3</sup> to 3x3x3 mm<sup>3</sup>, which matches our fMRI data acquisition. We will report the atlas as Supplementary Figures in our revision.

      (4) More detailed analysis of the thalamic circuits

      In addition, if such thalamic localisation is accurate enough, it would be greatly appreciated if the authors perform similar comparisons not only between the matrix and core architectures but also between different nuclei. For example, anterior, medial, and lateral groups (e.g., pulvinar group). Such an investigation would meet the expectations of readers who presume some microscopic circuit-level findings.

      We appreciate your suggestion regarding a more detailed analysis of thalamic circuits. We have touched upon this in the discussion section as a forward-looking consideration. However, we believe that performing nuclei segmentation with 3T fMRI may not be ideal due to well-documented concerns regarding signal-to-noise ratio and spatial resolution. That said, we are interested in exploring these nuclei-pathway connections to cortical areas in future studies with a proper 7T fMRI naturalistic dataset.

      (5) Rationale for different time window lengths

      The authors adopted two different time window lengths to examine the neural dynamics. First, they used a 21-TR window for signal normalisation. Then, they narrowed down the window length to 13-TR periods for the following statistical evaluation. Such a seemingly arbitrary choice of the shorter time window might be misunderstood as a measure to relax the threshold for the correction of multiple comparisons. Therefore, it'd be appreciated if the authors stuck to the original 21-TR time window and performed statistical evaluations based on the setting.

      Thank you for your valuable feedback regarding the choice of time window lengths. We aimed to maintain consistency in window lengths across our analyses. In light of your comments and suggestions from other reviewers, we plan to test our results using different time window lengths and report findings that generalize across these variations. Should the results differ significantly, we will discuss the implications of this variability in our revised manuscript.

      (6) Temporal resolution

      After identifying brain states with energy landscape analysis, this study investigated the brain state transitions by directly looking into the fMRI signal changes. This manner seems to implicitly assume that no significant state changes happen in one TR (=1.5sec), which needs sufficient validation. Otherwise, like previous studies, it'd be highly recommended to conduct different analyses (e.g., random-walk simulation) to address and circumvent this problem.

      Thank you for raising this important point regarding temporal resolution. Many fMRI studies, such as those examining event boundaries during movie watching, operate under similar assumptions concerning state changes within one TR. For example, Barnett et al. computed dynamic functional connectivity (dFC) with a window of 20 TRs (24.4s). We therefore do not consider this a limitation specific to our study, but rather a common issue related to fMRI scanning parameters. To strengthen our analysis of state transitions and ensure they are not merely coincidental, we plan to conduct random-walk simulations, as suggested, to validate our findings in accordance with methodologies used in previous research.
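
      As an illustration of the kind of random-walk simulation referred to above, the following is a minimal sketch only, not the authors' planned procedure. It assumes a pairwise maximum entropy model with energy E(s) = -sum_i h_i s_i - 1/2 sum_{i != j} J_ij s_i s_j and a standard Metropolis acceptance rule, min(1, exp(E_old - E_new)), with single-network flips as the proposed moves. The function names and the parameters `h` and `J` are hypothetical and are not fitted values from the study.

```python
import math
import random


def energy(state, h, J):
    """Energy of a binary state (+1/-1 per network) under a pairwise model."""
    n = len(state)
    field = sum(h[i] * state[i] for i in range(n))
    pair = sum(J[i][j] * state[i] * state[j]
               for i in range(n) for j in range(n) if i != j)
    return -field - 0.5 * pair


def random_walk(h, J, n_steps, rng):
    """Metropolis random walk on the energy landscape; returns visit counts per state."""
    n = len(h)
    state = [rng.choice((-1, 1)) for _ in range(n)]
    e = energy(state, h, J)
    visits = {}
    for _ in range(n_steps):
        # Propose a neighbouring state by flipping one randomly chosen network.
        i = rng.randrange(n)
        proposal = state[:]
        proposal[i] = -proposal[i]
        e_new = energy(proposal, h, J)
        # Always accept downhill moves; accept uphill moves with prob exp(E_old - E_new).
        if e_new <= e or rng.random() < math.exp(e - e_new):
            state, e = proposal, e_new
        key = tuple(state)
        visits[key] = visits.get(key, 0) + 1
    return visits
```

      Because the walk is not tied to the sampling interval of the scanner, the visit counts and the sequence of visited states can be compared with the empirically observed transitions to check whether those transitions are consistent with the fitted landscape.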

      Reviewer #2 (Public review):

      Summary:

      In this study, Liu et al. investigated cortical network dynamics during movie watching using an energy landscape analysis based on a maximum entropy model. They identified perception- and attention-oriented states as the dominant cortical states during movie watching and found that transitions between these states were associated with inter-subject synchronization of regional brain activity. They also showed that distinct thalamic compartments modulated distinct state transitions. They concluded that cortico-thalamo-cortical circuits are key regulators of cortical network dynamics.

      Strengths:

      A mechanistic understanding of cortical network dynamics is an important topic in both experimental and computational neuroscience, and this study represents a step forward in this direction by identifying key cortico-thalamo-cortical circuits. The analytical strategy employed in this study, particularly the LASSO-based analysis, is interesting and would be applicable to other data types, such as task- and resting-state fMRI.

      We thank the reviewer for this comment and encouragement.

      Weaknesses:

      Due to issues related to data preprocessing, support for the conclusions remains incomplete. I also believe that a more careful interpretation of the "energy" derived from the maximum entropy model would greatly clarify what the analysis actually revealed.

      Thank you for your valuable suggestions, and we apologize for any misunderstandings regarding the interpretation of the energy landscape in our study. To address this issue, we will include a dedicated paragraph in both the methods and results sections to clarify our use of the term "energy" derived from the maximum entropy model. This addition aims to eliminate any ambiguity and provide a clearer understanding of what our analysis reveals.

      (1) I think the method used for binarization of BOLD activity is problematic in multiple ways.

      a) Although the authors appear to avoid using global signal regression (page 4, lines 114-118), the proposed method effectively removes the global signal. According to the description on page 4, lines 117-122, the authors binarized network-wise ROI signals by comparing them with the cross-network BOLD signal (i.e., the global signal): at each time point, network-wise ROI signals above the cross-network signal were set to 1, and the rest were set to −1. If I understand the binarization procedure correctly, this approach forces the cross-network signal to be zero (up to some noise introduced by the binarization of network-wise signals), which is essentially equivalent to removing the global signal. Please clarify what the authors meant by stating that "this approach maintained a diverse range of binarized cortical states in data where the global signal was preserved" (page 4, lines 121-122).

      Thank you for highlighting the potential issue with our binarization method. We appreciate your insights regarding the comparison of network-wise ROI signals with the cross-network BOLD signal, as this may inadvertently remove the global signal. To address this, we will conduct a comparative analysis of results obtained from both our current approach and the original pipeline. If we decide to retain our current method, we will carefully reconsider the rationale and rephrase our descriptions to ensure clarity regarding the preservation of the global signal and the diversity of binarized cortical states.

      b) The authors might argue that they maintained a diverse range of cortical states by performing the binarization at each time point (rather than within each network). However, I believe this introduces another problem, because binarizing network-wise signals at each time point distorts the distribution of cortical states. For example, because the cross-network signal is effectively set to zero, the network cannot take certain states, such as all +1 or all −1. Similarly, this binarization biases the system toward states with similar numbers of +1s and −1s, rather than toward unbalanced states such as (+1, −1, −1, −1, −1, −1). These constraints and biases are not biological in origin but are simply artifacts of the binarization procedure. Importantly, the energy landscape and its derivatives (e.g., hard/easy transitions) are likely to be affected by these artifacts. I suggest that the authors try a more conventional binarization procedure (i.e., binarization within each network), which is more robust to such artifacts.

      Related to this point, I have a question regarding Figure S1, in which the authors plotted predicted versus empirical state probabilities. As argued above, some empirical state probabilities should be zero because of the binarization procedure. However, in Figure S1, I do not see data points corresponding to these states (i.e., there should be points on the y-axis). Did the authors plot only a subset of states in Figure S1? I believe that all states should be included. The correlation coefficient between empirical and predicted probabilities (and the accuracy) should also be calculated using all states.

      Thank you for your thoughtful examination of our data processing pipeline. We agree that a comparison between the conventional binarization method and our current approach is warranted, and we appreciate your suggestion. Upon reviewing Figure S1, we discovered that there was indeed an error related to the plotting style set to "log10." As you correctly pointed out, the data should reflect that the probabilities for states where all networks are either activated or deactivated are zero. We are very interested in exploring the state distributions obtained from both the original and current approaches, as your comments highlight important considerations. We sincerely appreciate your insightful feedback and will make sure to address these points thoroughly in our first revision.

      c) The current binarization procedure likely inflates non-neuronal noise and obscures the relationship between the true BOLD signal and its binarized representation. For example, consider two ROIs (A and B): both (+2%, +1%) and (+0.01%, −0.01%) in BOLD signal changes would be mapped to (+1, −1) after binarization. This suggests that qualitatively different signal magnitudes are treated identically. I believe that this issue could be alleviated if the authors were to binarize the signal within each network, rather than at each time point.

      Thank you for your important observation regarding the potential inflation of non-neuronal noise in our current binarization procedure. We recognize that this process could lead to qualitatively different signal magnitudes being treated similarly after binarization, as you illustrated with your example. While we acknowledge your point, we believe that conventional binarization pipelines may also encounter this issue, albeit by comparing signals to a network's temporal mean activity. To address this concern and maintain consistency with previous studies, we will discuss this limitation in our revised manuscript. Additionally, if deemed necessary, we will explore implementing a percentile-based threshold above the baseline to further refine our binarization approach. Your suggestion provides a valuable perspective, and we appreciate your insights.
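      The two binarization schemes discussed in points (a) to (c) can be contrasted on toy data. The following is a minimal sketch, not the authors' pipeline: the array shapes, the simulated "global" fluctuation, and the function names are all assumptions made for illustration. It shows that thresholding against the cross-network mean at each time point can never yield an all +1 or all −1 state, whereas thresholding each network against its own temporal mean can.

```python
import numpy as np

def binarize_across_networks(x):
    """Threshold each time point (row) against the mean across networks (columns)."""
    return np.where(x > x.mean(axis=1, keepdims=True), 1, -1)

def binarize_within_network(x):
    """Threshold each network (column) against its own mean over time."""
    return np.where(x > x.mean(axis=0, keepdims=True), 1, -1)

def count_uniform_states(b):
    """Count time points at which every network has the same sign."""
    return int(np.sum(np.abs(b.sum(axis=1)) == b.shape[1]))

# Hypothetical toy data: a shared fluctuation plus independent network noise.
rng = np.random.default_rng(0)
n_time, n_net = 2000, 6
shared = rng.standard_normal((n_time, 1))
x = shared + 0.5 * rng.standard_normal((n_time, n_net))

across = binarize_across_networks(x)
within = binarize_within_network(x)
print("uniform states (across-network threshold):", count_uniform_states(across))
print("uniform states (within-network threshold):", count_uniform_states(within))

# Point (c): very different magnitudes collapse onto the same binary pattern.
pair = binarize_across_networks(np.array([[2.0, 1.0], [0.01, -0.01]]))
print(pair)
```

      In this sketch the across-network scheme removes the shared fluctuation by construction, which is the behaviour the reviewer describes as equivalent to global signal removal.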

      (2) As the authors state (page 5, lines 145-148), the "energy" described in the energy landscape is not biological energy but rather a statistical transformation of probability distributions derived from the Boltzmann distribution. If this is the case, I believe that Figure 2A is potentially misleading and should be removed. This type of schematic may give the false impression that cortical state dynamics are governed by the energy landscape derived from the maximum entropy model (which is not validated).

      Thank you for your valuable feedback regarding Figure 2A. We apologize for any confusion it may have created. While we recognize that similar figures are commonly used in literature involving energy landscapes (maximum entropy model), we agree that Figure 2A may mislead readers into thinking that cortical state dynamics are directly governed by the energy landscape derived from the maximum entropy model, which has not been validated. In light of your comments, we will remove Figure 2A and instead emphasize the analytical strategy presented in Figure 2B. Additionally, we will provide a simplified line graph as an illustrative example to clarify the concepts without the potential for misinterpretation.
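      For readers unfamiliar with the formalism, the statistical meaning of "energy" can be stated compactly. The expressions below are the standard form of the pairwise maximum entropy model; the symbols (binary state vector σ, bias terms h, coupling terms J) are conventional notation and are assumed here rather than taken from the manuscript.

```latex
P(\sigma) = \frac{e^{-E(\sigma)}}{Z}, \qquad
E(\sigma) = -\sum_{i} h_i \sigma_i - \frac{1}{2} \sum_{i \neq j} J_{ij}\, \sigma_i \sigma_j, \qquad
Z = \sum_{\sigma'} e^{-E(\sigma')}

% Rearranged: energy is a negative log-probability up to an additive constant,
% so an energy difference is simply a log ratio of state probabilities.
E(\sigma) = -\log P(\sigma) - \log Z, \qquad
E(\sigma_a) - E(\sigma_b) = \log \frac{P(\sigma_b)}{P(\sigma_a)}
```

      Under this reading, a "low-energy" state is one that the fitted model assigns a high probability, and nothing in the definition refers to metabolic or physiological energy.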

      Reviewer #3 (Public review):

      Summary:

      In this study, Liu et al. analyze fMRI data collected during movie watching by applying an energy landscape method with pairwise maximum entropy models. They identify a set of brain states defined at the level of canonical functional networks and quantify how the brain transitions between these states. Transitions are classified as "easy" or "hard" based on changes in the inferred energy landscape, and the authors relate transition probabilities to inter-subject correlation. A major emphasis of the work is the role of the thalamus, which shows transition-linked activity changes and dynamic connectivity patterns, including differential involvement of parvalbumin- and calbindin-associated thalamic subdivisions.

      Strengths:

      The study is methodologically complex and technically sophisticated. It applies advanced analytical methods to high-dimensional fMRI data. The application of energy landscape analysis to movie-watching data appears to be novel as well. The finding that the thalamus is involved in energy state transitions provides a strong link to several theories of thalamic control functions, which is a notable strength.

      Thanks for your comments on the novelty of our study.

      Weaknesses:

      The main weakness is the conceptual clarity and advances that this otherwise sophisticated set of analyses affords. A central conceptual ambiguity concerns the energy landscape framework itself. The authors note that the "energy" in this model is not biological energy but a statistical quantity derived from the Boltzmann distribution. After multiple reads, I still have major trouble mapping this measure onto any biological and cognitive operations. BOLD signal is a measure of oxygenation as a proxy of neural activity, and correlated BOLD (functional connectivity) is thought to measure the architecture of information communication of brain systems. The energy framework described in the current format is very difficult for most readers to map onto any neural or cognitive knowledge base on the structure and function of brain systems. Readers unfamiliar with maximum entropy models may easily misinterpret energy changes as reflecting metabolic cost, neural effort, or physiological variables, and it is just very unclear what that measure is supposed to reflect. The manuscript does not clearly articulate what conceptual and mechanistic advances the energy formalism provides beyond a mathematical and statistical report. In other words, beyond mathematical description, it is very hard for most readers to understand the process and function of what this framework is supposed to tell us in regards to functional connectivity, brain systems, and cognition. The brain is not a mathematical object; it is a biological organ with cognitive functions. The impact of this paper is severely limited until connections can be made.

      Thank you for your insightful and constructive comments regarding the conceptual clarity of our energy landscape framework. We appreciate your perspective on the challenges of mapping the statistical measure of "energy" derived from the Boltzmann distribution onto biological and cognitive operations. To address these concerns, we will revise our manuscript to clarify our expressions surrounding "energy" and emphasize its probabilistic nature. Additionally, we will incorporate a series of analyses that explicitly relate the features of the energy landscape to cognitive processes and key parameters, such as brain integration and functional connectivity. We believe these changes will help bridge the gap between our mathematical framework and its relevance to understanding brain systems and cognitive functions.

      Relatedly, the use of metaphors such as "valleys," "hills," and "routes" in multidimensional measures lacks grounding. Valleys and hills of what is not intuitive to understand. Based on my reading, these features correspond to local minima and barriers in a probability distribution over binarized network activation patterns, but similar to the first point, the manuscript does not clearly explain what it means conceptually, neurobiologically, or computationally for the brain to "move" through such a landscape. The brain is not computing these probabilities; they are measurement tools of "something". What is it? To advance beyond mathematical description, these measurements must be mapped onto neurobiological and cognitive information.

      Thank you for your valuable feedback. In our revisions, we will aim to link the concept of rapid transition routes in the energy landscape to cognitive processes, such as narrative understanding and related features. By exploring these connections, we hope to provide a clearer context for how our framework can enhance understanding of cognitive functions and their neural correlates.

      This conceptual ambiguity goes back to the Introduction. At the level of motivation, the purpose and deliverables of the study are not defined in the Introduction. The stated goal is "Transitions between distinct cortical brain states modulate the degree of shared neural processing under naturalistic conditions". I do not know if readers will have a clear answer to this question at the end. Is the claim that state transitions cause changes in inter-subject correlation, that they index moments of narrative alignment, or that they reflect changes in attentional or cognitive mode? This level of explanation is largely dissociated from the methods in their current form.

      Thank you for highlighting this important point regarding the conceptual clarity in our Introduction. We appreciate your feedback about the motivation and objectives of the study. To clarify the stated goal of investigating how transitions between distinct cortical brain states modulate shared neural processing under naturalistic conditions, we will revise the manuscript to explicitly define the specific claims we aim to address. We will ensure that these explanations are closely tied to the methods employed in our study, providing a clearer framework for our readers.

      Several methodological choices can use clarification. The use of a 21-TR window centered on transition offsets is unusually long relative to the temporal scale of fMRI dynamics and to the hypothesized rapidity of state transitions. On a related note, what is the temporal scale of state transition? Is it faster than 21 TRs?

      Thank you for your insightful questions regarding our methodological choices. Our focus on specific state transitions necessitated the use of a 21-TR window. While it is true that other transitions may occur within this window, averaging across instances of the same transition at different times allows us to identify distinctive thalamic BOLD patterns that precede cortical state transitions. This methodology enables us to capture relevant dynamics while ensuring that we focus on the transitions of interest. We appreciate your feedback, and this clarification will be included in our revised manuscript. We will also add a figure that describes the dwell time of cortical states.
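      The averaging logic described in this response can be sketched as an event-locked average. This is an illustrative sketch only: the function name, the synthetic signal, the event positions, and the placement of the pre-event bump are hypothetical, and the 10-sample half window is chosen simply so that the full window spans 21 samples, matching the 21-TR window mentioned above.

```python
import numpy as np

def event_locked_average(signal, event_idx, half_window=10):
    """Average `signal` over windows of 2*half_window+1 samples centered on each event.

    Events whose window would run past either end of the signal are skipped.
    """
    windows = [
        signal[t - half_window : t + half_window + 1]
        for t in event_idx
        if t - half_window >= 0 and t + half_window < len(signal)
    ]
    if not windows:
        raise ValueError("no event has a complete window")
    return np.mean(windows, axis=0)

# Hypothetical signal: noise plus a small bump 3 samples before each event.
rng = np.random.default_rng(1)
signal = 0.1 * rng.standard_normal(1000)
events = [100, 300, 500, 700, 900]
for t in events:
    signal[t - 3] += 1.0

avg = event_locked_average(signal, events)
print(avg.shape, int(np.argmax(avg)))
```

      Activity that is consistently time-locked to the event survives the averaging, while unrelated fluctuations (including other transitions that fall inside a given window at inconsistent lags) tend to average out.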

      The choice of movie-watching data is a strength. But, many of the analyses performed here, energy landscape estimation, clustering of states, could in principle be applied to resting-state data. The manuscript does not clearly articulate what is gained, mechanistically or cognitively, by using movie stimuli beyond the availability of inter-subject correlation.

      Thank you for your question, which closely aligns with a concern raised by Reviewer #1. Our core hypothesis posits that naturalistic stimuli yield a broader set of brain states compared to those observed during resting-state conditions. To support this assertion, we will clearly articulate the findings from previous studies that relate to this hypothesis. Additionally, if appropriate, we will provide a comparative analysis between our data and resting-state data to highlight the differences and emphasize the uniqueness of the brain states elicited by naturalistic stimuli.

      Because of the above issues, a broader concern throughout the results is the largely descriptive nature of the findings. For example, the LASSO analysis shows that certain state transitions predict ISC in a subset of regions, with respectable R² values. While statistically robust, the manuscript provides little explanation of why these particular transitions should matter, what computations they might reflect, or how they relate to known cognitive operations during movie watching. Similar issues arise in the clustering analyses. Clustering high-dimensional fMRI-derived features will almost inevitably produce structure, whether during rest, task, or naturalistic viewing. What is missing is an explanation of why these specific clusters are meaningful in functional or mechanistic terms.

      Thank you for your questions. In our revisions, we will perform additional analyses aimed at linking state transitions to cognitive processes more explicitly. Regarding clustering, we will provide a thorough discussion in the revised manuscript.

      Finally, the treatment of the thalamus, while very exciting, could use a bit more anatomical and circuit-level specificity. The manuscript largely treats the thalamus as a unitary structure, despite decades of work demonstrating big functional and connectivity differences across thalamic nuclei. A whole-thalamus analysis without more detailed resolution is increasingly difficult to justify. The subsequent subdivision into PVALB- and CALB-associated regions partially addresses this, but these markers span multiple nuclei with overlapping projection patterns.

      This suggestion aligns with the feedback from Reviewer #1. We believe that performing nuclei segmentation with 3T fMRI may not be ideal due to well-documented concerns regarding signal-to-noise ratio and spatial resolution. Therefore, investigating core and matrix cell projections across different thalamic nuclei using 7T fMRI presents a promising avenue for further study.

      (1) Van Der Meer J N, Breakspear M, Chang L J, et al. Movie viewing elicits rich and reliable brain state dynamics [J]. Nature Communications, 2020, 11(1): 5004.

      (2) Song H, Park B Y, Park H, et al. Cognitive and Neural State Dynamics of Narrative Comprehension [J]. Journal of Neuroscience, 2021, 41(43): 8972-8990.

      (3) Song H, Shim W M, Rosenberg M D. Large-scale neural dynamics in a shared low-dimensional state space reflect cognitive and attentional dynamics [J]. Elife, 2023, 12.

      (4) Shine J M, Lewis L D, Garrett D D, et al. The impact of the human thalamus on brain-wide information processing [J]. Nature Reviews Neuroscience, 2023, 24(7): 416-430.

      (5) Yang M Y, Keller D, Dobolyi A, et al. The lateral thalamus: a bridge between multisensory processing and naturalistic behaviors [J]. Trends in Neurosciences, 2025, 48(1): 33-46.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      In this study, Acosta-Bayona et al. aim to better understand how environmental conditions could have influenced specific gene functions that may have been selected for during the domestication of teosinte parviglumis into domesticated maize. The authors are particularly interested in identifying the initial phenotypic changes that led to the original divergence of these two subspecies. They selected heavy metal (HM) stress as the condition to investigate. While the justification for this choice remains speculative, paleoenvironmental data would add value; the authors hypothesize that volcanic activity near the region of origin could have played a role.

      Our choice to investigate the effects of heavy metal stress is not speculative. As now mentioned in the Abstract, the elucidation of the genome of the Palomero toluqueño maize landrace revealed heavy metal effects during domestication (Vielle-Calzada et al., Science 2009). Our aim was to test the hypothesis that heavy metal (HM) stress influenced the evolutionary transition of teosinte parviglumis to maize.

      (1) Although the paper presents some interesting findings, it is difficult to distinguish which observations are novel versus already known in the literature regarding maize HM stress responses. The rationale behind focusing on specific loci is often lacking. For example, a statistically significant region identified via LOD score on chromosome 5 contains over 50 genes, yet the authors focus on three known HM-related genes without discussing others in the region. It is unclear why ZmHMA1 was selected for mutagenesis over ZmHMA7 or ZmSKUs5.

      We appreciate the depth and value of this comment.

      Maize phenotypic responses to sublethal concentrations of heavy metals, copper (Cu) and cadmium (Cd) in particular, are well characterized and published, and are in agreement with our results. In the first section of the Results (pgs 7 and 8), we added pertinent references to clearly show which observations are already known. By contrast, teosinte parviglumis responses are in all cases novel. To our knowledge, this is the first study to analyze in detail the phenotypic response of teosinte to sublethal concentrations of heavy metals, specifically Cu and Cd. We have now emphasized the novelty of these observations (pg 8).

      To address the fact that we only focused on three known HM-related genes without discussing others in the statistically significant region identified via LOD score on chr.5, we have added a full section that reads as follows (pgs. 11 to 13 of the new version):

      “Large-scale genomic and transcriptomic comparisons indicate that many HM response genes were positively selected across the maize genome.

      To expand the results well beyond the analysis of the three genes previously described, we performed a detailed analysis of genetic diversity across the 11.47 Mb genomic region comprised between ZmSKUs5 and ZmHMA1. This additional analysis reveals general tendencies in the quantity and nature of loci that were affected by positive selection during the teosinte parviglumis to maize transition in a region identified via LOD score on chr.5. We compared nucleotide variability by using 100 bp bins covering loci composed of two 30 Kb segments upstream and downstream of coding sequences, respectively, and the coding sequence itself, for 173 genes present within the genomic region comprised between ZmSKUs5 and ZmHMA1 (Figure S1 and Supplementary File 6). Two types of statistical tests (ANOVA and Wilcoxon) were applied to nucleotide variability comparisons using the entirety of each locus. The Benjamini-Hochberg procedure was used to control the false discovery rate (FDR<0.05) and limit type I errors (false positives). Although some individual loci appear as differently classified depending on the statistical test applied (22 out of 173 loci), the general differences in nucleotide variability are consistently maintained within the subregions described below. We found that 166 out of 173 loci show signatures of positive selection and are roughly organized in five independent subregions of variable length. The first six loci are consecutively ordered in a 402 Kb subregion that includes ZmSKUs5. A second group of 13 consecutive loci extends over a 1.44 Mb subregion that contains NRAMP ALUMINUM TRANSPORTER1, also involved in HM response through uptake of divalent ions. A third group of 17 consecutive loci extends over 1.28 Mb; eleven contain genes encoding uncharacterized proteins. The fourth group is composed of 57 consecutive loci extending over 3.22 Mb and contains genes encoding DEFECTIVE KERNEL55, AUXIN RESPONSE FACTOR16, and peroxidases involved in responses to oxidative stress. The fifth group contains 12 consecutive loci extending over 713 Kb and contains ZmHMA1. An additional segment of approximately 1.17 Mb, containing 25 consecutive loci that were positively selected, extends away from the ZmSKUs5-ZmHMA1 segment; it also contains several genes encoding peroxidases. Although multiple loci include genes that could be involved in abiotic stress and oxidative responses, these results suggest that multiple factors other than HM stress could have played a role in the evolutionary mechanisms that affected the genetic diversity of chr.5 during the teosinte parviglumis to maize transition.

      To further analyze the possibility that HM response could have played a role in maize emergence and subsequent domestication, we analyzed large-scale transcriptomic data corresponding to independent experiments aimed at understanding the response of maize roots to HM stress. Six available transcriptomes were selected for in-depth analysis because they presented a fold change strictly higher than 1, and their results were supported by false discovery rates (FDR<0.05). These six transcriptomes (Table S5) included HM response datasets corresponding to growth conditions that not only incorporated Cu, but also lead (Pb) and chromium (Cr) that were not included in the substrate of our experiments. Transcriptional profiles were obtained from roots of plants at different stages: maize seedlings (Shen et al., 2012; Gao et al., 2015; Zhang et al., 2024a), three-week-old plantlets (Yang et al., 2023), and plants at V2 stage (Zhang et al., 2024b; Fengxia et al., 2025). A total of 120 genes shared by all six transcriptomes were found to be differentially expressed under HM stress conditions (66 upregulated and 54 downregulated; Figure S3), including ZmSKUs5, ZmHMA1 and ZmHMA7; 52 of them (43.3%) are located in maize loci showing less than 70% of the nucleotide variability found in teosinte parviglumis, suggesting that they were affected by positive selection (Yamasaki et al., 2005; Supplementary File 7). Of the 18 that map to chr.5, twelve are within the 82 cM that fractionates into multiple QTLs under selection during the parviglumis to maize transition. Interestingly, five additional loci containing HM response genes completely lack SNPs within their total length in both parviglumis and maize, and 19 additional loci lack SNPs in at least one 30 Kb segment or their coding region (Supplementary File 7), suggesting the frequent presence of ultraconserved genomic regions in many loci containing HM response genes. When this same analysis was conducted in a set of loci comprising 63 genes previously identified as differentially expressed in response to abiotic stress not directly related to HM responses (hypoxia; nutritional deficiency; soil alkalinity; drought; soil salinity), 18 loci (28.6%) showed less than 70% of the nucleotide variability found in teosinte parviglumis. Only one of them maps to chr.5 and none contained segments or coding regions lacking SNPs in parviglumis or maize. These results suggest that in contrast to other types of abiotic stress response genes, loci comprising a large set of genes that unambiguously respond to HM stress caused by chemical elements of diverse nature were affected by positive selection during the parviglumis to maize transition, irrespective of their position in the genome.”

      The detailed analysis of genetic diversity across 11.47 Mb of chr.5 in the genomic region comprised between ZmSKUs5 and ZmHMA1 is presented as Supplementary File 6.

      The analysis of genetic diversity in loci encompassing heavy metal response genes shared by six transcriptomes and abiotic stress controls is described in Supplementary File 7.
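      The false discovery rate control mentioned in the quoted section follows a simple step-up rule. The sketch below is a generic implementation of the Benjamini-Hochberg procedure written for illustration; it is not the authors' code, and the p-values are hypothetical placeholders rather than values from the per-locus ANOVA or Wilcoxon tests.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at false discovery rate `alpha`."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices sorting p ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * rank / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()            # largest rank meeting its threshold
        reject[order[: k + 1]] = True              # reject that rank and all smaller p-values
    return reject

# Hypothetical p-values, for illustration only.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
print(benjamini_hochberg([0.02, 0.02, 0.02, 0.02]))
```

      The second call illustrates the step-up property: a p-value that fails its own rank threshold is still rejected if a larger-ranked p-value meets its threshold.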

      In the Discussion (pgs. 21 and 22), we added a paragraph section that reads as follows:

      “Although loss of genetic diversity is usually the result of human selection during domestication, it can also represent a consequence of natural selective pressures favoring fitness of specific teosinte parviglumis allelic variants better adapted to environmental changes and subsequently affected by human selection during the domestication process. This possibility is reflected by widely spread selective sweeps affecting a large portion of chr.5 that contains hundreds of genes showing signatures of positive selection. The analysis of 11.47 Mb covering the ZmHMA1-ZmSKUs5 segment confirms the presence of large but discrete genomic subregions that were positively selected during the teosinte parviglumis to maize transition. Although several contain genes involved in HM response and oxidative stress, the diversity of gene functions does not necessarily favor abiotic stress over other factors that could be at the origin of selective forces affecting these regions. By contrast, a large-scale transcriptomic survey indicates that genes consistently responding to HMs (Cu, Cd, Pb and Cr) show signatures of positive selection at unusually high frequencies (43.3%) as compared to loci containing genes responding to other types of abiotic stress (28.6%). Our identification of HM response genes affected by positive selection is far from being exhaustive. Nevertheless, it agrees with the expected effects of a widespread selective sweep caused by environmental changes that influenced the parviglumis to maize transition at the genetic level. Of particular interest are 24 loci that partially or completely lack SNPs in both teosinte parviglumis and maize, suggesting possible genetic bottlenecks that occurred before the teosinte to maize transition. Examples of other edaphological factors driving genetic divergence either in the teosintes or maize include local adaptation to phosphorus concentration in mexicana and parviglumis (Aguirre-Liguori et al. 2019), and fast maize adaptation to changing iron availability through the action of genes involved in its mobilization, uptake, and transport (Benke and Stich 2011). Our results reveal a teosinte parviglumis environmental plasticity that could be related to the function of HM response genes positively selected during the teosinte parviglumis to maize transition. Previous studies have demonstrated that transposable elements (TEs) contribute to activation of maize genes in response to abiotic stress, affecting up to 20% of the genes upregulated in response to abiotic stress, and as many as 33% of genes that are only expressed in response to stress (Makarevitch et al., 2015). It is therefore possible that the HM response of some specific genes that influenced maize emergence or domestication could be mediated by TEs influencing or driving their transcriptional regulation.”

      The mutagenic analysis of ZmHMA7 and ZmSKUs5 will be included in a different publication.

      (2) The idea that HM stress impacted gene function and influenced human selection during domestication is of interest. However, the data presented do not convincingly link environmental factors with human-driven selection or the paleoenvironmental context of the transition. While lower nucleotide diversity values in maize could suggest selective pressure, it is not sufficient to infer human selection and could be due to other evolutionary processes. It is also unclear whether the statistical analysis was robust enough to rule out bias from a narrow locus selection. Furthermore, the addition of paleoclimate records (Paleoenvironmental Data Sources as a starting point) or conducting ecological niche modeling or crop growth models incorporating climate and soil scenarios would strengthen the arguments.

      We think that the detailed analysis of genetic diversity across 11.47 Mb covering the ZmSKUs5 to ZmHMA1 genomic segment, together with its statistical validation, provides a precise understanding of the selective sweep dimensions in chr.5.

      We do agree that lower nucleotide diversity values in maize are not sufficient to infer human selection. Because many HM response loci show unusually low nucleotide variability in teosinte parviglumis (see the results of the transcriptomic analysis presented above), we cannot discard the possibility that natural selection forces related to environmental changes could have affected native populations of teosinte parviglumis.

      To further explore the link between environmental factors, natural or human-driven selection, and the paleoenvironmental context of the parviglumis to maize transition, we revised paleoenvironmental and geological records and added results in two sections that read as follows (pgs. 17 to 20):

      “Paleoenvironmental studies reveal periods of climatic instability in the presumed region of maize emergence during the early Holocene.

      It is well accepted that temperature fluctuations, volcanism and anthropogenic impact shaped the distribution and abundance of plant species in the Transmexican Volcanic Belt (TMVB) during the last 14,000 years (Torrescano-Valle et al. 2019). The TMVB has produced close to 8000 volcanic structures (Ferrari et al., 2011), transforming the relief multiple times, and causing hydrographic and soil changes that actively modified the distribution and composition of plant communities in Central Mexico. Detailed paleoenvironmental data for the Pleistocene and Holocene are available for several lacustrine zones located within the 50 to 100 km range of the region currently considered the cradle of maize domestication (Matsuoka et al. 2002; Figure 5a). In Lake Zirahuén (102°44′ W; 19°26′ N and approximately 2075 meters above sea level; index [i] in Figure 5a), pollen, microcharcoal and magnetic susceptibility analyses of two sedimentary sequences reveal three periods of major ecological change during the early and middle Holocene.

      Between 9500 and 9000 calibrated years before present (cal yr BP), pine forests seem to have been associated with summer insolation increases. A second peak of forest change occurred at around 8200 cal yr BP, coinciding with cold oscillations documented in the North Atlantic. Finally, events that occurred between 7500 and 7100 cal yr BP show an abrupt change in the plant community related to humid Holocene climates and a presumed volcanic event (Lozano-García et al., 2013). The environmental history of the central Balsas watershed has also been documented by pollen, charcoal, and sedimentary analysis conducted in three lakes and a swamp of the Iguala valley (Piperno et al. 2007). Paleoecological records of lake Ixtacyola (18°20N, 99°35W and approximately 720 meters above sea level; index [ii] in Figure 5a) and lake Ixtapa (18°21N, 99°26W) indicate that an important increase in temperature and precipitation occurred between 13000 and 10000 cal yr BP. The pollen record of Ixtacyola showed that members of the genus Zea were already part of the vegetation coverage by 12900 to 13000 cal yr BP, suggesting that some teosintes, likely including parviglumis, were commonly found at elevation areas where they do not presently occur. Lake Almoloya (also named Chignahuapan; 19°05N, 99°20W and approximately 2575 meters above sea level; index [iii] in Figure 5a) in the upper Lerma basin is only 20 km from the crater of the Nevado de Toluca, which is responsible for creating the late Pleistocene Upper Toluca Pumice layer over which the Lerma basin is deposited. Pollen records indicate the presence of Zea species by 11080 to 10780 cal yr BP. As at other locations, an important period of climatic instability prevailed between 11500 and 8500 cal yr BP (Ludlow-Wiechers et al., 2005). Humidity fluctuations occurred until 8000 cal yr BP, with a stable temperate climate between 8500 and 5000 cal yr BP. Although pollen and diatom studies are often difficult to interpret at a regional scale, the overall results presented above suggest that Zea plants were consistently present during periods of environmental and climatic instability that correlate with the history of volcanic activity during the early Holocene, as described in the next section.

      Temporal and geographical convergence between volcanic eruptions and maize emergence during the Holocene.

      Current evidence indicates that the emergence and domestication of maize began in Mesoamerica some time around 9,000 yr BP (Matsuoka et al. 2002). The teosinte parviglumis populations that are phylogenetically most closely allied with maize are currently distributed in a region located between the Michoacan-Guanajuato Volcanic Field (MGVF) to their northwest, and the Nevado de Toluca and Popocatépetl volcanoes to their east and northeast (Figure 5a; Matsuoka et al. 2002). Precise records of field data indicate that ten accessions were collected in the Balsas river drainage near Teloloapan and Sierra de Huautla (Guerrero), approximately 100 km south of the Nevado de Toluca crater. Three other accessions were collected near Tejupilco de Hidalgo and Zacazonapan (Estado de México), approximately 50 to 60 km from the Nevado de Toluca crater (8762, JSG y LOS-161, and JSG-391). Four other accessions were located in Michoacan, at a location within the MGVF (accession 8763), or at mid-distance between the MGVF and the Nevado de Toluca crater (accessions JSG y LOS-130, 8761, and 8766).

      The most important source of HMs in ancient soils of Mesoamerica is TMVB-dependent volcanic activity through short- and long-term effects related to lava deposits, ores, hydrothermal flow, and ash (Torrescano-Valle et al. 2019). The Nevado de Toluca volcano produced one of the most powerful eruptions from central Mesoamerica in the Holocene, giving rise to the Upper Toluca Pumice deposit at 12621 to 12025 cal yr BP (Arce et al., 2003; Figure 5b). The pumice fallout blanketed the Lerma and Mexico basins with 40 cm of coarse ash (Bloomfield and Valastro 1977; Arce et al. 2003). A second eruption dated by 36Cl exposure occurred at 9700 cal yr BP (Arce et al. 2003; Figure 5b), and the most recent eruption occurred at 3580 to 3831 cal yr BP (Macías et al. 1997). During the early and middle Holocene, the Popocatépetl volcano produced at least four eruptions dated 13037–12060, 10775–9564, 8328–7591, and 6262–5318 cal yr BP (Siebe et al. 1997); three other important eruptions occurred during the late Holocene, between 2713 and 733 cal yr BP (Siebe and Macías, 2006). In addition, the MGVF is a monogenetic volcanic field for which 23 independent eruptions have been documented during the Holocene, 21 of them located towards the southern part of the field, in close proximity to the region harboring some of the teosinte parviglumis populations most closely related to maize. Three of these eruptions occurred in the early Holocene (El Huanillo 1130 to 9688 cal yr BP; La Taza 10649 to 10300 cal yr BP; Cerro Grande 10173 to 9502 cal yr BP; Figure 5b), and three others during the initial period of the middle Holocene, between 8400 and 7696 cal yr BP (La Mina, Los Caballos, and Cerro Amarillo; Figure 5b). On average, a new volcano forms every ~435 years in the MGVF (Macías and Arce, 2019). No less than 16 other eruptions occurred between 7159 cal yr BP and the present time (Figure 5b).
Soils of volcanic origin (andosols) are currently distributed in regions north-west from the Nevado de Toluca and Popocatépetl craters, in close proximity to teosinte parviglumis populations most closely related to maize (Figure S5). Although the modern distribution of teosinte populations may differ from their distribution around 9000 yr BP, and unknown populations more closely related to maize may yet be discovered, these data indicate that the date and region where maize emerged converge with the dates and locations of several volcanic eruptions that occurred during the Holocene in that same region.”

      (3) Despite the interest in examining HM stress in maize and the presence of a pleiotropic phenotype, the assessment of the impact of gene expression is limited. The authors rely on qPCR for two ZmHMA genes and the locus tb1, known to be associated with maize architecture. A transcriptomic analysis would be necessary to 1- strengthen the proposed connection and 2- identify other genes with linked QTLs, such as those in the short arm of chromosome 5.

      Real-time qPCR is an accurate and reliable approach to assess the expression of specific genes such as ZmHMA1 and Tb1, but we agree that our results do not allow us to establish a direct regulatory link between the function of Tb1, the pleiotropic parviglumis phenotype under HM stress, and the function of ZmHMA1. We also concede that the large transcriptional analysis of HM response in maize (presented above) does not elucidate a possible connection between these two genes. We have substantially toned down our conclusion in this section by modifying the end of the section on pg. 17, which now reads:

      “These results do not allow us to directly link the regulation of ZmHMA1 expression to the function of Tb1; however, they open an opportunity to further investigate the possibility that under HM stress, the formation of secondary ramifications in teosinte parviglumis could be repressed by transcription factors of the TCP family, including Tb1.”

      This is also emphasized in the Discussion (pg 21) as follows:

      “Under HM stress, we also show that Tb1 is overexpressed in the apical meristem of teosinte parviglumis, suggesting that formation of secondary ramifications is repressed by Tb1 function under HM stress, as in extant maize. At this stage we cannot rule out the possibility that Tb1 upregulation in parviglumis reflects a more generalized response to abiotic stress; however, the expression of ZmHMA1 is downregulated in W22 wild-type maize meristems in the presence of HMs but upregulated in teosinte parviglumis meristems, suggesting that a specific regulatory shift relating HM responses and ZmHMA1 function occurred during the teosinte parviglumis to maize transition.”

      On the other hand, the transcriptional analysis allowed the identification of 52 additional HM response genes showing signatures of positive selection that occurred during the parviglumis to maize transition; 12 of them map to chr.5, within the region of the short arm that contains linked QTLs. So far, genes involved in HM response and oxidative stress represent the most prevalent class of genes identified within the genomic region showing pleiotropic effects on domestication and multiple linked QTLs in chr.5.

      Reviewer #2 (Public review):

      Summary:

      This work explores the phenotypic developmental traits associated with Cu and Cd responses in teosinte parviglumis, a species evolutionarily related to extant maize crops. Cu and Cd could serve as a proxy for heavy metals present in the soils. The manuscript explores potential genetic loci associated with heavy metal responses and domestication identified in previous studies. This includes heavy metal transporters, which are unregulated during stress. To study that, the authors compare the plant architecture of maize defective in ZmHMA1 and speculate on its association with domestication.

      Strengths:

      Very few studies covered the responses of teosintes to heavy metal stress. The physiological function of ZmHMA1 in maize also gives some novelty in this study. The idea and speculation section is interesting and well-implemented.

      Weaknesses:

      The authors explored Cu/Cd stress but not a more comprehensive panel of heavy metals, making the implications of this study quite narrow. Some techniques used, such as end-point RT-PCR and qPCR, are substandard for the field. The phenotypic changes explored are not clearly connected with the potential genetic mechanisms associated with them, with the exception of nodal roots. If teosintes in response to heavy metal have phenotypic similarity with modern landraces of maize, then heavy metal stress might have been a confounding factor in the selection of maize and not a potential driving factor. Similar to the positive selection of ZmHMA1 and its phenotypic traits. In that sense, there is no clear hypothesis of what the authors are looking for in this study, and it is hard to make conclusions based on the provided results to understand its importance. The authors do not provide any clear data on the potential influence of heavy metals in the field during the domestication of maize. The potential role of Tb-1 is not very clear either.

      Thank you for these comments. We have now emphasized our hypothesis in the abstract and the last paragraph of the Introduction (pg. 6):

      “To test the hypothesis that heavy metal (HM) stress influenced the evolutionary transition of teosinte to maize, we exposed both subspecies to sublethal concentrations of copper and cadmium etc…”

      A comprehensive panel of heavy metals would not be more accurate in terms of simulating the composition of soils evolving across 9,000 years in the region where maize presumably emerged. Copper (Cu) and cadmium (Cd) each correspond to a different affinity group for proteins of the ZmHMA family. ZmHMA1 has preferential affinity for Cu and Ag (silver), whereas ZmHMA7 has preferential affinity for Cd, Zn (zinc), Co (cobalt), and Pb (lead). Since these P1b-ATPase transporters mediate the movement of divalent cations, their function remains consistent regardless of the specific metal tested, provided it belongs to the respective affinity group. By applying sublethal concentrations of Cd (16 mg/kg) and Cu (400 mg/kg), we caused a measurable physiological response while allowing plants to complete their life cycle, including the reproductive phase, facilitating a comprehensive analysis of metal stress adaptation. Whereas higher doses impair flowering or are lethal, lower Cu/Cd concentrations do not consistently show conventional phenotypic responses such as reduced plant growth (AbdElgawad et al. 2020; Atta et al., 2023).

      Based on comments by both reviewers, we now present a large transcriptional analysis that incorporates HM responses to lead (Pb) and chromium (Cr), in addition to Cu. Results show that many genes responding to Pb and Cr were also positively selected across the maize genome, suggesting that HM stress led to a ubiquitous rather than a specific evolutionary response to heavy metals (please see our response to Reviewer#1 and sections in pgs. 11 to 13).

      Real-time qPCR is an accurate and reliable approach to assess the expression of specific genes such as ZmHMA1 and Tb1, but we agree that our results do not allow us to establish a direct regulatory link between the function of Tb1, the pleiotropic parviglumis phenotype under HM stress, and the function of ZmHMA1. We also concede that the large transcriptional analysis of HM response in maize (presented above) does not elucidate a possible connection between these two genes. Therefore, we have substantially toned down our conclusion in this section by modifying the end of the section on pg. 17, which now reads:

      “These results do not allow us to directly link the regulation of ZmHMA1 expression to the function of Tb1; however, they open an opportunity to further investigate the possibility that under HM stress, the formation of secondary ramifications in teosinte parviglumis could be repressed by transcription factors of the TCP family, including Tb1.”

      There are two phenotypic changes clearly connected with the genetic mechanisms involved in the parviglumis to maize transition: plant height and the number of seminal roots (not nodal roots). These changes have now been emphasized in the Abstract and the description of the results.

      Regarding the possibility that HM stress represents a confounding factor in the selection of maize rather than a driving factor, we expanded the genomic analysis of genetic diversity well beyond the three genes under initial study, to cover a segment of 11.47 Mb between ZmSKUs5 and ZmHMA1. We compared nucleotide variability using 100 bp bins covering loci composed of two 30 Kb segments upstream and downstream of the coding sequence, respectively, and the coding sequence itself, for 173 genes present within the genomic region between ZmSKUs5 and ZmHMA1 (Figure S1 and Supplementary File 6). The full analysis is presented in a new section (pgs. 11 and 12). We found that 166 out of 173 loci show signatures of positive selection and are roughly organized in five independent subregions of variable length. Four out of five subregions contain more than one HM or oxidative stress response gene within loci showing signatures of positive selection. Although multiple factors other than HM stress could have played a role in the evolutionary mechanisms that affected the genetic diversity of chr.5, large-scale transcriptomic data from independent experiments aimed at understanding the response of maize roots to HM stress allowed the identification of 49 additional HM response genes within loci showing positive selection across the genome, a proportion (43.3%) far greater than the proportion of loci containing response genes to other types of abiotic stress not related to HMs (28.6%). These results are described in detail in pgs. 12 and 13 (Figure S3 and Supplementary File 7). They provide strong evidence in favor of HM stress, and not another factor, driving positive selection.
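
      The binned comparison of nucleotide variability described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes gap-free, equal-length aligned sequences held as plain strings, takes π as the mean proportion of pairwise differences per site, and uses the hypothetical helper names `pairwise_diversity` and `binned_pi`. Only the 100 bp default bin size comes from the text.

```python
from itertools import combinations


def pairwise_diversity(seqs):
    """Mean proportion of differing sites over all pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs or len(seqs[0]) == 0:
        return 0.0
    length = len(seqs[0])
    # Count mismatching sites for every pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


def binned_pi(seqs, bin_size=100):
    """Return one diversity value per consecutive bin of `bin_size` sites."""
    if not seqs:
        return []
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    length = len(seqs[0])
    # Slice every sequence at the same coordinates so bins stay comparable.
    return [
        pairwise_diversity([s[start:start + bin_size] for s in seqs])
        for start in range(0, length, bin_size)
    ]
```

      Running `binned_pi` separately on a maize panel and a teosinte panel over the same coordinates would give two per-bin profiles that can be compared; real data would also need handling for gaps and missing calls, which this sketch omits.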

      We now provide precise and pertinent paleoenvironmental data on the potential influence of heavy metals in the field. In the sections on pgs. 17 to 20 we review paleoenvironmental studies revealing periods of climatic instability in the presumed region of maize emergence during the early Holocene, and data indicating that the date and region where maize emerged converge with the dates and locations of several volcanic eruptions that occurred during the early and middle Holocene in that same region. Please see responses to Reviewer#1 for details.

      We agree that our results do not allow us to establish a direct regulatory link between the function of Tb1, the pleiotropic parviglumis phenotype under HM stress, and the function of ZmHMA1. We also concede that the large transcriptional analysis of HM response in maize (presented above) does not elucidate a possible connection between these two genes. Therefore, we have substantially toned down our conclusion in this section by modifying the end of the section on pg. 17, which now reads:

      “These results do not allow us to directly link the regulation of ZmHMA1 expression to the function of Tb1; however, they open an opportunity to further investigate the possibility that under HM stress, the formation of secondary ramifications in teosinte parviglumis could be repressed by transcription factors of the TCP family, including Tb1.”

      This is also emphasized in the Discussion (pg 21) as follows:

      “Under HM stress, we also show that Tb1 is overexpressed in the apical meristem of teosinte parviglumis, suggesting that formation of secondary ramifications is repressed by Tb1 function under HM stress, as in extant maize. At this stage we cannot rule out the possibility that Tb1 upregulation in parviglumis reflects a more generalized response to abiotic stress; however, the expression of ZmHMA1 is downregulated in W22 wild-type maize meristems in the presence of HMs but upregulated in teosinte parviglumis meristems, suggesting that a specific regulatory shift relating HM responses and ZmHMA1 function occurred during the teosinte parviglumis to maize transition.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      While the dataset generated provides an interesting foundation for hypothesis testing on HM stress and domestication, the current data do not sufficiently support the conclusions of the manuscript.

      (1) The description of maize and teosinte architecture under HM stress is well presented.

      However, traits like shoot height, leaf size reduction, and biomass loss also occur under other environmental stresses such as drought and salinity. Additional evidence beyond shoot and root architecture would help validate the link between tb1 expression and specific ZmHMA genes under HM stress, or whether it reflects a more generalized stress response.

      We have already addressed this point in detail in the public response to Reviewer#1.

      (2) The nucleotide variability analysis is interesting, but I would have liked to see additional information to clarify the choice of the data selection and the strength of the conclusions with human selection.

      We have already addressed this point in detail in the public response to Reviewer#1.

      a) The choice of Tripsacum dactyloides as the outgroup to determine nucleotide variability seems to be distant, and I wonder whether other combinations with a closer outgroup or multiple outgroups were tried to provide a more accurate context.

      Nucleotide variability in Tripsacum dactyloides is used to graphically illustrate an external reference and not as an outgroup in the extended analysis of genetic diversity at the locus and genomic level. We did not use Tripsacum dactyloides as an outgroup in our statistical analysis. We could indeed have used a closer teosinte subspecies as an outgroup, but at this stage no data rule out that environmentally related selective pressures could have affected genetic diversity in other teosintes. This possibility is currently being investigated.

      b) Evolutionary differences not related to human influence could affect the results. The phrase "order of magnitude difference in π values" needs statistical validation (e.g., confidence intervals, p-values).

      We agree and have eliminated the sentence, as it is no longer relevant in light of the detailed genomic analysis of genetic diversity presented in Supplementary File 6.

      c) The comparison with ZmGLB1, a neutral control locus, suggests that domestication-related changes in nucleotide variability are specific to the three candidate genes. However, the concept of neutrality is complex, and while ZmGLB1 may be considered neutral in this case, the argument does not address the possibility of other factors, such as linked selection, that could influence variability in these genes. Referencing Hufford et al. is insufficient and would require a deeper argument.

      We also agree with this comment. We think that the influence and consequences of linked selection are now well documented for the 11.46 Mb analyzed in chr.5 (pgs. 11 and 12 in the main text, and Supplementary File 6).

      (3) The statement: "Our evidence indicates that HM stress revealed a teosinte parviglumis environmental plasticity that is directly related to the function of specific HM response genes that were affected by domestication through human selection" is not supported by the presented data. The rationale for the specific Cd/Cu dosage used is unclear. A dose-response gradient would better demonstrate the nature and strength of the plastic response.

      Previous reports support the rationale for the specific HM dosage in this study; Cu/Cd dose-response gradients have been conducted in maize (AbdElgawad et al. 2020; Atta et al., 2023), but since no studies have been conducted in teosinte, we reasoned that it was important to apply the same treatment to both subspecies. We have now emphasized this rationale by adding the following in pg XX: “Whereas higher doses impair flowering or are lethal, lower Cu/Cd concentrations do not consistently show conventional phenotypic responses such as reduced plant growth (AbdElgawad et al. 2020; Atta et al., 2023)”.

      We agree that the statement raised by the reviewer needed revision in light of our results. We revised the statement to accurately reflect our current evidence as follows: “Our results reveal a teosinte parviglumis environmental plasticity that is likely related to the function of HM response genes positively selected during the teosinte parviglumis to maize transition.”

      (4) In maize, TEs are known to influence gene expression under abiotic stress, including for tb1 (PMID: 25569788). Since the author appears to make a causative conclusion between ZmHMA1, TB1, and HM stress, I would have liked to see a whole-transcriptome analysis and not a curation of two genes to determine whether other factors, such as TEs, can have effects that would lead to similar outcomes.

      We agree that this is definitely a possibility that we have not investigated at this stage. However, we added a paragraph to reflect this pertinent suggestion:

      “Previous studies have demonstrated that transposable elements (TEs) contribute to activation of maize genes in response to abiotic stress, affecting up to 20% of the genes upregulated in response to abiotic stress, and as many as 33% of genes that are only expressed in response to stress (Makarevitch et al., 2015). It is therefore possible that the HM response of some specific genes that influenced maize emergence or domestication could be mediated by TEs influencing or driving their transcriptional regulation.”

      (5) I would suggest that the authors carefully review the tables, figures, and the corresponding legends. For example :

      a) Table 2 is called before Table 1, I would therefore suggest changing the numbering to reflect the paragraph order.

      Thank you for your help; we have changed the order of the Tables in the new version.

      b) In Table 2, it is not clear whether the P value applies to the mean difference between WT and the mutant zmhma1, either in the presence or the absence of heavy metals. In addition, the authors need to use the P-value to estimate the differences between WT in the absence vs presence of HM, and WT in the absence of HM versus the mutant in the absence of HM (idem for presence).

      We addressed this issue in detail and added P-values and specific pairwise comparisons to that Table (now Table 1). Data are presented as mean ± standard deviation and were tested by a paired Student’s T-Test. When the effects were significant according to the T-Test, the treatments were compared with the Welch two sample T-Test at P < 0.05.

      c) Table 1 and Table 2: Indicate what type of statistical test was used and the number of plants used for each experiment (n). Also, I recommend the use of scientific notation for the P-values.

      The statistical tests have now been indicated and scientific notation has been added to the P-values; the number of plants and biological replicates are indicated in the Methods section.

      d) Lines 202 and 204: I assume Table 1 should be called instead of Table 2.

      This error has been corrected.

      e) General: In the text, when significance is highlighted along with measurements, the p-value needs to be added.

      We have added the P-value alongside the measurement for all significant differences.

      f) In the text, it is also mentioned that "the expression of ZMHMA1 was significantly increased in the presence of HMs (Figure 3c)". We are looking here at an RT-PCR, which is qualitative and without a robust quantitative comparison and statistics, I cannot conclude this assessment based on the presented evidence. No statistical measure is indicated here.

      Panel 3c is not RT-PCR but real-time qPCR, showing relative fold-change normalized to actin, with three technical replicates for each of three biological replicates. We have added error bars (SD) and P-values represented by asterisks (calculated with Student's t statistic) to support significant differences (P<0.05 and P<0.01). ZmHMA1 expression was significantly increased in the presence of HMs only in teosinte; there was no significant difference in maize.

      g) Figure 3 should at least have the gene name in the figure to quickly understand the figure panel. The key conserved domains should also be identified.

      We agree and apologize for the omission. The gene names have been added adjacent to the structures.

      h) Sentence at lines 459-460 lacks words and punctuation.

      This unfortunate error has also been corrected.

      i) Figure S1, the reference Lemmon and Doebley, 2024 should be Lemmon and Doebley, 2014 to harmonize with the text.

      The correct year is 2014. We have corrected this error.
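
      As an illustration of the relative fold-change normalized to actin mentioned for panel 3c in item f), the following Python sketch applies the widely used 2^-ΔΔCt calculation. The response does not state which formula was used, so the method, the function name `relative_fold_change`, and the Ct values in the usage line are assumptions for illustration only.

```python
from statistics import mean


def relative_fold_change(ct_target_treated, ct_ref_treated,
                         ct_target_control, ct_ref_control):
    """2^-ddCt fold-change of a target gene, normalized to a reference gene.

    Each argument is a list of Ct values from technical replicates
    of one biological sample.
    """
    # dCt: target Ct minus reference (e.g. actin) Ct within each condition.
    d_ct_treated = mean(ct_target_treated) - mean(ct_ref_treated)
    d_ct_control = mean(ct_target_control) - mean(ct_ref_control)
    # ddCt: treated condition relative to the untreated control.
    dd_ct = d_ct_treated - d_ct_control
    return 2 ** (-dd_ct)


# Hypothetical Ct values: the target amplifies two cycles earlier under treatment.
fold = relative_fold_change([22.0, 22.0, 22.0], [18.0, 18.0, 18.0],
                            [24.0, 24.0, 24.0], [18.0, 18.0, 18.0])
```

      Computing one fold-change per biological replicate in this way, and then testing the replicate values, keeps the statistics at the level of biological rather than technical replication.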

      Reviewer #2 (Recommendations for the authors):

      (1) The narrative should be clearer, starting with a clearer hypothesis that is later sustained or not in the results, and then discussed in the idea and speculation section.

      Thank you for the comment. We have clarified the hypothesis, which is now included in the abstract and the last paragraph of the Introduction. We hope it is now clear that the evidence presented supports our hypothesis.

      (2) Focus more on traits that are relevant, for example, nodal and seminal roots.

      We modified the text to emphasize three relevant traits. In the case of teosinte under HM stress, absence of tillering and increase in the number of female inflorescences. In the case of the zmhma1 mutant under HM stress, differences in the number of nodal roots, and differences in height.

      (3) RNA-seq in Cu/Cd stress could make the work much more useful and complete.

      As previously mentioned, we have incorporated a large-scale transcriptional analysis based on six statistically validated transcriptomes (Table S5). Please see sections pgs. 11 to 13 for details.

    1. Lady Susan to Mrs. Johnson. Churchhill. Never, my dearest Alicia, was I so provoked in my life as by a letter this morning from Miss Summers. That horrid girl of mine has been trying to run away. I had not a notion of her being such a little devil before, she seemed to have all the Vernon milkiness; but on receiving the letter in which I declared my intention about Sir James, she actually attempted to elope; at least, I cannot otherwise account for her doing it. She meant, I suppose, to go to the Clarkes in Staffordshire, for she has no other acquaintances. But she shall be punished, she shall have him. I have sent Charles to town to make matters up if he can, for I do not by any means want her here. If Miss Summers will not keep her, you must find me out another school, unless we can get her married immediately. Miss S. writes word that she could not get the young lady to assign any cause for her extraordinary conduct, which confirms me in my own previous explanation of it. Frederica is too shy, I think, and too much in awe of me to tell tales, but if the mildness of her uncle should get anything out of her, I am not afraid. I trust I shall be able to make my story as good as hers. If I am vain of anything, it is of my eloquence. Consideration and esteem as surely follow command of language as admiration waits on beauty, and here I have opportunity enough for the exercise of my talent, as the chief of my time is spent in conversation. Reginald is never easy unless we are by ourselves, and when the weather is tolerable, we pace the shrubbery for hours together. I like him on the whole very well; he is clever and has a good deal to say, but he is sometimes impertinent and troublesome. There is a sort of ridiculous delicacy about him which requires the fullest explanation of whatever he may have heard to my disadvantage, and is never satisfied till he thinks he has ascertained the beginning and end of everything. 
This is one sort of love, but I confess it does not particularly recommend itself to me. I infinitely prefer the tender and liberal spirit of Mainwaring, which, impressed with the deepest conviction of my merit, is satisfied that whatever I do must be right; and look with a degree of contempt on the inquisitive and doubtful fancies of that heart which seems always debating on the reasonableness of its emotions. Mainwaring is indeed, beyond all compare, superior to Reginald—superior in everything but the power of being with me! Poor fellow! he is much distracted by jealousy, which I am not sorry for, as I know no better support of love. He has been teazing me to allow of his coming into this country, and lodging somewhere near incog.; but I forbade everything of the kind. Those women are inexcusable who forget what is due to themselves, and the opinion of the world. Yours ever, S. VERNON.

      There is a lot to unpack in this passage. We see that she has received word from Miss Summers about Frederica, who has tried to run away to a possible friend's house after hearing about her mother's intention for her to marry Sir James. She is explaining to her dear friend, Alicia/Mrs. Johnson, that she feels Frederica is too scared of her to tell her anything, so she has sent her uncle to 'truly scare' her in the hopes she'll start behaving correctly. She also addresses how she's sure that Frederica will speak "lies" of her to her uncle, so Lady Susan is going to have to find a way to make Frederica's stories sound misunderstood and victimize herself. After that first part of the passage, she then switches to telling her friend about all the new romantic aspects of her life. I believe she's making Reginald out to sound like a possibly interesting affair, but she would never plan to marry him as he isn't serious and is far too cocky. She then goes back to her yearning for Mr. Mainwaring, the married man, and she sounds semi-delusional addressing his jealousy. She then mentions how she refuses to bring him home near Incog, as the women there are nosy and have their opinion on everything.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      We thank the Reviewers for their positive assessment and recognition of the paper achievements. The insightful comments will strengthen the data and manuscript.

      Referee #1

      Minor comments

      1. Fig 1B - add arrows showing mRNAs being translated or not (the latter mentioned in line 113 is not so easy to see).

      We have magnified the inset of the colocalisation in the right column; we added arrows and arrowheads to differentiate colocalised and non-colocalised bcd with translating SunTag.

      2. Fig 2A - add a sentence explaining why 1,6HD, 2,5HD and NaCl disrupt P bodies.

      We have added the information on the use of 1,6HD, 2,5HD, and NaCl to disrupt P-bodies as below. Revised line 158: “To further show that bcd storage in P bodies is required for translational repression, we treated mature eggs with chemicals known to disrupt RNP granule integrity (31, 37, 69-72). Previous work has shown that the physical properties of P bodies in mature Drosophila oocytes can be shifted from an arrested to a more liquid-like state by addition of the aliphatic alcohol hexanediol (HD) (Sankaranarayanan et al., 2021; Ribbeck and Görlich, 2002; Kroschwald et al., 2017). While 1,6 HD has been widely used to probe the physical state of phase-separated condensates both in vivo and in vitro (Alberti et al., 2019; McSwiggen et al., 2019; Gao et al., 2022), in some cells it appears to have unwanted cellular consequences (Ulianov et al., 2021). These include potentially lethal cellular consequences that may indirectly affect the ability of condensates to form (Kroschwald et al., 2017) and wider cellular implications thought to alter the activity of kinases (Düster et al., 2021). While we did not observe any noticeable cellular issues in mature Drosophila oocytes with 1,6 HD, we also used 2,5 HD, known to be less problematic in most tissues (Ulianov et al., 2021), and the monovalent salt sodium chloride (NaCl), which changes electrostatic interactions (Sankaranarayanan et al., 2021).”

      Fig 4C - explain in the legend what the white lines drawn over the image represent. And why is there such an obvious distinction in the staining where suddenly the DAPI is much more evident (is the image from tile scans)?

      Figure 4C is a tile-scan image of an n.c.10 embryo, and the white lines divide the image into four quadrants. We used this image to quantify the extent of bcd (magenta) colocalisation with SunTag (green) in the anterior and posterior domains of the embryo, shown in the bar graph in panel C’. There is a formatting error in the image, which we will correct in the revised version. We will also explain the white lines in the legend. Finally, based on further reviewer comments, these data are moved to the supplementary information in the revised version.

      • Line 215 - 'We did not see any significant differences in the translation of bcd based on their position, however, there appears an enhanced translation of bcd localised basally to the nuclei (Figure S5).' Since the difference is not significant, I do not think the authors should conclude that translation is enhanced basally.

      We agree with the reviewer. In this preliminary revision we have changed this statement to: “We did not see any differences in the translation of bcd based on their position with respect to the nuclei position (Figure S5)” (revised line 238-239).

      Line 218: 'The interphase nuclei and their subsequent mitotic divisions appeared to displace bcd towards the apical surface (Figure S6B).' Greater explanation is needed in the legend to Fig S6B to support this statement as the data just seem to show a nuclear division - I would have thought an apical-basal view is needed to conclude this.

      We have rearranged this figure so that the new Figure S8 clearly shows the apical-basal view of the blastoderm nuclei and the displacement of bcd from the surface of the blastoderm.

      New Figure S8: n.c.8, pre-cortical migration; n.c.12 and 14, post-cortical migration; mitosis stages of n.c.9-10. The cortical interphase nuclei at n.c.12 and 14 displace bcd. The nuclear area (DAPI, cyan) does not show any bcd particles (magenta), as indicated by blue stars. The mitotic nuclei (yellow arrowheads, yellow stars) displace bcd along the plane of nuclear division (double-headed yellow arrows).

      Fig 5B - the authors compare Bcd protein distribution across developmental time. However, in the early time points cytoplasmic Bcd is measured (presumably as it does not appear nuclear until nc8 onwards) and compared to the distribution of nuclear Bcd intensities from nc9 onwards. Is most/all of the Bcd protein nuclear localised from nc9 to validate the nuclear quantitation? Does the distribution look the same if total Bcd protein is measured per volume rather than just the nuclear signal? Are the authors assuming a constant fast rate of nuclear import?

      From n.c.8 onwards, the Bcd signal in interphase nuclei builds up, with the nuclear intensity becoming very high compared to cytoplasmic Bcd. However, we do see significant Bcd signal in the cytoplasm (i.e., above background). In earlier work, gradients of the nuclear Bcd and nuclear-import mutant Bcd overlapped closely (Figure 1B, Grimm et al., 2010). This essentially suggests the nuclear Bcd gradient reflects the corresponding gradient of cytoplasmic Bcd. Further, the nuclear import of Bcd occurs rapidly after photobleaching (Gregor et al., 2007). Based on these observations, and our own measurements, prior to n.c. 9, the cytoplasmic gradient is likely a good approximation of the overall shape, whereas post n.c. 9 the Bcd signal is largely nuclear localised. Further, the overall profile is not dependent on the nuclear volume.

      • Line 259 - 'We then asked if considering the spatiotemporal pattern of bcd translation' - the authors should clarify what new information was included in the model. Similarly in line 286, 'By including more realistic bcd mRNA translation' - what does this actually mean? In line 346, 'We see that the original SDD model .... was too simple.' It would be nice to compare the outputs from the original vs modified SDD models to support the statement that the original model was too simple.

      We will improve the linking of the results to the model. The important point is that when and where Bcd production occurs is represented more faithfully than in previous approximations. By including more realistic production domains, we can replicate the observed Bcd gradient within the SDD paradigm without resorting to more complex models.

      Fig S1A - clarify what the difference is between the 2 +HD panels shown.

      The two +HD panels at stage 14 indicate that upon the addition of HD, there are no particles in 70% of the embryos, and 30% show reduced particles. We will add this information to the figure legend.

      • Fig S2E - the graph axis label/legend says it is intensity/molecule. Since intensity/molecule is higher in the anterior for bcd RNAs, is this because there are clumps of mRNAs (in which case it's actually intensity/puncta)?

      The density of mRNA is very high at the anterior pole, so there is a chance that more than one bcd particle lies within an imaged punctum (due to optical resolution limitations). We will change the y-axis label from "average intensity per molecule" to "average intensity per puncta".

      • Fig S4 - I think this line is included in error: '(B) The line plots of bcd spread on the Dorsal vs. Ventral surfaces.'

      Yes, we will correct this in the revision.

      • In B, D, E - is the plot depth from the dorsal surface? I would have preferred to see actual mRNA numbers rather than normalised mRNAs. In Fig S4D moderate, from 10um onwards there are virtually no mRNA counts based on the normalised value, but what is the actual number? The equivalent % translated data in Fig S4E look noisy so I wonder if this is due to there being a tiny mRNA number. The same is true for Figs S4D, E 10um+ in the low region.

      Beyond 10um from the dorsal surface, the number of bcdsun10 counts is very low. It becomes negligible at the moderate and low domains. We will attach the actual counts of mRNA in all these domains as a supplementary table in the revised version.

      General assessment Strengths are: 1) the data are of high quality; 2) the study advances the field by directly visualising Bcd mRNA translation during early Drosophila development; 3) the data showing re-localisation of bcd mRNAs to P bodies at nc14 provides new mechanistic insight into its degradation; 4) a new SDD model for Bcd gradient formation is presented. Limitations of the study are: 1) there was already strong evidence (but no direct demonstration) that bcd mRNA translation was associated with release from P bodies at egg activation; 2) it is not totally clear to me how exactly the modified SDD model varies from the original one both in terms of parameters included and model output.

      This is the first direct demonstration of the translation of bcd mRNA released as single mRNAs from P bodies. Previously, we showed that P body disruption releases single bcd mRNAs from the condensates (31). Here we follow the translation status of individual bcd mRNAs from their release from P bodies at the end of oocyte maturation until the end of blastoderm formation.

      The underlying SDD model – that of localised production, diffusion, and degradation – is still the same (up to spatially varying diffusion). Yet the model as originally formulated did not fit all aspects of the data, especially with regards to the system dynamics. Here, we demonstrate that by including more accurate approximations of when and where Bcd is produced, we can explain the formation of the Bcd morphogen gradient without recourse to any further mechanism.
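      To make the idea concrete, the following is a minimal one-dimensional sketch of an SDD-type model with a spatially extended and temporally restricted production domain. It is an illustration under stated assumptions, not the implementation used in the manuscript: the function name `simulate_sdd`, the diffusion coefficient, lifetime, source width, production window and grid are all hypothetical values chosen only to make the example run.

```python
def simulate_sdd(length_um=500.0, n=101, D=4.0, tau=1800.0,
                 source_width_um=100.0, rate=1.0,
                 t_on=0.0, t_off=6000.0, t_end=6000.0):
    """Explicit finite-difference SDD model in 1D with no-flux boundaries.

    dC/dt = D * d2C/dx2 - C / tau + source(x, t)
    All default parameter values are illustrative assumptions, not fitted values.
    """
    dx = length_um / (n - 1)
    dt = 0.2 * dx * dx / D              # well below the stability limit dx^2 / (2 D)
    steps = int(t_end / dt)
    c = [0.0] * n
    for k in range(steps):
        t = k * dt
        producing = t_on <= t < t_off   # production restricted in time
        new = c[:]
        for i in range(n):
            left = c[i - 1] if i > 0 else c[1]            # reflecting boundary
            right = c[i + 1] if i < n - 1 else c[n - 2]   # reflecting boundary
            lap = (left - 2.0 * c[i] + right) / (dx * dx)
            # production restricted in space to the anterior source domain
            src = rate if (producing and i * dx <= source_width_um) else 0.0
            new[i] = c[i] + dt * (D * lap - c[i] / tau + src)
        c = new
    return c

# Example: concentration profile along the A-P axis at t_end (coarse grid)
profile = simulate_sdd(n=51)
```

      Changing `t_on`, `t_off` and `source_width_um` is the sketch's stand-in for "when and where Bcd is produced"; the diffusion and degradation terms are left untouched, which is the sense in which the underlying model stays the same.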


      Referee #2

      1. Line 114: The authors claim to have validated the SunTag using a fluorescent reporter, but do not show any data. Ref 60 is a general reference to the SunTag, and not the Bcd results in this paper. Perhaps place their data into a supplemental figure or movie?

      To show the validation of our bcdSun32 line, we have composed a new Figure S1 that shows translating bcdSun32 (magenta) colocalising with ScFV-mSGFP2 (green). Yellow arrowheads in the zoom (right panel) point to translating bcdSun32 (magenta) and red arrowheads point to untranslated bcdSun32. In addition, we have also shown the validation of bcdSun32 with anti-GCN4 staining in the main Figure 1B.

      Further, we have dedicated supplementary Figure S3 (previously Figure S2) for the validation of our bcdSun10 construct. Briefly, bcdSun10 is inserted into att40 site of chr.2. We did a rescue experiment, where bcdSun10 rescued the lethality of homozygous bcdE1 null mutant. We then performed a colocalisation experiment using smFISH, where we demonstrated that almost all bcd in the anterior pole are of type bcdSun10. We targeted specific fluorescent FISH probes against 10xSunTag sequence (magenta, Figure S2A) and bcd coding sequence (magenta, Figure S2A). Upon colocalisation, we found ~90% of the mRNA are of bcdSun10 type. The remaining 10% could likely be contributed by the noise level (Figure S2B). We will make sure these points are clear in the revised manuscript.

      Line 128 and Fig. 1E: The claim that bcd becomes dispersed is difficult to verify by looking at the image. The language could also be more precise. What does it mean to lose tight association? Perhaps the authors could quantify the distribution, and summarize it by a length scale parameter? This is one of the main claims of the paper (cf. Line 23 of the abstract) but it is described vaguely and tersely here.

      We have changed the text from, “We also confirmed that bcd becomes dispersed, losing its tight association with the anterior cortex (Figure 1E) (31)” to, “We also confirmed that bcd is released from the anterior cortex at egg activation (Figure 1E) (31, 21).” (Revised line 131).

      The release of bcd mRNA at egg activation was first shown in 2008 (Ref 21, Figure 4, D-E) and again in 2021 (Ref 31, Figure 7 B and E). The main point of lines 127-128 is that “P bodies disassembled and bcd was no longer colocalised with P bodies”, and the novel aspect of line 23 is “translation observed”. The distribution of bcd mRNA after egg activation was not the point of this section. We have improved the writing in the revision to make this clearer.

      Line 146, Fig. 1G: This is a really important figure in the paper, but it is confusing because it seems the authors use the word "translation," when they mean "presence of Bcd protein." In other places in the paper, the authors give the impression that "bcd translation" means translation in progress (assayed by the colocalization of GCN4 and bcd mRNA). However, in Fig. 1G, the focus is only on GCN4. Detecting Bcd protein only at the anterior does not mean that translation happens only at the anterior (e.g., diffusion or spatially-restricted degradation could be in play).

      In Figure 1G, we have shown only the “translated” Bcd by staining with anti-GCN4. We have changed line 146 from, “Consistent with previous findings, we only observed bcd translation at the anterior of the activated egg and early embryo (Figure 1G-H) (3, 68)” to, “Consistent with previous findings, we only observed the presence of Bcd protein at the anterior of the activated egg and early embryo (Figure 1G-H) (3, 68)” (Revised line 151-153). We will use “translating bcd” or “bcd in translation” where we show colocalisation of bcd with BcdSun10 or BcdSun32 elsewhere in the manuscript.

      We did not mean to claim that translation occurred only in the anterior pole. We show that the abundance of bcd is very high in the anterior pole (in agreement with previous work) and that this is where the majority of observed translation events took place. Indeed, we have also shown that posteriorly localised mRNAs have the same BcdSun10 intensity per bcd puncta from the posterior pole (Figure 3B & 4C’ and Figure S2 E), but these are much fewer in number.

      It would also be helpful to show a plot with quantification of Bcd detection (or translation) on the y-axis and a continuous AP coordinate on the x-axis, instead of just two points (anterior and posterior poles, the latter of which is uninteresting because observing no Bcd at the posterior pole is expected).

      In Figure 1G,H, our aim was to test whether release from P bodies allowed for bcd mRNA to be translated. We used the presence of Bcd protein at the anterior domain of the oocytes to show this. The posterior pole was included as an internal control. To show the spatial distribution of bcd mRNA and its translation, we used early blastoderm (Figure 3, Figure S4).


      Another issue with Fig. 1G is that the A and P panels presumably have different brightness and contrast. If not, just from looking at the A and P panels, the conclusion would be that Bcd protein is diffuse (and abundant) in the posterior and concentrated into puncta in the anterior. The authors should either make the brightness and contrast consistent or state that the P panel had a much higher brightness than the A panel.

      We agree that this is a shortcoming. We have now added the following to the Figure 1 legend to clarify this observation. “G: Representative fixed 10 µm Z-stack images (from 10 samples) showing BcdSun32 protein (anti-GCN4) is only present at the anterior of an in vitro activated egg or early embryo 30-minute post fertilization. BcdSun32 protein is not detected in these samples at the posterior pole (image contrast increased to highlight the lack of distinct particles at the posterior). BcdSun32 protein is also not detected at the anterior or posterior of a mature oocyte or an in vitro activated egg incubated with NS8593 (images have the contrast increased to highlight the lack of distinct particles). Scale bar: 20 µm; zoom 2 µm.” (Revised line 623).

      • Line 176: This section is very confusing, because at this point the authors already addressed the spatial localization of translation in Fig. 1G,H (see my above comment). However, here it seems the authors have switched the definition of translation back to "translation in progress." Therefore, the confusion here could be eliminated by addressing the above point.

      In the revised version, we will use “Bcd protein” where it is shown by anti-GCN4 staining alone. We will use “translating bcd” or “bcd in translation” where we show colocalisation of bcd with anti-GCN4 signal (BcdSun10 or BcdSun32). We will change this in the corresponding text.

      Line 185: The sentence here is seemingly contradictory: "most...within 100 microns" implies that at least some are beyond 100 microns, while the sentence ends with "[none]...more than 100 microns." The language could perhaps be altered to be less vague/contradictory.

      We will clarify this in the revised version. A few particles are visible beyond 100 um. In the lower panel of Figure 3B, the posterior domain shows a few particles. However, their number is negligible compared to the bcd counts within the first 100 um (Figure 3C). Nonetheless, the few bcd particles we observe there do seem to be under translation (quantified in Figure 4C’ and Figure S2E).

      • Line 204: It would be really nice to have quantification of the translation events, such as curves of rate of translation as a function of a continuous AP coordinate, and a curve for each nc.

      In the revised version we will provide results quantifying the translation events across the anterior-posterior axis. This will clarify the presence of bcd, and its translation, in the posterior domain over time.

      Our colocalisation analysis is semi-automated. It combines automated counting of individual bcd particles with a manual judgement of whether BcdSun10 protein (distinct spots, above noise) is colocalised with each bcd particle (Figure S3D). The bcd particle counts ran into thousands in each cyan square box (measuring 50um radius and ~ 20um deep from the dorsal surface). We selected three such boxes covering 150um (continuously) from the anterior pole along the A-P axis and 20um deep along the D-V axis of the flattened embryo mounts (Figure 3A-C, Figure S4). We also scanned the scarce particles in the posterior; however, bcd counts there are very low compared to the anterior. Further, in Figure 4 we repeated the same technique to measure translation of bcd particles in embryos at different nuclear cycles.

      We have also shown continuous intensity measurements of bcd particles with their respective BcdSun10 gradient in Figure 5 across the A-P axis at different nuclear cycles. Here, we know BcdSun10 intensity is not only from the “translating” bcd (colocalised BcdSun10 to bcd particles) but also from the translated BcdSun10 freely diffusing (non-colocalised BcdSun10 to bcd particles). As asked by the reviewer, in the revised version we will add bcd counts and their translation status from anterior to posterior axis for each of the nuclear cycles.
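      As an illustration of how such per-position summaries could be tabulated, here is a minimal sketch that bins particles along the A-P axis and reports both the particle count and the fraction scored as translating in each bin. It assumes each particle has already been scored (for example by the manual colocalisation judgement described above); the function name `translated_fraction_by_bin`, the 50 um bin width and the input layout are hypothetical choices for this example.

```python
def translated_fraction_by_bin(particles, bin_width_um=50.0, n_bins=10):
    """Summarise translation status along the A-P axis.

    particles: iterable of (ap_position_um, is_translating) pairs, where
    is_translating is a pre-assigned True/False score for that particle.
    Returns a list of (bin_start_um, count, fraction), with fraction set to
    None for empty bins so that tiny samples are not mistaken for zeros.
    """
    counts = [0] * n_bins
    translating = [0] * n_bins
    for ap_um, is_translating in particles:
        idx = int(ap_um // bin_width_um)
        if 0 <= idx < n_bins:            # ignore particles outside the binned range
            counts[idx] += 1
            if is_translating:
                translating[idx] += 1
    return [
        (i * bin_width_um, counts[i],
         translating[i] / counts[i] if counts[i] else None)
        for i in range(n_bins)
    ]
```

      Reporting the raw count next to each fraction matters for the posterior bins, where very low particle numbers can make a percentage look noisy or misleadingly high.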

      In future work, we plan to generate MS2-tagged bcdSun10 to measure translation rates live across all nuclear cycles.


      Line 209 and Fig 4C: The authors use the terms "intensity of translation events" or "translation intensity" without clearly defining them. From the figure (specifically from the y-axis label), it looks like the authors are quantifying the intensity per molecule (which is not clearly the same thing as "translation intensity"), but it would be nice if that were stated explicitly.

      In the relevant result section, we have changed the results text to “the intensity of translation events” for explaining the results of Figure 4C’.

      • In addition, the authors again quantify only two points. This is a continuously frustrating part of the manuscript, which applies to nearly all figures where the authors looked only at two points in space. At a typical sample size of N = 3, it seems well within time constraints to image at multiple points along the AP axis.

      In addition to the quantification shown at the anterior and posterior locations of the embryo in Figures 3 and 4, the revised version will show the quantification of translation events across all locations from the anterior to the posterior. We will use three embryos for each nuclear cycle from n.c.1 to 14.

      • Furthermore, it sounds like the authors are saying the "translation intensity" is the same in anterior and the posterior, which is counterintuitive. The expectation is that translation would be undetectable at the posterior end, in part because bcd mRNA would not be present. (Note that this expectation is even acknowledged by the authors on Line 185, which I comment on above, and also on Line 197). There should also be very low levels of Bcd protein (possibly undetectable) at the posterior pole. As such, the authors should explain how they think their claim of the same "translation intensities" in the anterior vs posterior fits into the bigger picture of what we know about Bcd and what they have already stated in the manuscript. They should also explain how they observed enough molecules to quantify at the posterior end. The authors should also disclose how many points are in each box in the boxplot. For example, the sample size is N = 3 embryos. In just three embryos, how many bcd/GCN4 colocalizations did the authors observe at the posterior end of the embryo?

      At n.c.4 (Figure 3), we saw few bcd particles in the posterior. However, at n.c.10 (Figure 4C’) the number of posterior bcd particles is higher than at the earlier stages, and we have quantified them in Figure 4C’. We will clarify this in the revision using the new quantification of translation across the A-P axis that we are now undertaking.

      Finally, we will also provide the number of bcd particle counts and their colocalisation with anti-GCN4 as a supplementary table.

      • Line 215: The sentence that starts on this line seems self-contradictory: I cannot tell whether or not there is a difference in translation based on position.

      We have not observed any difference in the translation of bcd particles depending on the position along the Z-axis. We will edit this in our revised version.

      • Line 229: Long-ranged is a relative term. From the graph, one could state there is some spatial extent to the mRNA gradient, so it is unclear what the authors mean when they say it is not "long-ranged." Could the mRNA gradient be quantified, such as with a spatial length scale? This would provide more information for readers to make their own conclusions about whether it is long-ranged.

      We have quantified the bcd mRNA gradient for each n.c. (Figure 5B-C): absolute bcd intensities are shown in Figure 5B, left panel, and normalised intensities in Figure 5C. The length of the mRNA spread appears constant, with a half-length maximum of ~75um across all nuclear cycles. Our conclusion that the Bcd gradient is long-ranged is based on comparing the half-length maximum measurements of bcd particles and BcdSun10 (Figure 5D).
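      For readers who want to relate a half-maximum length to an exponential decay length, the sketch below computes the position at which a profile first falls to half its peak, using linear interpolation between sampled points. It is an illustration only: the function name `half_max_position` and the synthetic exponential profile with an assumed decay length of 100 um are hypothetical and are not taken from the manuscript's data. For a pure exponential profile exp(-x/lambda), the half-maximum position equals lambda * ln 2.

```python
import math

def half_max_position(x, y):
    """Return the first position past the peak where y drops to half its
    maximum, by linear interpolation; None if it never drops that far."""
    peak = max(y)
    start = y.index(peak)
    half = peak / 2.0
    for i in range(start, len(y) - 1):
        if y[i] >= half > y[i + 1]:
            frac = (y[i] - half) / (y[i] - y[i + 1])
            return x[i] + frac * (x[i + 1] - x[i])
    return None

# Synthetic example: exponential profile sampled every 5 um (assumed values)
lam = 100.0
x = [float(i) for i in range(0, 501, 5)]
y = [math.exp(-xi / lam) for xi in x]
x_half = half_max_position(x, y)   # close to lam * ln 2 for this profile
```

      Applying the same measure to the mRNA and protein profiles at each nuclear cycle gives a length scale as a function of time, which is the kind of comparison summarised in Figure 5D.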

      Line 230: When the authors claim the Bcd gradient is steeper earlier, a quantification of the spatial extent (exponential decay length scale) would be appropriate. Indeed, lambda as a function of time would be beneficial. It should also be placed in context of earlier papers that claim the spatial length scale is constant.

      We will show this effectively from the live movies of bcdSun10/nanos-scFv-sGFP2 in the revised version.

      • Lines 235-236: The two sentences that start on these two lines are vague and seemingly contradictory. The first sentence says there is a spatial shift, but the second sentence sounds like it is saying there is no spatial change. The language could be more precise to explain the conclusions.

      We agree with the reviewer. We will edit this in the revision.

      Minor comments

      • Line 81: Probably meant "evolutionarily conserved"

      Yes, we have changed “P bodies are an evolutionarily cytoplasmic RNP granule” to “P bodies are an evolutionarily conserved cytoplasmic RNP granule” (Revised line 84-85).

      Figure 1 legend: part B says "from 15 samples" but also says N = 20. Which is it, or do these numbers refer to different things?

      We have edited this from, “early embryo (from 15 samples)” to, “early embryo (from 20 samples)”. (Revised line 602).

      • Line 217: migration of what?

      Edited to “cortical nuclear migration”.

      • Line 228: "early embryo" is vague. The authors should give specific time points or nuclear cycle numbers.

      Edited to “nuclear cycles 1-8”.

      • Line 301: Other locations in the paper say 75 microns or 100 microns.

      We will make the changes. It is 100 um.

      • Fig. 5: all images should be oriented such that the dorsal midline is on the upper half of the embryo/image.

      We will flip the image to match.

      • Fig. 5B: There are light tan and/or light orange curves (behind the bold curves) that are not explained.

      It is the standard deviation. This will be explained.

      • Fig. 5C: the plot says "normalized" but nowhere do the authors describe what the curves are normalized to. There is also no explanation for what the broad areas of light color correspond to.

      Normalised to the bcd intensity maxima. This will be explained.

      Significance

      The results, if upheld, are highly significant, as they are foundational measurements addressing a longstanding question of how morphogen gradients are formed, using Bcd (the foundational morphogen gradient) as a model. They also address fundamental questions in genetics and molecular biology: namely, control of mRNA distribution and translation.

      We thank Reviewer 2 for highlighting the importance of our work in the field. We are confident that we address the issues raised by Reviewer 2 with the new set of quantifications we are currently working on.

      Referee #3

      • It is not evident from the main results and methods text that the new SDD model incorporates the phenomenon reported in figure 4B. From my reading, the parameter beta accounts for the Bcd translation rate, which according to figure 7B(ii) effectively switches from off to on around fertilization and thereafter remains constant. Figure 4B shows that the fraction of bcd mRNA engaged in translation decreases beginning around NC12/13, and this is one of the more powerful results that comes from monitoring translation in addition to RNA localization/abundance/stability. My expectation based on figure 4B would be that parameter beta should decrease over time beginning around 90-100 minutes and approach zero by ~150 minutes. This rate could be fit to the experimental data that yields figure 4B. The modeling should be repeated while including this information.

      This is a good observation. Currently, the reduced rate of bcd translation is modelled by incorporating an increased rate of bcd mRNA degradation. Of course, this could also be reduced by a change in the rate of translation directly. As stated already, the beta parameter is the least well characterised. In the revision, we will include a model where beta changes but not the mRNA degradation rate. We will improve the discussion to make this point clearer.

      1. The presentation of the SDD model should be expanded to address how well the characteristic decay length fits A) measured Bcd protein distributions, B) measured at different nuclear cycles. This would strengthen the claim that the new SDD model better captures gradient dynamics given the addition of translation and RNA distribution. These experimental data already exist as reported in Figure 5. In the current Figure 7, panels D and D' add little to the story and could be moved to a supplement if the authors want to include it (in any case, please fix the typo on the time axis of fig 7D' to read "hours"). The model per cell cycle and the comparison of experimental and modeled decay lengths could replace current D and D'.

      Originally, we kept discussion of the SDD model only to core points. It is clear from all Reviewers that expanding this discussion is important. In the revision, we will refocus Figure 7 on describing new results that we can learn. As outlined in the responses above, this paper reveals an important insight: the SDD model – with suitable modifications such as temporally restricted Bcd production – can explain all observed properties of Bcd gradient formation. Other mechanisms – such as bcd mRNA gradients – are not required.

      • The exposition of the manuscript would benefit significantly by including a section either in the introduction or the appropriate section of the results that defines the competing models for gradient formation. In the current version, these models are only cited, and the key details only come out late (e.g., lines 302 onward, in the Discussion). Nevertheless, some of the results are presented as if in dialog with these models, but it reads as a one-sided conversation. For instance: Figure 3. The undercurrent in this figure is the RNA-gradient model. In the context of this model, the results clearly show that translation of bcd is restricted to the anterior. Without this context, Figure 3 could read as a fairly unremarkable observation that translation occurs wherever there is mRNA. Restructuring the manuscript to explicitly name competing models and to address how experimental results support or detract from each competing model would greatly enhance the impact of the exposition.

      We thank the reviewer for this suggestion. We will add the current models of Bcd gradient formation in the introduction section and will change the narrative of results in the section explaining the models.

      (4A) Related to point 3: The entire results text surrounding Figure 2 should be revised to include more detail about A) what specific hypotheses are being tested; and B) to critically evaluate the limitations of the experimental approaches used to evaluate these hypotheses. Hexanediol and high salt conditions are not named explicitly in the text, but the text touts these as "chemicals" that "disrupt P-body integrity." This implies that the treatments are specific to P-bodies. Neither of these approaches disrupts only P body integrity. This does not invalidate this approach, but the manuscript needs to state what hypothesis HD and NaCl treatment addresses, and acknowledge the caveats of the approach (such as the non-specificity and the assumptions about the mechanism of action for HD).

      We have made the following edits to resolve this point. Revised line 158: “To further show that bcd storage in P bodies is required for translational repression, we treated mature eggs with chemicals known to disrupt RNP granule integrity (31, 37, 69-72). Previous work has shown that the physical properties of P bodies in mature Drosophila oocytes can be shifted from an arrested to a more liquid-like state by addition of the aliphatic alcohol hexanediol (HD) (Sankaranarayanan et al., 2021; Ribbeck and Görlich, 2002; Kroschwald et al., 2017). While 1,6 HD has been widely used to probe the physical state of phase-separated condensates both in vivo and in vitro (Alberti et al., 2019; McSwiggen et al., 2019; Gao et al., 2022), in some cells it appears to have unwanted cellular consequences (Ulianov et al., 2021). These include potentially lethal cellular consequences that may indirectly affect the ability of condensates to form (Kroschwald et al., 2017) and wider cellular implications thought to alter the activity of kinases (Düster et al., 2021). While we did not observe any noticeable cellular issues in mature Drosophila oocytes with 1,6 HD, we also used 2,5 HD, known to be less problematic in most tissues (Ulianov et al., 2021), and the monovalent salt sodium chloride (NaCl), which changes electrostatic interactions (Sankaranarayanan et al., 2021).”

      (4B) Continuing the comment above: it is good that the authors checked that HD and NaCl treatment does not cause egg activation. But no one outside of the field of Drosophila egg activation knows what the 2-minute bleach test is and shouldn't have to delve into the literature to understand this sentence. Please explain in one sentence that "if eggs are activated, then x happens following a short exposure to bleach (citations). We exposed HD and NaCl treated eggs to bleach and observed... ."

      We have made the following edits to resolve this point. Revised line 174: “After treating mature eggs with these solutions, we observed BcdSun32 protein in the oocyte anterior (Figure 2A-B). One caveat to this experiment could be that treating mature eggs with these chemicals results in egg activation which would in turn generate Bcd protein. To eliminate this possibility, we first screened for phenotypic egg activation markers, including swelling and a change in the chorion (73). We also applied the classic approach of bleaching eggs for two minutes which causes lysis of unactivated eggs (74). All chemically treated eggs failed this bleaching test meaning they were not activated (74). While we are unable to rule out non-specific actions of these treatments, these experiments corroborate that storage in P bodies that adopt an arrested physical state is crucial to maintain bcd translational repression (31).”

      (4C) Continuing the comment above: The section of the results related to the endos mutation needs additional information. It is not apparent to the average reader how the endos mutation results in changes in RNP granules, nor what the expected outcome of such an effect would "further test the model" set up by the HD and NaCl experiments. The average reader needs more hand-holding throughout this entire section (related to figure 2) to follow the exposition of the results.

      We have made the following edits to resolve this point. Edited line 185: “Finally, we used a genetic manipulation to change the physical state of P bodies in mature oocytes. Mutations in Drosophila Endosulfine (Endos), which is part of the conserved phosphoprotein α-endosulfine (ENSA) family (75), caused a liquid-like P body state after oocyte maturation, similar to that observed with chemical treatment (Figure 2C) (31). This temporal effect matched the known roles of Endos as the master regulator of oocyte maturation (75, 76). endos mutant oocytes lost the colocalisation of bcd mRNA and P bodies, concurrent with P bodies becoming less viscous during oocyte maturation (Figure 2D, Figure S1). Particle size and position analysis showed that bcd mRNA prematurely exhibits an embryo distribution in these mutants (Figure 2E). Due to genetic and antibody constraints, we are unable to test for translation of bcd in the endos mutant. However, it follows that bcd observed in this diffuse distribution outside of P bodies would be translationally active (Figure 2E-F).”

      • (4D) Continuing the comment above: The average reader also needs a better explanation of what hypothesis is being tested in Figure 1 with the pharmacological inhibition of calcium.

      We have made the following edits to resolve this point. Revised line 138: “We next sought to maintain the relationship between bcd mRNA and P bodies through egg activation. This would act as a control to further test if colocalisation of bcd to P bodies was necessary for its translational repression. Previous work has shown that a calcium wave is required at egg activation for further development (references to add Kaneuchi et al., 2015; York-Anderson et al., 2019; Hu and Wolfner, 2019). Chemical treatment with NS8593 disrupts this calcium wave, while other phenotypic markers of egg activation are still observed (58). Using NS8593 to disrupt the calcium wave in the activated egg, we show P bodies are retained during ex vivo egg activation (Figure 1E). In these treated eggs, bcd mRNA remains colocalised with the retained P bodies (Figure 1F). Based on these results and previous observations (31, 66), we hypothesised that the loss of colocalisation between bcd and P bodies correlates with bcd translation.”

      *It is unclear why Bcd translation could not be measured in the endos mutant background, but it would be necessary to measure Bcd translation in the endos background. If genotypically it is not possible/inconvenient to invoke the suntag reporter in the endos background, would it not be sufficient to immunostain against Bcd itself? Different Bcd antisera have recently been reported and distributed by the Wieschaus and the Zeitlinger groups.*

      We have recently received the Bcd antibody from the Zeitlinger group. This has not been shown to work for immunostaining. It remains unclear if it will be successful in this capacity, but we are currently testing it and will include this experiment in the revision if successful.

      *Figure 4 overall is glorious, but there is a problem with panel C. What are the white lines? Why does the intensity for the green and magenta channel change abruptly in the middle of the embryo?*

      These white lines divide the embryo into 4 compartments. We used this method to quantify the intensity of Bcd translation with respect to the bcd puncta. We will correct this image as there is a problem in formatting.

      *It is noted that neither the methods section nor the supplement contains any mention of how the modeling was performed. How was parameter beta fit? At least a brief section should be added to the methods describing how beta was fit (pending adjustments suggested in comment 1 above). A platinum-level addition would include a modeling supplement that reports the sensitivity of model outcomes to changes in parameters.*

      We apologise for this omission and will include full methodological details in the revision.

      Minor Comments:

      • Line 28: "Source-Diffusion-Degradation" should be changed to "Synthesis-..."

      We will edit in the revised version.

      *Line 39: "blastocyst" should be "blastoderm stage embryo".*

      We will edit in the revised version.

      • Line 81: "P bodies are an evolutionarily cytoplasmic RNP granule." is "conserved" missing here?

      We will edit in the revised version.

      • Throughout the manuscript, there should be better reporting of the imaged genotypes and whether the suntag is being visualized by indirect immunostaining of fixed tissues or through an encoded nanobody-GFP fusion.

      We will explain in detail in the revised version.

      • Figure 1G: Why is the background staining so different across conditions? Is this a normalization artifact?

      We agree with this shortcoming. We have now added the following to the figure legend to clarify this observation. “G: Representative fixed 10 µm Z-stack images (from 10 samples) showing BcdSun32 protein (anti-GCN4) is only present at the anterior of an in vitro activated egg or early embryo 30 minutes post fertilization. BcdSun32 protein is not detected in these samples at the posterior pole (image contrast increased to highlight the lack of distinct particles at the posterior). BcdSun32 protein is also not detected at the anterior or posterior of a mature oocyte or an in vitro activated egg incubated with NS8593 (images have the contrast increased to highlight the lack of distinct particles). Scale bar: 20 µm; zoom 2 µm.” (Revised line 623).

      Figure 2 legend: what is +Sch in the x-axis labels of figure 2B? The legend says that 2B is the quantification of the data in 2A, but there is no (presumed control) +Sch image in 2A.

      Thank you for this suggestion. We have added the data to Figure 2A.

      • Figure 5A largely repeats information presented in figure 4A. Please consider moving to a supplement. Also, please re-orient embryos to follow the convention that dorsal-most surfaces be presented on the top of the displayed images.

      Thank you for this suggestion. We will consider moving Figure 5A to the supplementary.

      • The lower-case roman numerals referred to in the text for figure 7B are not included in the corresponding figure panel.

      We will edit in the revised version.

      • Figure 7C y-axis typo (concentration).

      We will edit in the revised version.

      • Line 222: "make a long-range functional gradient": more accurate to say, "but also marks mature, Bcd protein which resolves in the expected long-range gradient."

      We will edit in the revised version.

      • Methods: Please check that all buffers referred to as acronyms are both compositionally defined in the reagents table, and that full names are written out at the time of first mention in the presented order. For instance, Schneider's media is referred to a few times before defining the acronym about midway through the methods section.

      We have added to Figure 2B: “Quantification of experiments shown in A. The number of oocytes that displayed Bcd protein at the anterior as measured by the presence of BcdSun32 at the anterior of the oocyte, but not the posterior. Schneider’s Insect Medium (+Sch) used as a negative control. N = 30 oocytes for each treatment. Scale bar: 5 µm.” (Revised line 646).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Lai and Doe address the integration of spatial information with temporal patterning and genes that specify cell fate. They identify the Forkhead transcription factor Fd4 as a lineage-restricted cell fate regulator that bridges transient spatial transcription factors to terminal selector genes in the developing Drosophila ventral nerve cord. The experimental evidence convincingly demonstrates that Fd4 is both necessary for late-born NB7-1 neurons and sufficient to transform other neural stem cell lineages toward the NB7-1 identity. This work addresses an important question that will be of interest to developmental neurobiologists: How can cell identities defined by initial transient developmental cues be maintained in the progeny cells, even if the molecular mechanism remains to be investigated? In addition, the study proposes a broader concept of lineage identity genes that could be utilized in other lineages and regions in the Drosophila nervous system and in other species.

      Thanks for the accurate summary and positive comments!

      While the spatial factors patterning the neuroepithelium to define the neuroblast lineages in the Drosophila ventral nerve cord are known, these factors are sometimes absent or not required during neurogenesis. In the current work, Lai and Doe identified Fd4 in the NB7-1 lineage that bridges this gap and explains how NB7-1 neurons are specified after Engrailed (En) and Vnd cease their expression. They show that Fd4 is transiently co-expressed with En and Vnd and is present in all nascent NB7-1 progenies. They further demonstrate that Fd4 is required for later-born NB7-1 progenies and sufficient for the induction of NB7-1 markers (Eve and Dbx) while repressing markers of other lineages when force-expressed in neural progenitors, e.g., in the NB5-6 lineage and in the NB7-3 lineage. They also demonstrate that, when Fd4 is ectopically expressed in NB7-3 and NB5-6 lineages, this leads to the ectopic generation of dorsal muscle-innervating neurons. The inclusion of functional validation using axon projections demonstrates that the transformed neurons acquire appropriate NB7-1 characteristics beyond just molecular markers. Quantitative analyses are thorough and well-presented for all experiments.

      Thanks for the positive comments!

      (1) While Fd4 is required and sufficient for several later-born NB7-1 progeny features, a comparison between early-born (Hb/Eve) and later-born (Run/Eve) appears missing for pan-progenitor gain of Fd4 (with sca-Gal4; Figure 4) and for the NB7-3 lineage (Figure 6). Having a quantification for both could make it clearer whether Fd4 preferentially induces later-born neurons or is sufficient for NB7-1 features without temporal restriction.

      We quantified the percentage of Hb+ and Runt+ cells among Eve+ cells with sca-gal4, and the results are shown in Figure 4-figure supplement 1. We found that the proportion of early-born cells is slightly reduced but the proportion of later-born cells remains similar. Interestingly, we also found a subset of Eve+ cells with a mixed fate (Hb+Runt+), but the reason remains unclear.

      (2) Fd4 and Fd5 are shown to be partially redundant, as Fd4 loss of function alone does not alter the number of Eve+ and Dbx+ neurons. This information is critical and should be included in Figure 3.

      Because every hemisegment in an fd4 single mutant is normal, we just added it as the following text: “In fd4 mutants, we observe no change in the number of Eve+ neurons or Dbx+ neurons (n=40 hemisegments).”

      (3) Several observations suggest that lineage identity maintenance involves both Fd4-dependent and Fd4-independent mechanisms. In particular, the fact that fd4-Gal4 reporter remains active in fd4/fd5 mutants even after Vnd and En disappear indicates that Fd4's own expression, a key feature of NB7-1 identity, is maintained independently of Fd4 protein. This raises questions about what proportion of lineage identity features require Fd4 versus other maintenance mechanisms, which deserves discussion.

      We agree, thanks for raising this point. We add the following text to the Discussion. “Interestingly, the fd4 fd5 mutant maintains expression of fd4:gal4, suggesting that the fd4/fd5 locus may have established a chromatin state that allows “permanent” expression in the absence of Vnd, En, and Fd4/Fd5 proteins.”

      (4) Similarly, while gain of Fd4 induces NB7-1 lineage markers and dorsal muscle innervation in NB5-6 and NB7-3 lineages, drivers for the two lineages remain active despite the loss of molecular markers, indicating some regulatory elements retain activity consistent with their original lineage identity. It is therefore important to understand the degree of functional conversion in the gain-of-function experiments. Sparse labeling of Fd4 overexpressing NB5-6 and NB7-3 progenies, as was done in Seroka and Doe (2019), would be an option.

      We agree it is interesting that the NB7-3 and NB5-6 drivers remain on following Fd4 misexpression. To explore this, we used sca-gal4 to overexpress Fd4 and observed that Lbe expression persisted while Eg was largely repressed (Author response image 1). The results show that Lbe and Eg respond differently to Fd4. A non-mutually exclusive possibility is that the continued expression of lbe-Gal4 UAS-GFP or eg-Gal4 UAS-GFP may be due to the lengthy perdurance of both Gal4 and GFP.

      Author response image 1.

      (5) The less-penetrant induction of Dbx+ neurons in NB5-6 with Fd4-overexpression is interesting. It might be worth the authors discussing whether it is an Fd4 feature or an NB5-6 feature by examining Dbx+ neuron number in NB7-3 with Fd4-overexpression.

      In the NB7-3 lineages misexpressing Fd4, only 5 lineages generated Dbx+ cells (0.1±0.4, n=64 hemisegments), suggesting that the low penetrance of Dbx+ induction is an intrinsic feature of Fd4 rather than lineage context. We have added this information in the results section.

      (6) It is logical to hypothesize that spatial factors specify early-born neurons directly, so only late-born neurons require Fd4, but it was not tested. The model would be strengthened by examining whether Fd4-Gal4-driven Vnd rescues the generation of later-born neurons in fd4/fd5 mutants.

      When we used the en-gal4 driver to express UAS-vnd in the fd4/fd5 mutant background, we found an average of 7.4±2.2 Eve+ cells per hemisegment (n=36), significantly higher than in the fd4/fd5 mutant alone (3.9±0.8 cells, n=52, p=2.6×10⁻¹¹) (Figure 3J). In addition, 0.2±0.5 Eve+ cells were ectopic Hb+ (excluding U1/U2), indicating that Vnd-En integration is sufficient to generate both early-born and late-born Eve+ cells in the fd4/fd5 mutants. We have added the results to the text.
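
      The group comparison above can be checked roughly from the summary values alone. The sketch below computes a Welch (unequal-variance) t statistic and its approximate degrees of freedom. It assumes that the ± values are standard deviations and that an unequal-variance test is a reasonable choice; the response does not state which test produced the reported p-value, and the function name `welch_t` is hypothetical.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom
    computed from group means, standard deviations, and sample sizes."""
    # Squared standard error of each group mean.
    var_a = sd_a ** 2 / n_a
    var_b = sd_b ** 2 / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Summary values quoted in the response (assumed here to be mean ± SD):
# en-gal4 > UAS-vnd in fd4/fd5 mutants versus fd4/fd5 mutants alone.
t_stat, df = welch_t(7.4, 2.2, 36, 3.9, 0.8, 52)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

      A p-value would then follow from the t distribution with these degrees of freedom, using whichever statistics package is at hand.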

      (7) It is mentioned that Fd5 is not sufficient for the NB7-1 lineage identity. The observation is intriguing in how similar regulators serve distinct roles, but the data are not shown. The analysis in Figure 4 should be performed for Fd5 as supplemental information.

      Thanks for the suggestion. Because the results are exactly the same as the wild type, we don’t think it is necessary to provide additional images or analysis as supplemental information.

      Reviewer #2 (Public review):

      Via a detailed expression analysis, they find that Fd4 is selectively expressed in embryonic NB7-1 and newly born neurons within this lineage. They also undertake a comprehensive genetic analysis to provide evidence that fd4 is necessary and sufficient for the identity of NB7-1 progeny.

      Thanks for the accurate summary!

      The analysis is both careful and rigorous, and the findings are of interest to developmental neurobiologists interested in molecular mechanisms underlying the generation of neuronal diversity. Great care was taken to make the figures clear and accessible. This work takes great advantage of years of painstaking descriptive work that has mapped embryonic neuroblast lineages in Drosophila.

      Thanks for the positive comments!

      The argument that Fd4 is necessary for NB7-1 lineage identity is based on a Fd4/Fd5 double mutant. Loss of fd4 alone did not alter the number of NB7-1-derived Eve+ or Dbx+ neurons. The authors clearly demonstrate redundancy between fd4 and fd5, and the fact that the LOF analysis is based on a double mutant should be better woven through the text. The authors generated an Fd5 mutant. I assume that Fd5 single mutants do not display NB7-1 lineage defects, but this is not stated. The focus on Fd4 over Fd5 is based on its highly specific expression profile and the dramatic misexpression phenotypes. But the LOF analysis demonstrates redundancy, and the conclusions in the abstract and through the results should reflect the existence of Fd5 in the conclusions of this manuscript.

      We agree, and have added new text to clarify the single mutant phenotypes (there are none) and the double mutant phenotype (loss of NB7-1 molecular and morphological features). The following text is added to the manuscript: “Not surprisingly, we found that fd4 single mutants or fd5 single mutants had no phenotype (Eve+ neurons were all normal). Thus, to assess their roles, we generated a fd4 and fd5 double mutant. Because many Eve+ and Dbx+ cells are generated outside of NB7-1 lineage, it was also essential to identify the Eve+ or Dbx+ cells within NB7-1 lineage in wild type and fd4 mutant embryos. To achieve this, we replaced the open reading frame of fd4 with gal4 (called fd4-gal4) (see Methods); this stock simultaneously knocked out both fd4 and fd5 (called fd4/fd5 mutant hereafter) while specifically labeling the NB7-1 lineage. For the remainder of this paper we use the fd4/fd5 double mutant to assay for loss of function phenotypes.”

      It is notable that Fd4 overexpression can rewire motor circuits. This analysis adds another dimension to the changes in transcription factor expression and, importantly, demonstrates functional consequences. Could the authors test whether U4 and U5 motor axon targeting changes in the fd4/fd5 double mutant? To strengthen claims regarding the importance of fd4/fd5 for lineage identity, it would help to address terminal features of U motorneuron identity in the LOF condition.

      Thanks for raising this important point. We examined the axon targeting on body wall muscles in both wild type and in fd4/fd5 mutant background and added the results in Figure 3-figure supplement 2. We found that the axon targeting in the late-born neuron region (LL1) is significantly reduced, suggesting that the loss of late-born neurons in fd4/fd5 mutant leads to the absence of innervation of corresponding muscle targets.

      Reviewer #3 (Public review):

      The goal of the work is to establish the linkage between the spatial transcription factors (STFs) that function transiently to establish the identities of the individual NBs and the terminal selector genes (typically homeodomain genes) that appear in the newborn postmitotic neurons. How is the identity of the NB maintained and carried forward after the spatial genes have faded away? Focusing on a single neuroblast (NB 7-1), the authors present evidence that the fork-head transcription factor, fd4, provides a bridge linking the transient spatial cues that initially specified neuroblast identity with the terminal selector genes that establish and maintain the identity of the stem cell's progeny.

      Thanks for the positive comments!

      The study is systematic, concise, and takes full advantage of 40+ years of work on the molecular players that establish neuronal identities in the Drosophila CNS. In the embryonic VNC, fd4 is expressed only in the NB 7-1 and its lineage. They show that Fd4 appears in the NB while the latter is still expressing the Spatial Transcription Factors and continues after the expression of the latter fades out. Fd4 is maintained through the early life of the neuronal progeny but then declines as the neurons turn on their terminal selector genes. Hence, fd4 expression is compatible with it being a bridging factor between the two sets of genes.

      Thanks for the accurate summary!

      Experimental support for the "bridging" role of Fd4 comes from a set of loss-of-function and gain-of-function manipulations. The loss of function of Fd4, and the partially redundant gene Fd5, from lineage 7-1 does not affect the size of the lineage, but terminal markers of late-born neuronal phenotypes, like Eve and Dbx, are reduced or missing. By contrast, ectopic expression of fd4, but not fd5, results in ectopic expression of the terminal markers eve and Dbx throughout diverse VNC lineages.

      Thanks for the accurate summary!

      A detailed test of fd4's expression was then carried out using lineages 7-3 and 5-6, two well-characterized lineages in Drosophila. Lineage 7-3 is much smaller than 7-1 and continues to be so when subjected to fd4 misexpression. However, under the influence of ectopic Fd4 expression, the lineage 7-3 neurons lost their expected serotonin and corazonin expression and showed Eve expression as well as motoneuron phenotypes that partially mimic the U motoneurons of lineage 7-1.

      Thanks for the positive comments!

      Ectopic expression of Fd4 also produced changes in the 5-6 lineage. Expression of apterous, a feature of lineage 5-6, was suppressed, and expression of the 7-1 marker, Eve, was evident. Dbx expression was also evident in the transformed 5-6 lineages, but extremely restricted as compared to a normal 7-1 lineage. Considering the partial redundancy of fd4 and fd5, it would have been interesting to express both genes in the 5-6 lineage. The anatomical changes that are exhibited by motoneurons in response to Fd4 expression confirm that these cells do, indeed, show a shift in their cellular identity.

      We appreciate the positive comments. We agree double misexpression of Fd4 and Fd5 might give a stronger phenotype (as the reviewer says) but the lack of this experiment does not change the conclusions that Fd4 can promote NB7-1 molecular and morphological aspects at the expense of NB5-6 molecular markers.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      The title of Figure 4 may be intended to include the term "Widespread", not "Wild spread". (Though the expansion of the Eve and Dbx with Fd4 is quite remarkable…).

      Done!

      Reviewer #3 (Recommendations for the authors):

      (1) Line 138. Is part of the sentence missing? Did the authors mean to say "that fd5 is coexpressed with fd4 in NB7-1 and its .....".

      Done!

      (2) ln 237: In trying to explain the "U-like" phenotype of the transformed motoneurons in lineage 7-3, the authors speculate that "perhaps their late birth did not give them time to extend to the most distant dorsal muscles ". It is very difficult to convince a motoneuron to stop growing in the absence of a target! An alternate possibility is that since there is only one or two U neurons made instead of the normal five, the growing motoneuron has enough information to direct them to the dorsal domain, but they lack the specification that allows them to recognize a specific muscle target.

      We agree there are additional possibilities, and have now updated the text to say: “We observed that these transformed neurons did not innervate the dorsal muscles; perhaps their late birth did not give them time to extend to the most distant dorsal muscles, or they were incompletely specified.”

      (3) In the References, I think that the Anderson et al. reference should also include "BioRxiv" before the DOI.

      Done!

      (4) Figure 6A for wild-type 7-3 lineage. The corazonin expression appears to be expressed in EW2 as well as EW3. This should be explained.

      We agree it looks that way, due to the 3D rotation used; we now replace it with a more representative image. Note that our quantification always shows a single Cor+ neuron per hemisegment.

      (5) Figure 7: Issues of terminology. The designation of "longitudinal" for muscles is traditionally in reference to the body axis, such as the Dorsal Longitudinal Muscles (DLM) of the adult thorax. The "longitudinal" muscles in the figure are really "transverse" muscles. I also suggest using "axon" or "neurites" rather than "filament". For the middle and bottom parts of E and F, are these lateral and ventral views? They should be designated as such.

      Thanks, we agree and have made the changes, using Axon instead of Filament, and labeling the views (lateral and ventro-lateral).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Weaknesses:

      The technical approach is strong and the conceptual framing is compelling, but several aspects of the evidence remain incomplete. In particular, it is unclear whether the reported changes in connectivity truly capture causal influences, as the rank metrics remain correlational and show discrepancies with the manipulation results.

      We agree that our functional connectivity ranking analyses cannot establish causal influences. As discussed in the manuscript, besides learning-related activity changes, the functional connectivity may also be influenced by neuromodulatory systems and internal state fluctuations. In addition, the spatial scope of our recordings is still limited compared to the full network implicated in visual discrimination learning, which may bias the ranking estimates. In future, we aim to achieve broader region coverage and integrate multiple complementary analyses to address the causal contribution of each region.

      The absolute response onset latencies also appear slow for sensory-guided behavior in mice, and it is not clear whether this reflects the method used to define onset timing or factors such as task structure or internal state.

      We believe this may be primarily due to our conservative definition of onset timing. Specifically, we required the firing rate to exceed baseline (t-test, p < 0.05) for at least 3 consecutive 25-ms time windows. This might lead to later estimates than in other studies, such as those using the latency to the first spike after visual stimulus onset (Siegle et al., 2021) or the time to half-max response (Goldbach, Akitake, Leedy, & Histed, 2021).
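
      The consecutive-window criterion described above can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: it assumes that the per-window t-test against baseline has already been run and reduced to one boolean flag per 25-ms window, and that latency is reported as the start of the first window in the qualifying run. The function name `onset_latency_ms` and the example flags are hypothetical.

```python
from typing import Optional, Sequence


def onset_latency_ms(
    significant: Sequence[bool],
    window_ms: float = 25.0,
    min_run: int = 3,
) -> Optional[float]:
    """Return the start time (ms after stimulus onset) of the first run of
    at least `min_run` consecutive significant windows, or None if no such
    run exists."""
    run_length = 0
    for index, is_significant in enumerate(significant):
        # Extend the current run, or reset it on a non-significant window.
        run_length = run_length + 1 if is_significant else 0
        if run_length == min_run:
            first_window = index - min_run + 1
            return first_window * window_ms
    return None


# Hypothetical flags for one unit (True = rate above baseline, p < 0.05).
# The isolated significant window at index 1 is ignored by the criterion.
flags = [False, True, False, True, True, True, False]
latency = onset_latency_ms(flags)
```

      Requiring several consecutive windows guards against isolated false positives, which is why a criterion of this kind tends to yield later onset estimates than first-spike or half-max measures.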

      The estimation of response onset latency in our study may also be affected by potential internal state fluctuations of the mice. We used the period before visual stimulus onset as the baseline. Because firing rates in this period could be affected by trial history, we acknowledge that this may increase the variability of the baseline and thus make the response onset harder to detect statistically.

      Still, we believe these concerns do not affect the observation of the formation of a compressed activity sequence in CR trials during learning.

      Furthermore, the small number of animals, combined with extensive repeated measures, raises questions about statistical independence and how multiple comparisons were controlled.

      We agree that a larger sample size would strengthen the robustness of the findings. However, as noted above, the current dataset has inherent limitations in both the number of recorded regions and the behavioral paradigm. Given the considerable effort required to achieve sufficient unit yields across all targeted regions, we wish to adjust the set of recorded regions, improve behavioral task design, and implement better analyses in future studies. This will allow us to both increase the number of animals and extract more precise insights into mesoscale dynamics during learning.

      The optogenetic experiments, while intended to test the functional relevance of rank increasing regions, leave it unclear how effectively the targeted circuits were silenced. Without direct evidence of reliable local inhibition, the behavioral effects or lack thereof are difficult to interpret.

      We appreciate this important point. Due to the design of the flexible electrodes and the implantation procedure, bilateral co-implantation of both electrodes and optical fibers was challenging, which prevented us from directly validating the inhibition effect in the same animals used for behavior. In hindsight, we could have conducted parallel validations using conventional electrodes, and we will incorporate such controls in future work to provide direct evidence of manipulation efficacy.

      Details on spike sorting are limited.

      We have provided more details on spike sorting in the Methods section, including the exact parameters used in the automated sorting algorithm and the subsequent manual curation criteria.

      Reviewer #2 (Public review):

      Weaknesses:

      I had several major concerns:

      (1) The number of mice was small for the ephys recordings. Although the authors start with 7 mice in Figure 1, they then reduce to 5 in panel F. And in their main analysis, they minimize their analysis to 6/7 sessions from 3 mice only. I couldn't find a rationale for this reduction, but in the methods they do mention that 2 mice were used for fruitless training, which I found no mention in the results. Moreover, in the early case, all of the analysis is from 118 CR trials taken from 3 mice. In general, this is a rather low number of mice and trial numbers. I think it is quite essential to add more mice.

      We apologize for the confusion. As described in the Methods section, 7 mice (Figure 1B) were used for behavioral training without electrode array or optical fiber implants to establish learning curves, and an additional 5 mice underwent electrophysiological recordings (3 for visual-based decision-making learning and 2 for fruitless learning).

      As we noted in our response to Reviewer #1, the current dataset has inherent limitations in both the number of recorded regions and the behavioral paradigm. Given the considerable effort required to achieve high-quality unit yields across all targeted regions, we wish to adjust the set of recorded regions, improve behavioral task design, and implement better analyses in future studies. These improvements will enable us to collect data from a larger sample size and extract more precise insights into mesoscale dynamics during learning.

      (2) Movement analysis was not sufficient. Mice learning a go/no-go task establish a movement strategy that is developed throughout learning and is also biased towards Hit trials. There is an analysis of movement in Figure S4, but this is rather superficial. I was not even sure that the 3 mice in Figure S4 are the same 3 mice in the main figure. There should be also an analysis of movement as a function of time to see differences. Also for Hits and FAs. I give some more details below. In general, most of the results can be explained by the fact that as mice gain expertise, they move more (also in CR during specific times) which leads to more activation in frontal cortex and more coordination with visual areas. More needs to be done in terms of analysis, or at least a mention of this in the text.

      Due to limitations in the experimental design and implementation, movement tracking was not performed during the electrophysiological recordings, and the 3 mice shown in Figure S4 (now S5) were from a separate group. We have carefully examined the temporal profiles of mouse movements and found that they did not fully match the rank dynamics for all regions, and we have added these results and related discussion in the revised manuscript. However, we acknowledge that the observed motion energy pattern could explain some of the functional connection dynamics; for example, the decrease in face and pupil motion energy could explain the reduction in ranks for striatum.

      Without synchronized movement recordings in the main dataset, we cannot fully disentangle movement-related neural activity from task-related signals. We have made this limitation explicit in the revised manuscript and discuss it as a potential confound, along with possible approaches to address it in future work.

      (3) Most of the figures are over-detailed, and it is hard to understand the take-home message. Although the text is written succinctly and rather short, the figures are mostly overwhelming, especially Figures 4-7. For example, Figure 4 presents 24 brain plots! For rank input and output rank during early and late stim and response periods, for early and expert and their difference. All in the same colormap. No significance shown at all. The Δrank maps for all cases look essentially identical across conditions. The division into early and late time periods is not properly justified. But the main take home message is positive Δrank in OFC, V2M, V1 and negative Δrank in ThalMD and Str. In my opinion, one trio map is enough, and the rest could be bumped to the Supplementary section, if at all. In general, the figures in several cases do not convey the main take home messages. See more details below.

      We thank the reviewer for this valuable critique. The statistical significance corresponding to the brain plots (Figure 4 and Figure 5) was presented in Figure S3 and S5 (now Figure S5 and S7 in the revised manuscript), but we agree that the figure can be simplified to focus on the key results.

      In the revised manuscript, we have condensed these figures to focus on the most important comparisons to make the visual presentation more concise and the take-home message clearer.

      (4) The analysis is sometimes not intuitive enough. For example, the rank analysis of input and output rank seemed a bit over complex. Figure 3 was hard to follow (although a lot of effort was made by the authors to make it clearer). Was there any difference between the output and input analysis? Also, the time period seems redundant sometimes. Also, there are other network analysis that can be done which are a bit more intuitive. The use of rank within the 10 areas was not the most intuitive. Even a dimensionality reduction along with clustering can be used as an alternative. In my opinion, I don't think the authors should completely redo their analysis, but maybe mention the fact that other analyses exist

      We appreciate the reviewer’s comment. In brief, the input- and output-rank analyses yielded largely similar patterns across regions in CR trials, although some differences were observed in certain areas (e.g., striatum) in Hit trials, where the magnitude of rank change was not identical between input and output measures. We have condensed the figures to show only averaged rank results, and the colormap has been updated to better convey the message.

      We did explore dimensionality reduction applied to the ranking data. However, the results were not intuitive either and required additional interpretation without yielding further insight. Still, we acknowledge that other analysis approaches might provide complementary insights.

      Reviewer #3 (Public review):

      Weaknesses:

      The weakness is also related to the strength provided by the method. It is demonstrated in the original method that this approach in principle can track individual units for four months (Luan et al, 2017). The authors have not showed chronically tracked neurons across learning. Without demonstrating that and taking advantage of analyzing chronically tracked neurons, this approach is not different from acute recording across multiple days during learning. Many studies have achieved acute recording across learning using similar tasks. These studies have recorded units from a few brain areas or even across brain-wide areas.

      We appreciate the reviewer’s important point. We did attempt to track the same neurons across learning in this project. However, due to the limited number of electrodes implanted in each brain region, the number of chronically tracked neurons in each region was insufficient to support statistically robust analyses. Concentrating probes in fewer regions would allow us to obtain enough units tracked across learning in future studies to fully exploit the advantages of this method.

      Another weakness is that major results are based on analyses of functional connectivity that is calculated using the cross-correlation score of spiking activity (TSPE algorithm). Functional connection strength across areas is then ranked 1-10 based on relative strength. Without ground truth data, it is hard to judge the underlying caveats. I'd strongly advise the authors to use complementary methods to verify the functional connectivity and to evaluate the mesoscale change in subnetworks. Perhaps the authors can use one key information of anatomy, i.e. the cortex projects to the striatum, while the striatum does not directly affect other brain structures recorded in this manuscript.

      We agree that the functional connectivity measured in this study relies on statistical correlations rather than direct anatomical connections. We plan to test the functional connection data with shorter cross-correlation delay criteria to see whether the results are consistent with anatomical connections and whether the original findings still hold.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) The small number of mice, each contributing many sessions, complicates the  interpretation of the data. It is unclear how statistical analyses accounted for the small  sample size, repeated measures, and non-independence across sessions, or whether  multiple comparisons were adequately controlled.

      We recognize the limitation imposed by the small number of animals; the difficulty of achieving sufficient unit yields across all regions in the same animal restricted our sample size. We agree that a larger sample size would strengthen the robustness of the findings; however, as noted below, the current dataset has inherent limitations in both the scope of recorded regions and the behavioral paradigm.

      Given the considerable effort required to achieve sufficient unit yields across all targeted regions, we plan to adjust the set of recorded regions, improve the behavioral task design, and implement better analyses in future studies. This will allow us to both increase the number of animals and extract more precise insights into mesoscale dynamics during learning.

      (2) The ranking approach, although intuitive for visualizing relative changes in  connectivity, is fundamentally descriptive and does not reflect the magnitude or  reliability of the connections. Converting raw measures into ordinal ranks may obscure  meaningful differences in strength and can inflate apparent effects when the underlying  signal is weak.

      We agree with this important point. As stated in the manuscript, our motivation for taking the ranking approach was that differences in firing rates might bias the cross-correlation between spike trains, making raw counts of significant neuron pairs difficult to compare across conditions. We acknowledge, however, that the ranking measures might obscure meaningful differences or inflate weak effects in the data.

      We have added the limitations of the ranking approach to the discussion section and emphasized the need, in future studies, for analysis approaches that assess functional connection dynamics more accurately without bias from firing rates.

      (3) The absolute response onset latencies also appear quite slow for sensory-guided  behavior in mice, and it remains unclear whether this reflects the method used to  determine onset timing or factors such as task design, sensorimotor demands, or  internal state. The approach for estimating onset latency by comparing firing rates in  short windows to baseline using a t-test raises concerns about robustness, as it may  be sensitive to trial-to-trial variability and yield spurious detections.

      We agree that this may be primarily due to our conservative definition of onset timing. Specifically, we required the firing rate to exceed baseline (t-test, p < 0.05) for at least 3 consecutive 25-ms time windows. This might yield later estimates than the measures used in other studies, such as the latency to the first spike after visual stimulus onset (Siegle et al., 2021) or the time to half-maximal response (Goldbach, Akitake, Leedy, & Histed, 2021).

      The estimation of response onset latency in our study may also be affected by potential internal state fluctuations of the mice. We used the period before visual stimulus onset as the baseline. Because firing rates in this period could be affected by trial history, we acknowledge that this may increase the variability of the baseline and thus make the response onset harder to detect statistically.

      Still, we believe these concerns do not affect the observation that a compressed activity sequence formed in CR trials during learning.
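
      As an illustration of the onset criterion described above (firing rate above baseline at p < 0.05 for at least 3 consecutive 25-ms windows), a minimal sketch might look as follows. It assumes the per-window t-test p-values have already been computed; the function name, its arguments, and the choice to report the start of the first window in the qualifying run are assumptions made for this example, not the exact implementation used in the study.

```python
from typing import Optional, Sequence


def onset_latency_ms(
    p_values: Sequence[float],
    exceeds_baseline: Sequence[bool],
    alpha: float = 0.05,
    n_consecutive: int = 3,
    window_ms: float = 25.0,
) -> Optional[float]:
    """Return the start time (ms after stimulus onset) of the first run of
    `n_consecutive` windows that are significantly above baseline.

    p_values[i]         -- hypothetical p-value of the t-test for window i
    exceeds_baseline[i] -- True if the mean rate in window i is above baseline
    Returns None when no qualifying run exists.
    """
    run = 0  # length of the current run of significant windows
    for i, (p, above) in enumerate(zip(p_values, exceeds_baseline)):
        if above and p < alpha:
            run += 1
            if run == n_consecutive:
                first_window = i - n_consecutive + 1
                return first_window * window_ms
        else:
            run = 0  # the run is broken; start counting again
    return None
```

      Requiring several consecutive significant windows guards against isolated spurious detections, at the cost of missing brief responses, which is consistent with the conservative behavior noted above.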

      (4) Details on spike sorting are very limited. For example, defining single units only by  an interspike interval threshold above one millisecond may not sufficiently rule out  contamination or overlapping clusters. How exactly were neurons tracked across days  (Figure 7B)?

      We have added more details on spike sorting, including the processing steps and important parameters used in the automated sorting algorithm. Only the clusters well isolated in feature space were accepted in manual curation.

      We attempted to track the same neurons across learning in this project. However, due to the limited number of electrodes implanted in each brain region, the number of chronically tracked neurons in each region was insufficient to support statistically robust analyses.

      This is now stated more clearly in the discussion section.

      (5) The optogenetic experiments, while designed to test the functional relevance of  rank-increasing regions, also raise questions. The physiological impact of the inhibition  is not characterized, making it unclear how effectively the targeted circuits were  actually silenced. Without clearer evidence that the manipulations reliably altered local  activity, the interpretation of the observed or absent behavioral effects remains  uncertain.

      We appreciate this important point. Due to the design of the flexible electrodes and the implantation procedure, bilateral co-implantation of both electrodes and optical fibers was challenging, which prevented us from directly validating the inhibition effect in the same animals used for behavior. In hindsight, we could have conducted parallel validations using conventional electrodes, and we will incorporate such controls in future work to provide direct evidence of manipulation efficacy. 

      (6) The task itself is relatively simple, and the anatomical coverage does not include  midbrain or cerebellar regions, limiting how broadly the findings can be generalized to more flexible or ethologically relevant forms of decision-making.

      We appreciate this advice and have expanded the existing discussion to more explicitly state that the relatively simple task design and anatomical coverage might limit the generalizability of our findings.

      (7) The abstract would benefit from more consistent use of tense, as the current mix of  past and present can make the main findings harder to follow. In addition, terms like  "mesoscale network," "subnetwork," and "functional motif" are used interchangeably in  places; adopting clearer, consistent terminology would improve readability.

      We have changed several verbs in the abstract to the past tense, and we have adopted more consistent terminology by replacing “functional motif” with “subnetwork”. We still feel that “mesoscale network” and “subnetwork” emphasize different aspects of the results depending on the context, so we have kept both terms.

      (8) The discussion could better acknowledge that the observed network changes may  not reflect task-specific learning alone but could also arise from broader shifts in  arousal, attention, or motivation over repeated sessions.

      We have expanded the existing discussion to better acknowledge the possible effects from broader shifts in arousal, attention, or motivation over repeated sessions.

      (9) The figures would also benefit from clearer presentation, as several are dense and not straightforward to interpret. For example, Figure S8 could be organized more clearly to highlight the key comparisons and main message.

      We have simplified the over-detailed brain plots in Figures 4-5, as well as the plots in Figures 6 and S8 (now S10 in the revised manuscript).

      (10) Finally, while the manuscript notes that data and code are available upon request,  it would strengthen the study's transparency and reproducibility to provide open access  through a public repository, in line with best practices in the field.

      The spiking data, behavior data, and code for the core analyses in the manuscript are now shared in a public repository (Dryad), and we have changed the description in the Data Availability section accordingly.

      Reviewer #2 (Recommendations for the authors):

      (A) Introduction:

      (1) "Previous studies have implicated multiple cortical and subcortical regions in visual  task learning and decision-making". No references here, and also in the next sentence.

      The references appeared later in the introduction; we have now added them here as well.

      We also added one review on cortical-subcortical neural correlates in goal-directed behavior (Cruz et al., 2023).

      (2) Intro: In general, the citation of previous literature is rather minimal, too minimal.  There is a lot of studies using large scale recordings during learning, not necessarily  visual tasks. An example for brain-wide learning study in subcortical areas is Sych et  al. 2022 (cell reports). And for wide-field imaging there are several papers from the  Helmchen lab and Komiyama labs, also for multi-area cortical imaging.

      We appreciate this advice. We included mainly the visual task learning literature to keep the scope focused on the regions and task we actually explored in this study. We are concerned that expanding the intro to include all large-scale imaging and recording studies in the learning field would make the background too broad.

      We have included (Sych, Fomins, Novelli, & Helmchen, 2022) for its relevance and importance in the field.

      (3) In the intro, there is only a mention of a recording of 10 brain regions, with no  mention of which areas, along with their relevance to learning. This is mentioned in the  results, but it will be good in the intro.

      The area names have now been added to the intro.

      (B) Results:

      (1) Were you able to track the same neurons across the learning profile? This is not  stated clearly.

      We did attempt to track the same neurons across learning in this project. However, due to the limited number of electrodes implanted in each brain region, the number of chronically tracked neurons in each region was insufficient to support statistically robust analyses.

      We now state this more clearly in the discussion section.

      (2) Figure 1 starts with 7 mice, but only 5 mice are in the last panel. Later it goes down  to 3 mice. This should be explained in the results and justified.

      We apologize for the confusion. As described in the Methods section, 7 mice (Figure 1B) were used for behavioral training without electrode array or optical fiber implants to establish learning curves, and an additional 5 mice underwent electrophysiological recordings (3 for visual-based decision-making learning and 2 for fruitless learning).

      (3) I can't see the electrode tracks in Figure 1d. If they are flexible, how can you make  sure they did not bend during insertion? I couldn't find a description of this in the  methods also.

      The electrode shanks were ultra-thin (1-1.5 µm), and it was usually difficult to recover observable tracks or electrodes in brain sections.

      The ultra-flexible probes could not penetrate the brain on their own (since they are flexible), and had to be shuttled into position by tungsten wires through holes designed at the tips of the array shanks. The tungsten wires were assembled onto the electrode array before implantation; this was described in the section on electrode array fabrication and assembly. We have also included a description of the retraction of the guiding tungsten wires in the surgery section to avoid confusion.

      As a further attempt to verify the accuracy of implantation depth, we also measured the repeatability of implantation in a group of mice and found a tendency for the arrays to end at a slightly deeper location in cortex (142.1 ± 55.2 μm, n = 7 shanks) and at a slightly shallower location in subcortical structures (-122.6 ± 71.7 μm, n = 7 shanks). We added these results as new Figure S1 to accompany Figure 1.

      (4) In the spike raster in 1E, there seems to be ~20 cells in V2L, for example, but in 1F, the number of neurons doesn't go below 40. What is the difference here?

      We checked Figure 1F, and the plotted dots do go below 40, down to ~20. Perhaps the file that the reviewer received was not displayed correctly?

      (5) The authors focus mainly on CR, but during learning, the number of CR trials is  rather low (because they are not experts). This can also be seen in the noisier traces  in Figure 2a. Do the authors account for that (for example by taking equal trials from  each group)? 

      We accounted for this by constructing bootstrap-resampled datasets with only 5 trials from each session in both the early stage and the expert stage. The mean trace of the 500 datasets again showed an overall decrease in CR trial firing rate during task learning, with temporal dynamics highly similar to those of the original data.

      The figure is now added to supplementary materials (as Figure S3 in the revised manuscript).
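
      The trial-equalizing resampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: trials are drawn with replacement within each session, the drawn trials are pooled into one dataset, and the function and argument names are hypothetical rather than taken from the shared analysis code.

```python
import random
from typing import List, Sequence


def bootstrap_mean_trace(
    sessions: Sequence[Sequence[Sequence[float]]],
    n_trials: int = 5,
    n_boot: int = 500,
    seed: int = 0,
) -> List[float]:
    """Average firing-rate trace over `n_boot` resampled datasets.

    sessions[s][t] is the firing-rate trace (one value per time bin) of
    trial t in session s. Each resampled dataset takes `n_trials` trials
    per session (sampling with replacement is an assumption), so every
    session contributes equally regardless of its original trial count.
    """
    rng = random.Random(seed)
    n_bins = len(sessions[0][0])
    grand_sum = [0.0] * n_bins
    for _ in range(n_boot):
        pooled = []
        for trials in sessions:
            pooled.extend(rng.choices(trials, k=n_trials))
        # Mean trace of this resampled dataset, accumulated bin by bin.
        for b in range(n_bins):
            grand_sum[b] += sum(trace[b] for trace in pooled) / len(pooled)
    return [total / n_boot for total in grand_sum]
```

      Running the same procedure on early-stage and expert-stage sessions gives two mean traces computed from matched trial counts, so a difference between them is less likely to reflect the lower number of CR trials early in training.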

      (6) From Figure 2a, it is evident that Hit trials increase response when mice become  experts in all brain areas. The authors have decided to focus on the response onset  differences in CRs, but the Hit responses display a strong difference between naïve  and expert cases.

      Judging from the learning curves in this task, the mice learned to inhibit their licking when the No-Go stimuli appeared, which is the main reason we focused on these trials.

      The movement effects and potential licking artefacts in Hit trials also restricted our interpretation of these trials.

      (7) Figure 3 is still a bit cumbersome. I wasn't 100% convinced of why there is a need to rank the connection matrix. I mean when you convert to rank, essentially there could be a meaningful general reduction in correlation, for example during licking, and this will be invisible in the ranking system. Maybe show in the supp non-ranked data, or clarify this somehow.

      We agree with this important point. As stated in the manuscript and in our response to Reviewer #1, our motivation for taking the ranking approach was that differences in firing rates could bias the cross-correlation between spike trains, making raw counts of significant neuron pairs difficult to compare across conditions. We acknowledge, however, that the ranking measures might obscure meaningful differences or inflate weak effects in the data.

      We have added the limitations of the ranking approach to the discussion section and emphasized the need, in future studies, for analysis approaches that assess functional connection dynamics more accurately without bias from firing rates.

      (8) Figure 4a x label is in ms, which is different than previous time labels, which were seconds.

      We have now changed all time labels, starting from Figure 2, to milliseconds.

      (9) Figure 4 input and output rank look essentially the same.

      We have compressed the brain plots in Figures 4-5 to better convey the take-home message.

      (10) Also, what is the late and early stim period? Can you mark each period in panel A? Early stim period is confusing with early CR period. Same for early response and late response.

      The definition of time periods was in figure legends. We now mark each period out to avoid confusion.

      (11) Looking at panel B, I don't see any differences between delta-rank in early stim,  late stim, early response, and late response. Same for panel c and output plots.

      The rankings were indeed relatively stable across time periods. The plots are now compressed and show a mean rank value.

      (12) Panels B and C are just overwhelming and hard to grasp. Colors are similar both  to regular rank values and delta-rank. I don't see any differences between all  conditions (in general). In the text, the authors report only M2 to have an increase in  rank during the response period. Late or early response? The figure does not go well  with the text. Consider minimizing this plot and moving stuff to supplementary.

      The colormaps have now been changed to avoid confusion, and the brain plots have been compressed.

      (13) In terms of a statistical test for Figure 4, a two-way ANOVA was done, but over what? What are the statistics and p-values for the test? Is there a main effect of time also? Is there a significant interaction? Was this done on all mice together? How many mice? If I understand correctly, the post-hoc statistics are presented in the supplementary, but from the main figure, you cannot know what is significant and what is not.

      For these figures we were mainly concerned with the post-hoc statistics which described the changes in the rankings of each region across learning.

      We have changed the description to “t-test with Sidak correction” to avoid the confusion.
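
      For readers unfamiliar with the Sidak correction mentioned above, the standard adjustment for m comparisons replaces each p-value p with 1 - (1 - p)^m. A minimal sketch is given below; the function name is hypothetical, and taking m as the number of p-values in the family is an assumption, since the number of comparisons per test is not stated here.

```python
from typing import List, Sequence


def sidak_adjust(p_values: Sequence[float]) -> List[float]:
    """Sidak-adjusted p-values for a family of m = len(p_values) tests.

    Each adjusted value is 1 - (1 - p)**m, which controls the family-wise
    error rate under the assumption that the tests are independent.
    """
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]
```

      An adjusted value below the chosen significance level (for example 0.05) would then be reported as significant for that region.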

      (14) In the legend of Figure 4, it is reported that 610 expert CR trials from 6 sessions,  instead of 7 sessions. Why was that? Also, like the previous point, why only 3 mice?

      Behavior data for all the sessions used were shown in Figure S1. Only 3 mice were used for the learning group; the difficulty of achieving sufficient unit yields across all regions in the same animal restricted our sample size.

      (15) Body movement analysis: was this done in a different cohort of mice? Only now  do I understand why there was a division into early and late stim periods. In supp 4,  there should be a trace of each body part in CR expert versus naïve. This should also  be done for Hit trials as a sanity check. I am not sure that the brightness difference  between consecutive frames is the best measure. Rather try to calculate frame-to frame correlation. In general, body movement analysis is super important and should  be carefully analyzed.

      Due to limitations in the experimental design and implementation, movement tracking was not performed during the electrophysiological recordings, and the 3 mice shown in Figure S4 (now S5) were from a separate group. We have carefully examined the temporal profiles of mouse movements and found that they did not fully match the rank dynamics for all regions; we have added these results and the related discussion to the revised manuscript. However, we acknowledge that the observed motion energy pattern could explain some of the functional connection dynamics; for example, the decrease in face and pupil motion energy could explain the reduction in ranks for striatum.

      Without synchronized movement recordings in the main dataset, we cannot fully disentangle movement-related neural activity from task-related signals. We have made this limitation explicit in the revised manuscript and discuss it as a potential confound, along with possible approaches to address it in future work.

      (16) For Hit trials, in the striatum, there is an increase in input rank around the  response period, and from Figure S6 it is clear that this is lick-related. Other than that,  the authors report other significant changes across learning and point out to Figure 5b,c. I couldn't see which areas and when it occurred.

      We did naturally expect the activity in striatum to be strongly related to movement.

      With Figure S6 (now S7) we wished to show that the observed rank increase for striatum could not simply be attributed to changes in time of lick initiation.

      As some readers may argue that during learning the mice might have learned to only intensely lick after response signal onset, causing the observed rise of input rank after response signal, we realigned the spikes in each trial to the time of the first lick, and a strong difference could still be observed between early training stage and expert training stage.

      We still cannot fully rule out the effects of more subtle movement changes, as the face motion energy did increase in the early response period. This result and the related discussion have been added to the results section of the revised manuscript.

      (17) Figure 6, again, is rather hard to grasp. There are 16 panels, spread over 4 areas,  input and output, stim and response. What is the take home message of all this?  Visually, it's hard to differentiate between each panel. For me, it seems like all the  panels indicate that for all 4 areas, both in output and input, frontal areas increase in  rank. This take-home message can be visually conveyed in much less tedious ways.  This simpler approach is actually conveyed better in the text than in the figures  themselves. Also, the whole explanation on how this analysis was done, was not clear  from the text. If I understand it, you just divided and ranked the general input (or  output) into individual connections? If so, then this should be better explained.

      We appreciate this advice and have compressed the figures to better convey the main message. The rankings for Figure 6 and Figure S8 (now Figure S9) were explained in the left panel of Figure 3C. Each non-zero element in the connection matrix was assigned a rank from 1 to 10, with a value of 10 representing the 10% strongest non-zero elements in the matrix.

      We have updated the figure legends of Figure 3, and we have also updated the description in methods (Connection rank analyses) to give a clearer description of how the analyses were applied in subsequent figures.
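
      The element-wise ranking described above (ranks 1 to 10 over the non-zero entries, with 10 marking the strongest 10%) can be sketched as a decile assignment. The function name is hypothetical, and several details are assumptions for illustration only: connection strengths are taken to be non-negative, zero entries are left at 0, and tied values share the lowest applicable rank.

```python
from bisect import bisect_left
from typing import List, Sequence


def decile_rank_matrix(matrix: Sequence[Sequence[float]]) -> List[List[int]]:
    """Replace each non-zero connection strength with a rank from 1 to 10.

    Rank 10 marks the strongest 10% of non-zero elements, rank 1 the
    weakest 10%. Zero elements (no detected connection) stay 0.
    """
    nonzero = sorted(v for row in matrix for v in row if v != 0)
    n = len(nonzero)
    ranked = []
    for row in matrix:
        ranked.append([
            # Position among sorted non-zero values, mapped onto deciles 1..10.
            0 if v == 0 else bisect_left(nonzero, v) * 10 // n + 1
            for v in row
        ])
    return ranked
```

      Because ranks depend only on the ordering of strengths within a matrix, a uniform scaling of all connections (for example, one driven by a global change in firing rate) leaves the ranks unchanged, which is both the motivation for the approach and the source of the limitation acknowledged above.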

      (18) Figure 7: Here, the authors perform a ROC analysis between go and no-go stimuli. They balance between choice, but there is still an essential difference between a hit and a FA in terms of movement and licks. That is maybe why there is a big difference in selective units during the response period. For example, during a Hit trial the mouse licks and gets a reward, resulting in more licking and excitement. In FAs, the mouse licks, but gets punished, which causes a reduction in additional licking and movements. This could be a simple explanation why the ROC was good in the late response period. Body movement analysis of Hit and FA should be done as in Figure S4.

      We appreciate this insightful advice.

      Though we balanced the numbers of the basic trial types, we could not rule out a difference in the intrinsic amount of movement between FA trials and Hit trials, which is likely the reason for the large proportion of encoding neurons in the response period.

      We have added this discussion to both the results section and the discussion section, along with the need for a more carefully designed behavioral paradigm to disentangle task information.

      (19) The authors also find selective neurons before stimulus onset, and refer to trial  history effects. This can be directly checked, that is if neurons decode trial history.

      We attempted encoding analyses on trial history, but regrettably for our dataset we could not find enough trials to construct a dataset with fully balanced trial history, visual stimulus and behavior choice.

      (20) Figure 7e. What is the interpretation for these results? That areas which peaked  earlier had more input and output with other areas? So, these areas are initiating  hubs? Would be nice to see ACC vs Str traces from B superimposed on each other.  Having said this, the Str is the only area to show significant differences in the early  stim period. But is also has the latest peak time. This is a bit of a discrepancy.

      We appreciate this important point.

      The limited anatomical coverage of brain regions restricted our interpretation of these findings. These areas could be initiating hubs, or early receivers of signals from true initiating hubs that were not monitored in our study.

      The Str trace was in fact above the ACC trace, especially in the response period. This could be explained by the concern raised in point 18 above: since we could not rule out a difference in the intrinsic amount of movement between FA trials and Hit trials, and considering that striatal activity is strongly related to movement, the Str trace may largely reflect motion-related differences in spike counts between FA trials and Hit trials rather than differences related to the visual stimulus.

      This further shows the necessity of more carefully designed behavior paradigm to disentangle task information.

      The striatum trace also did not in fact show a true double-peak form like the traces in other regions; it ramped up during the stimulus period and peaked only in the response period. This description has now been added to the results section.

      In the early stim period, the Striatum did show significant differences in average percent of encoding neurons, as the encoding neurons were stably high in expert stage. The striatum activity is more directly affected Still the percentage of neurons only reached peak in late stimulus period.

      (21) For the optogenetic silencing experiments, how many mice were trained for each  group? This is not mentioned in the results section but only in the legend of Figure 8. This part is rather convincing in terms of the necessity for OFC and V2M

      We have now included the numbers of mice in the results section as well.

      (C) Discussion

      (1) There are several studies linking sensory areas to frontal networks that should be mentioned, for example, Esmaeili et al., 2022, Matteucci et al., 2022, Guo et al., 2014, Gallero Salas et al., 2021, Jerry Chen et al., 2015. Sonja Hofer papers, maybe. Probably more.

      We appreciate this advice. We have now included one of the mentioned papers (Esmaeili et al., 2022) in the results section and discussion section for its direct characterization of the enhanced coupling between the somatosensory region and the frontal (motor) region during sensory learning. The other studies mentioned here seem to focus more on differences in encoding properties between regions along specific cortical pathways, rather than on functional connection or interregional activity correlation, and we feel they are not directly related to the observations discussed.

      (2) The reported reorganization of brain-wide networks with shifts in time is best described also in Sych et al. 2021.

      We regret that we did not include this important research, and we have now cited it in the discussion section.

      (3) Regarding the discussion about more widespread stimulus encoding after learning,  the results indicate that the striatum emerges first in decoding abilities (Figure 7c left  panel), but this is not discussed at all.

      We briefly discussed this in the result section. We tend to attribute this to trial history signal in striatum, but since the structure of our data could not support a direct encoding analysis on trial history, we felt it might be inappropriate to over-interpret the results.

      (4) An important issue which is not discussed is the contribution of movement which  was shown to have a strong effect on brain-wide dynamics (Steinmetz et al 2019;  Musall et al 2019; Stringer et al 2019; Gilad et al 2018) The authors do have some movement analysis, but this is not enough. At least a discussion of the possible effects of movement on learning-related dynamics should be added.

      We have included these studies in discussion section accordingly. Since the movement analyses were done in a separate cohort of mice, we have made our limitation explicit in the revised manuscript and discuss it as a potential confound, along with possible approaches to address it in future work.

      (D) Methods

      (1) How was the light delivery of the optogenetic experiments done? Via fiber  implantation in the OFC? And for V2M? If the red laser was on the skull, how did it get  to the OFC?

      The fibers were placed on cortex surface for V2M group, and were implanted above OFC for OFC manipulation group. These were described in the viral injection part of the methods section.

      (2) No data given on how electrode tracking was done post hoc

      As noted in our response to point 3 of the results section, the electrode shanks were ultra-thin (1-1.5 µm), and it was usually difficult to recover observable tracks or electrodes in brain sections.

      As an attempt to verify the accuracy of implantation depth, we measured the repeatability of implantation in a group of mice and found a tendency for the arrays to end in slightly deeper location in cortex (142.1 ± 55.2 μm, n = 7 shanks), and slightly shallower location in subcortical structure (-122.6 ± 71.7 μm, n = 7 shanks). We added these results as new Figure S1 to accompany Figure 1.

      Reviewer #3 (Recommendations for the authors):

      (1) The manuscript uses decision-making in the title, abstract and introduction. However, nothing is related to decision learning in the results section. Mice simply learned to suppress licking in no-go trials. This type of task is typically used to study behavioral inhibition. And consistent with this, the authors mainly identified changes related to network on no-go trials. I really think the title and main message is misleading. It is better to rephrase it as visual discrimination learning. In the introduction, the authors also reviewed multiple related studies that are based on learning of visual discrimination tasks.

      We do view the Go/No-Go task as a specific genre of decision-making task, as previous literature has discussed it as a decision-making task under the framework of signal detection theory or of updating item values (Carandini & Churchland, 2013; Veling, Becker, Liu, Quandt, & Holland, 2022).

      We do acknowledge the essential differences between the Go/No-Go task and the tasks that require the animal to choose between alternatives, and since we have now realized some readers may not accept this task as a decision task, we have changed the title to visual discrimination task as advised.

      (2) Learning induced a faster onset on CR trials. As the no-go stimulus was not presented to mice during early stages of training, this change might reflect the perceptual learning of relevant visual stimulus after repeated presentation. This further confirms my speculation, and the decision-making used in the title is misleading.

      We have changed the title to visual discrimination task accordingly.

      (3) Figure 1E, show one hit trial. If the second 'no-go stimulus' is correct, that trial might be a false alarm trial as mice licked briefly. I'd like to see whether continuous licking can cause motion artifacts in recording.

      We appreciate this important point. There were indeed licking artifacts with continuous licking in Hit trials, which was part of the reason we focused our analyses on CR trials. Opto-based lick detectors may help to reduce the artefacts in future studies.

      (4) What is the rationale for using a threshold of d' < 2 as the early-stage data and d' > 3 as expert stage data?

      The thresholds were chosen as a trade-off between two practical needs: gathering enough CR trials in the early training stage, while keeping performance in that stage relatively low.

      Assuming the mice licked in 95% of Go stimulus trials, d' < 2 corresponds to a performance level at which the mouse correctly rejected fewer than 63.9% of No-Go stimulus trials, and d' > 3 corresponds to a performance level at which the mouse correctly rejected more than 91.2% of No-Go stimulus trials.
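
The correspondence between the d' thresholds and correct-rejection rates can be checked with a short calculation. The sketch below assumes the standard signal-detection definition d' = z(hit rate) - z(false-alarm rate) and the 95% hit rate stated above; the function name is illustrative and not taken from the manuscript.

```python
from statistics import NormalDist


def correct_rejection_rate(d_prime: float, hit_rate: float) -> float:
    """Correct-rejection rate implied by d' = z(hit) - z(false alarm).

    Assumes the equal-variance Gaussian signal-detection model.
    """
    std_normal = NormalDist()               # standard normal, mean 0, sd 1
    z_hit = std_normal.inv_cdf(hit_rate)    # z-score of the hit rate
    z_false_alarm = z_hit - d_prime         # rearranged definition of d'
    false_alarm_rate = std_normal.cdf(z_false_alarm)
    return 1.0 - false_alarm_rate           # CR rate = 1 - false-alarm rate


for threshold in (2.0, 3.0):
    cr = correct_rejection_rate(threshold, hit_rate=0.95)
    print(f"d' = {threshold:.0f} -> CR rate = {cr:.1%}")
# d' = 2 -> CR rate = 63.9%
# d' = 3 -> CR rate = 91.2%
```

A different assumed hit rate shifts both percentages, so the quoted values hold only for the 95% case.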

      (5) Figure 2A, there is a change in baseline firing rates in V2M, MDTh, and Str. There is no discussion. But what can cause this change? Recording instability, problem in spike sorting, or learning?

      It is highly possible that the firing rates before visual stimulus onset are affected by the previous reward history and the task engagement state of the mice. Notably, although recorded simultaneously in the same sessions, the changes in baseline firing rates seen on CR trials in the V2M region were not observed on Hit trials.

      Thus, although we cannot completely rule out recording instability, we see this as evidence that changes in trial history or task engagement during learning affect firing rates.

      References:

      Carandini, M., & Churchland, A. K. (2013). Probing perceptual decisions in rodents. Nat Neurosci, 16(7), 824-831. doi:10.1038/nn.3410.

      Cruz, K. G., Leow, Y. N., Le, N. M., Adam, E., Huda, R., & Sur, M. (2023). Cortical-subcortical interactions in goal-directed behavior. Physiol Rev, 103(1), 347-389. doi:10.1152/physrev.00048.2021

      Esmaeili, V., Oryshchuk, A., Asri, R., Tamura, K., Foustoukos, G., Liu, Y., Guiet, R., Crochet, S., & Petersen, C. C. H. (2022). Learning-related congruent and incongruent changes of excitation and inhibition in distinct cortical areas. PLOS Biology, 20(5), e3001667. doi:10.1371/journal.pbio.3001667

      Goldbach, H. C., Akitake, B., Leedy, C. E., & Histed, M. H. (2021). Performance in even a simple perceptual task depends on mouse secondary visual areas. Elife, 10, e62156. doi:10.7554/eLife.62156.

      Siegle, J. H., Jia, X., Durand, S., Gale, S., Bennett, C., Graddis, N., Heller, G., Ramirez, T. K., Choi, H., Luviano, J. A., Groblewski, P. A., Ahmed, R., Arkhipov, A., Bernard, A., Billeh, Y. N., Brown, D., Buice, M. A., Cain, N., Caldejon, S., Casal, L., Cho, A., Chvilicek, M., Cox, T. C., Dai, K., Denman, D. J., de Vries, S. E. J., Dietzman, R., Esposito, L., Farrell, C., Feng, D., Galbraith, J., Garrett, M., Gelfand, E. C., Hancock, N., Harris, J. A., Howard, R., Hu, B., Hytnen, R., Iyer, R., Jessett, E., Johnson, K., Kato, I., Kiggins, J., Lambert, S., Lecoq, J., Ledochowitsch, P., Lee, J. H., Leon, A., Li, Y., Liang, E., Long, F., Mace, K., Melchior, J., Millman, D., Mollenkopf, T., Nayan, C., Ng, L., Ngo, K., Nguyen, T., Nicovich, P. R., North, K., Ocker, G. K., Ollerenshaw, D., Oliver, M., Pachitariu, M., Perkins, J., Reding, M., Reid, D., Robertson, M., Ronellenfitch, K., Seid, S., Slaughterbeck, C., Stoecklin, M., Sullivan, D., Sutton, B., Swapp, J., Thompson, C., Turner, K., Wakeman, W., Whitesell, J. D., Williams, D., Williford, A., Young, R., Zeng, H., Naylor, S., Phillips, J. W., Reid, R. C., Mihalas, S., Olsen, S. R., & Koch, C. (2021). Survey of spiking in the mouse visual system reveals functional hierarchy. Nature, 592(7852), 86-92. doi:10.1038/s41586-020-03171-x

      Sych, Y., Fomins, A., Novelli, L., & Helmchen, F. (2022). Dynamic reorganization of the cortico-basal ganglia-thalamo-cortical network during task learning. Cell Rep, 40(12), 111394. doi:10.1016/j.celrep.2022.111394

      Veling, H., Becker, D., Liu, H., Quandt, J., & Holland, R. W. (2022). How go/no-go training changes behavior: A value-based decision-making perspective. Current Opinion in Behavioral Sciences, 47, 101206. doi:10.1016/j.cobeha.2022.101206

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The authors' goal was to arrest PsV capsids on the extracellular matrix using cytochalasin D. The cohort was then released, and interaction with the cell surface, specifically with CD151, was assessed.

      The model that fragmented HS associated with released virions mediates the dominant mechanism of infectious entry has only been suggested by research from a single laboratory and has not been verified in the 10+ years since publication. The authors are basing this study on the assumption that this model is correct, and these data are referred to repeatedly as the accepted model despite much evidence to the contrary.

      We stated in the introduction on lines 65-66: ‘Two release mechanisms are discussed, that mutually are not exclusive’. This implies that we do not consider the shedding model to be ‘the accepted model’. Furthermore, we do not state in the discussion that the shedding model is the preferred one. However, we referred to the shedding model in the discussion because we find HS associated with transferred PsVs, which is in line with this model.

      The discussion in lines 65-71 concerning virion and HSPG affinity changes is greatly simplified. The structural changes in the capsid induced by HS interaction and the role of this priming for KLK8 and furin cleavage have been well researched. Multiple laboratories have independently documented this. If this study aims to verify the shedding model, additional data need to be provided.

      Our findings are compatible with both models; we aim neither to verify the shedding model nor to disprove the priming model. However, as we understand it, the referee wishes to see more visibility for the priming model. Therefore, using inhibitors previously used in the field, we tested whether inhibition of KLK8 or furin reduces PsV translocation to the cell body (after CytD wash off). Leupeptin blocks transport, while Furin inhibitor I still allows some initial translocation. We incorporated these new data as Figure 2 (line 265): ‘…we would expect that inhibition of L1 processing during the CytD incubation prevents the recovery of PsV translocation from the ECM to the cell body (Figure 2A and D). To test for this possibility, as employed in earlier studies, the protease inhibitor leupeptin was used to inhibit proteases including KLK8 which is required for L1 cleavage (Cerqueira et al. 2015). Employing this inhibitor, the PCC between PsV-L1 and F-actin staining remains negative after CytD removal, showing that for translocation indeed the action of proteases is required (Figure 2B and D). In contrast, inhibition of L2 cleavage by a furin specific inhibitor has no effect on the PCC (Figure 2C and D). However, it should be noted that we occasionally observe PsVs not completely translocating but accumulating at the border of the F-actin stained area (for example see Figure 2C (60 min)). This results in an increase of the PCC almost equal to complete translocation, explaining why the PCC remains unaffected despite a furin inhibitory effect. Hence, furin inhibition may have some effect on translocation that, however, is undetected in this type of analysis.’
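
To illustrate why accumulation at the border of the F-actin-stained area can raise the PCC as much as complete translocation does, here is a minimal toy sketch. It assumes that PCC denotes the Pearson correlation coefficient between the intensities of two channels; the one-dimensional intensity profiles and all names are hypothetical and are not data or code from the study.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical profile: pixels 0-4 are the cell body (bright F-actin),
# pixels 5-9 are the surrounding ECM (dim F-actin).
f_actin = [10, 10, 10, 10, 10, 1, 1, 1, 1, 1]

psv_on_ecm = [0, 0, 0, 0, 0, 0, 0, 5, 5, 5]        # PsVs trapped in the ECM
psv_at_border = [0, 0, 5, 5, 5, 0, 0, 0, 0, 0]     # PsVs stalled at the border
psv_translocated = [5, 5, 5, 0, 0, 0, 0, 0, 0, 0]  # PsVs deep in the cell body

for label, psv in [("ECM", psv_on_ecm),
                   ("border", psv_at_border),
                   ("translocated", psv_translocated)]:
    print(f"{label}: PCC = {pearson(f_actin, psv):.2f}")
```

In this toy profile the F-actin signal is flat across the cell body, so the PCC cannot tell the border case from the fully translocated case, whereas PsVs remaining in the ECM give a negative PCC.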

      Moreover, we have added a paragraph discussing how our data integrates into the established model of the HPV infection cascade (line 604): ‘HPV infection is the result of several steps, starting with the initial binding of virions via electrostatic and polar interactions (Dasgupta et al. 2011) to the primary attachment site HS (Richards et al. 2013), which induces capsid modification (Feng et al. 2024; Cerqueira et al. 2015) and HS cleavage (Surviladze et al. 2015), enabling the virion to be released from the ECM or the glycocalyx. Next, virions bind to the cell surface to a secondary receptor complex that forms over time, and become internalized via endocytosis, before they are trafficked to the nucleus (Ozbun and Campos 2021; Mikuličić et al. 2021). Regarding the transition from the primary attachment site to cell surface binding, as already outlined in the introduction, two models are discussed. In one model, proteases cleave the capsid proteins. After priming, the capsids are structurally modified and the virion can dissociate from its HS attachment site. It has been suggested that capsid priming is mediated by KLK8 (Cerqueira et al. 2015) and furin (Richards et al. 2006). In our system, KLK8 inhibition blocks PsV transport, while furin inhibition has some effect that, however, cannot be detected in this analysis (Figure 2) suggesting furin engagement at later steps in the infection cascade. This is in line with earlier in vitro studies on the role of cell surface furin (Surviladze et al. 2015; Day et al. 2008; Day and Schiller 2009). In any case, our results align with both models of ECM detachment: one involving HS cleavage (HS co-transfer) and another involving capsid modification (by e.g., KLK8).’

      The model should be fitted into established entry events,…

      Please see our reply above.

      or at minimum, these conflicting data, a subset of which is noted below, need to be acknowledged.

      (1) The Sapp lab (Richards et al., 2013) found that HSPG-mediated conformational changes in L1 and L2 allowed the release of the virus from primary binding and allowing secondary receptor engagements in the absence of HS shedding.

      (2) Becker et al. found that furin-precleaved capsids could infect cells independently of HSPG interaction, but this infection was still inhibited with cytochalasin D.

      (3) Other work from the Schelhaas lab showed that cytochalasin D inhibition of infection resulted in the accumulation of capsids in deep invaginations from the cell surface, not on the ECM

      (4) Selinka et al., 2007, showed that preventing HSPG-induced conformational changes in the capsid surface resulted in noninfectious uptake that was not prevented with cytochalasin D.

      (5) The well-described capsid processing events by KLK8 and furin need to be mechanistically linked to the proposed model. Does inhibition of either of these cleavages prevent engagement with CD151?

      The authors need to consider an explanation for these discrepancies.

      We do not see any discrepancies; our observations are compatible with aspects of both the shedding and the priming model. That PsVs carry HS-cleavage products does not imply that HS cleavage is sufficient or required for infection, or that the priming model is wrong. We do not view our data as being in conflict with the priming model. Most of the above-mentioned papers are now cited.

      Altogether, we acknowledge that the study gains importance by directly testing the priming model within our experimental system. We are thankful for the above comments and have addressed this issue.

      Other issues:

      (1) Line 110-111. The statement about PsVs in the ECM being too far away from the cell surface to make physical contact with the cell surface entry receptors is confusing. ECM binding has not been shown to be an obligatory step for in vitro infection.

      Not obligatory, but strongly supportive (Bienkowska-Haba et al., Plos Path., 2018; Surviladze et al., J. Gen. Virol., 2015). As recently published by the Sapp lab (Bienkowska-Haba et al., Plos Path., 2018), ‘Direct binding of HPV16 to primary keratinocytes yields very inefficient infection rates for unknown reasons.’ Moreover, the paper shows that HaCaT cell ECM binding of PsVs increases the infection of NHEK by 10-fold and of HFK by almost 50-fold.

      This idea is referred to again on lines 158-159 and 199. The claim (line 158) that PsV does not interact with the cell within an hour needs to be demonstrated experimentally and seems at odds with multiple laboratories' data. PsV has been shown to directly interact with HSPG on the cell surface in addition to the ECM. Why are these PsVs not detected?

      The reviewing editor speculated that HaCaT cells may be a model system in which the in vivo relevant binding to the ECM can be studied better than in non-polarized cell types. This is because binding to the ECM cannot be bypassed by direct cell surface binding. The observation that only a few PsVs bind to the basal cell membrane indeed suggests restricted diffusional access of PsVs to binding receptors of the basal membrane. The reviewing editor asked for an experiment showing that more PsVs bind after cell detachment. We performed this experiment and indeed find more PsVs binding to the cell surface of detached cells. This point is very important for understanding the study, and we now mention it in several sections of the manuscript, as outlined in the following.

      Line 125: ‘Many PsVs that bind to the ECM may locate distal from the cell surface and are thus unable to establish direct contact with entry receptors. However, they are capable of migrating by an actin-dependent transport along cell protrusions towards the cell body (Smith et al. 2008; Schelhaas et al. 2008). We aimed for blocking this transport in HaCaT cells, a cell line that is widely used as a cell culture model for HPV infection. HaCaT cells closely resemble primary keratinocytes in key aspects: they are not virally transformed and produce large amounts of ECM that facilitates infection (Bienkowska-Haba et al. 2018; Gilson et al. 2020). In addition, HaCaT cells exhibit cellular polarity that enforces binding of virus particles to the ECM, as the virions cannot bind to receptors/entry components, such as CD151, Itgα6 and HSPGs that co-distribute on the basolateral membrane of polarized keratinocytes (Sterk et al. 2000; Cowin et al. 2006; Mertens et al. 1996), making them inaccessible by diffusion.’

      Line 205: ‘During the CytD incubation, PsVs bind to HSPGs of the basolateral membrane for 5 h. Still, in the cell body area hardly any PsVs are present (0.14 PsV/µm<sup>2</sup>, Supplementary Figure 1B). In the control, the PsV density is several-fold larger (Supplementary Figure 1B). This is expected, as the PsVs bind to the ECM and translocate to the cell body. We wondered whether there are more binding sites at the basal membrane that remain inaccessible to PsVs by diffusion because of the insufficient space between glass-coverslip and basolateral membrane. For clarification, we incubated EDTA detached HaCaT cells in suspension with PsVs for 1 h at 4 °C, followed by re-attachment for 1 h. Under these conditions, we find a PsV density 12.4-fold larger than after 5 h of CytD incubation of adhered cells (Supplementary Figure 1B and D). However, it should be noted that these values cannot be directly compared. Aside from the different treatments, another difference lies in the size of the basal membrane, as re-attachment of cells is not complete after only 1 h (compare size of adhered membranes in Supplementary Figure 1A and C). Therefore, the imaged membranes are likely strongly ruffled, which results in the underestimation of the size of the adhered membrane. As a result, we overestimate the PsVs per µm<sup>2</sup> (please note that we cannot re-attach cells for longer times as we would then lose PsVs due to endocytosis). On the other hand, we would underestimate the PsV density at the basal membrane if after re-attachment we image in part also some apical membrane. In any case, the experiment suggests that PsVs bind more efficiently if membrane surface receptors are accessible by diffusion. This is in support of the above notion that the basal membrane may provide more entry receptors than one would expect from the low density of PsVs bound after 5 h CytD (Supplementary Figure 1B). 
This suggests that under our assay conditions, PsVs cannot easily bypass the translocation from the ECM to the cell body by diffusing directly to the basal membrane. Hence, the large majority of PsVs that enter the cell were previously bound to the ECM. Therefore, HaCaT cells serve as an ideal model for studying the transfer of ECM bound HPV particles to the cell surface, which is similar to in vivo infection of basal keratinocytes after binding to the basement membrane (Day and Schelhaas 2014; Kines et al. 2009; Schiller et al. 2010; Bienkowska-Haba et al. 2018).’

      Line 529: ‘Filopodia usage not only facilitates infection but also increases the likelihood of virions to reach their target cells during wound healing, namely the filopodia-rich basal dividing cells. In fact, several types of viruses exploit filopodia during virus entry (Chang et al. 2016), hinting at the possibility that for HPV and other types of viruses actin-driven virion transport may play a more important role than it is currently assumed. If this is the case, sub-confluent HaCaT cells, or even better single HaCaT cells, would be an ideal model system for the study of these very early infection steps that involve ECM attachment and subsequent filopodia-dependent transport. As shown in Supplementary Figure 1, HaCaT cells have many binding sites for the HPV16 PsVs. However, as they are polarized and the binding receptors are only at the basal membrane, they remain relatively inaccessible by diffusion. Therefore, the ECM binding that is also observed in vivo (Day and Schelhaas 2014) and subsequent transport via filopodia are used upon infection of HaCaT cells that locate at the periphery of cell patches. Here, PsVs bind to the ECM which strongly enhances infection of primary keratinocytes (Bienkowska-Haba et al. 2018). In contrast, HPV can readily bind to HSPGs on the cell surface of nonpolarized cells, and by this bypasses ECM mediated virus priming and the filopodia dependency. We propose that HaCaT cells are a valuable system for studying the very early events in HPV infection that allows for dissecting capsid interaction with ECM resident priming factors and cell surface receptors.’

      Finally, please note that in the previous version of the manuscript, we did not question that in many cellular systems PsVs interact with heparan sulfate proteoglycans (HSPGs) present on the cell surface, or both on the cell surface and the ECM. We stated on line 59: ‘While in cell culture virions bind to HS of the cell surface and the ECM, it has been suggested that in vivo they bind predominantly to HS of the extracellular basement membrane (Day and Schelhaas, 2014; Kines et al., 2009; Schiller et al., 2010).’

      We hope that after adding the above explanations and the experiment requested by the reviewing editor it is now clear why only few PsVs bind directly (not via the ECM) to the cell surface. We appreciate the reviewer’s and the reviewing editor’s input that has significantly improved the manuscript.

      (2) The experiments shown in Figure 5 need to be better controlled. Why is there no HS staining of the cell surface at the early timepoints? This antibody has been shown to recognize N-sulfated glucosamine residues on HS and, therefore, detects HSPG on the ECM and cell surface.

      There is staining. However, as the staining at the periphery is stronger and images are shown at the same settings of brightness and contrast, the impression is given that the cell surface is not stained. We have added more images showing HS cell surface staining.

      (i) Supplementary Figure 4C shows an enlarged view of the CytD/0 min cell shown in Figure 6A. In the area stained by Itgα6, that marks the cell body, HS staining is present, although less abundant in comparison to the ECM.

      (ii) In Figure 8, CytD/30 min, a cell is shown with abundant HS in the cell body region (compare cyan and green LUT).

      (iii) In newly added Figure 3A, lower panel, another cell with HS in the cell body region is shown.

      Please note that the staining is highly variable. We indicate this by stating on Line 373: ‘The pattern of the HS staining (cyan LUT) and the overlap of HS with PsVs and Itgα6 are highly variable (Figure 6A).’

      Therefore, the conclusion that this confirms HS coating of PsV during release from the ECM (line 430-431) is unfounded. How do the authors distinguish between "HS-coated virions" and HSPG-associated virions?

      The transient increase in the PCC at CytD/30 min can be interpreted as PsV/HS co-transport or as direct binding of PsVs to cell surface HSPGs. However, two arguments support co-transport.

      First, we find that CytD/PsVs increases the HS intensity (see newly added Figure 3, confirming old Figure 5 that is now Figure 6). We state on line 290: ‘… that without actin-dependent PsV translocation HS cleavage products are retained in the ECM, consistent with the hypothesis that cleaved HS remains associated with PsVs (Ozbun and Campos 2021).’

      Second, the distance between HS and Itgα6 (the cell body marker) decreases over time after CytD removal, which suggests movement of HS to the cell body (Supplementary Figure 8D). We state on line 422: ‘The movement of HS towards the cell body after removal of CytD, which indirectly demonstrates that PsVs are coated with HS, is suggested by a shortening of the HS-Itgα6 distance over time (Supplementary Figure 8D).’

      It is difficult to comprehend how the addition of 50 vge/cell of PsV could cause such a global change in HS levels.

      Some areas are covered with confluent cells, to which hardly any PsVs are bound, because accessing their basolateral membrane is nearly impossible and PsVs do not bind to the exposed apical membrane either. We assume this is a major difference from cultures of unpolarized cells, where PsVs should distribute more or less equally over cells. This means that in our experiments the vge/cell is not a suitable parameter for relating the magnitude of an effect to a defined number of PsVs. In the ECM, the PsV density is very high, enabling one cell to collect, in theory, several hundred PsVs, far more than expected from the 50 vge/cell.

      We state on line 135: ‘Frequently, we observe patches of confluent cells which are common to HaCaT cells. Cells at the center of these patches are dismissed during imaging, because there are no anterogradely migrating PsVs at these cells. A second reason for our dismissal of these cells is that hardly any PsVs are bound to them, possibly because their basal membranes are inaccessible by diffusion. Instead, we focus on isolated HaCaT cells or cells at the periphery of cell patches. In these cells, we find more PsVs per cell than one would expect from the employed 50 viral genome equivalents (vge) per cell, indicating that PsVs are unequally distributed between the cells.’

      The claim that the HS levels are decreased in the non-cytochalasin-treated cells due to PsV-induced shedding needs to be demonstrated.

      We did not claim that PsVs induce shedding; rather, we believe they retain shed HS. Without PsVs, the shed HS is washed off from the ECM. We have reproduced the observation made in old Figure 5 (now Figure 6) in the newly added Figure 3, which also shows that PsVs alone have no effect on the HS intensity, only when present together with CytD. We state on line 277: ‘As outlined above, during the 5 h incubation with CytD, proteases in the ECM are expected to cleave HS chains. These cleavage products should be able to diffuse out of the ECM, unless they remain associated with non-translocating PsVs. In the control, PsV associated HS cleavage products would leave the ECM through PsV translocation…. Using an antibody that reacts with an epitope in native heparan sulfate chains, only after CytD and if PsVs are present, the level of HS staining is significantly increased (Figure 3B). As shown in Figure 3A, stronger HS staining at PsVs (open arrows) and as well in PsV free areas (closed arrows) was observed… Collectively, our findings indicate that without actin-dependent PsV translocation HS cleavage products are retained in the ECM, consistent with the hypothesis that cleaved HS remains associated with PsVs (Ozbun and Campos 2021).’

      If HS is actually shed, staining of the cell periphery could increase with the antibody 3G10, which detects the HS neoepitope created following heparinase cleavage.

      We have tested this antibody, but it yields only very weak staining (Supplementary Figure 2), which does not allow us to distinguish between an increase in the cell periphery and in the cell body area. We still include the experiment as it suggests that CytD has no effect on HS processing. We state on line 286: ‘As additional control and shown in Supplementary Figure 2, we use an antibody that reacts with a HS neo-epitope generated by heparitinase-treated heparan sulfate chains (Yokoyama et al. 1999; for details see methods). This neo-epitope staining is independent of the presence of CytD and the incubation time, suggesting that CytD does not directly affect HS processing.’

      Reviewer #2 (Public review):

      Summary:

      Massenberg and colleagues aimed to understand how Human papillomavirus particles that bind to the extracellular matrix (ECM) transfer to the cell body for later uptake, entry, and infection. The binding to ECM is key for getting close to the virus's host cell (basal keratinocytes) after a wounding scenario for later infection in a mouse vaginal challenge model, indicating that this is an important question in the field.

      Strengths:

      The authors take on a conceptually interesting and potentially very important question to understand how initial infection occurs in vivo. The authors confirm previous work that actin-based processes contribute to virus transport to the cell body. The superresolution microscopy methods and data collection are state-of-the art and provide an interesting new way of analysing the interaction with host cell proteins on the cell surface in certain infection scenarios. The proposed hypothesis is interesting and, if substantiated, could significantly advance the field.

      Weaknesses:

      As a study design, the authors use infection of HaCaT keratinocytes, and follow virus localisation with and without inhibition of actin polymerisation by cytochalasin D (cytoD) to analyse transfer of virions from the ECM to the cell by filopodial structures using important cellular proteins for cell entry as markers.

      First, the data is mostly descriptive besides the use of cytoD, and does not test the main claim of their model, in which virions that are still bound to heparan sulfate proteoglycans are transferred by binding to tetraspanins along filopodia to the cell body.

      The study identifies a rapid translocation step from the ECM to CD151 assemblies. We have no data that demonstrate a physical interaction between PsVs and CD151. In the model figure, we draw CD151 as part of the secondary receptor complex. We are sorry for having given the impression that PsVs bind directly to CD151 and have modified the model figure accordingly. In the new model figure (Figure 9), the first contact established is to a CD151-free receptor.

      Second, using cytoD is a rather broad treatment that not only affects actin retrograde flow, but also virus endocytosis and further vesicular transport in cells, including exocytosis. Inhibition of myosin II, e.g., by blebbistatin, would have been a better choice as it, for instance, does not interfere with endocytosis of the virus.

      As we focus on early events, we are not concerned about CytD also blocking late steps in the infection cascade, such as endocytosis. However, we agree that a comparison between CytD and blebbistatin would be very interesting. We added Figure 8, showing that blebbistatin only partially stops migration.

      Line 429: ‘Actin retrograde transport, which underlies the here observed virion transport, is the integrative result of three components (Smith et al. 2008; Schelhaas et al. 2008)…. As CytD broadly interferes with F-actin dependent processes, we investigated the effects upon inhibition of only one of the three components, namely the myosin II mediated retrograde movement towards the cell body. Instead of CytD, we employed in the 5 h preincubation the myosin II inhibitor blebbistatin. For the control (0 min), we show in Figure 8A one example of a cell with comparatively many PsVs at the periphery (as mentioned above, the PsV pattern is highly variable) to better illustrate the difference to the PsV pattern occasionally seen with blebbistatin. After blebbistatin treatment (0 min), PsVs are still distal to the cell body but less dispersed than after CytD treatment, seemingly as if translocation started but stopped in the midst of the pathway (Figure 8A, blebbistatin). The PCC between PsVs and HS, like after CytD (Figure 6C), is elevated after blebbistatin, albeit the effect is not significant (Figure 8C). The cell body PCC, is not at 30 min (CytD) but already at 0 min elevated (compare Figure 6D to Figure 8D), which can be explained by partial translocation. This is further supported by the fact that only 8% of PsVs are closely associated with HS (Figure 8E; blebbistatin, 0 min) compared to 15% after CytD treatment (Figure 6E; 0 min). Furthermore, after 0 min PsV incubation with blebbistatin we observe no effect on the HS intensity (compare Figure 8B to Figure 3B and Figure 6B). Hence, in contrast to CytD, blebbistatin does not trap the PsVs in the ECM where they associate with HS, but ongoing actin polymerization pushes actin filaments along with PsVs towards the cell body.’

      Third, the authors aim to study transfer from ECM to the cell body and the effects thereof. However, there are substantial, if not the majority of, viruses that bind to the cell body compared to ECM-bound viruses in close vicinity to the cells.

      Please see our detailed reply to referee #1, who raised the same issue. In brief, we agree that in many cell culture systems viruses preferentially bind directly to the cell surface. However, in HaCaT cells, the majority of PsVs do not bind directly to the basal membrane but reach it after initial binding to the ECM. Thus, we believe our system appropriately models the physiologically relevant scenario of ECM-to-cell transfer, as also speculated by the reviewing editor, who suggested an experiment showing that more PsVs bind to detached cells (please see above).

      This is in part obscured by the small subcellular regions of interest that are imaged by STED microscopy, or by the use of plasma membrane sheets. As a consequence, the obtained data from time point experiments is skewed, and remains for the most part unconvincing due to the fact that the origin of virions in time and space cannot be taken into account. This is particularly important when interpreting association with HS, the tetraspanin CD151, and integrin alpha 6, as the low degree of association could originate from cell-bound and ECM-transferred virions alike.

      As already stated above, we observe massive binding of PsVs to the ECM, in contrast to very few PsVs that diffuse beneath the basolateral membrane of the polarized HaCaT cells and do bind directly to the cell surface. In other cellular systems, cells may hardly secrete ECM, are not polarized, and therefore virions can easily bypass ECM binding. Therefore, it is reasonable to assume that in HaCaT cells the large majority of PsVs found on the cell body originates from the ECM.

      Fourth, the use of fixed images in a time course series also does not allow for understanding the issue of a potential contribution of cell membrane retraction upon cytoD treatment due to destabilisation of cortical actin. Or, of cell spreading upon cytoD washout.

      The newly added blebbistatin experiment suggests that the initial translocation is exclusively dependent on retrograde actin flow. However, we agree that we are not able to unravel more details regarding the different possible contributions to the movement. Importantly, the lack of a PCC increase after CytD/leupeptin removal (Figure 2D) suggests that there is not much cell spreading into the area of accumulated PsVs. Please see our more detailed reply to the same issue raised by the same referee in the recommendations for the authors.

      The microscopic analysis uses an extension of a plasma membrane stain as a marker for ECM-bound virions, which may introduce a bias and skew the analysis.

      The dye TMA-DPH stains exclusively cellular membranes and not the ECM. The stain is actually used to delineate the cell body from the ECM area (please see Figure 1).

      Fifth, while the use of randomisation during image analysis is highly recommended to establish significance (flipping), it should be done using only ROIs that have a similar density of objects for which correlations are being established.

      We agree that how the randomization is done is very important. Regarding the association of PsVs with CD151 and HS, we corrected for random background association, which is now explained in more detail in the figure legend of Supplementary Figure 7: ‘On flipped images, we often find values more than half of the values of the original images, demonstrating that many PsVs have a distance ≤ 80 nm to CD151 merely by chance (background association)… (C) Each time point in (A) and (B) obtained from flipped images is the average of three biological replicates. We use these altogether 24 data points, plotting the fraction of closely associated PsVs against the CD151 maxima density. The fraction increases with the maxima density, as the chance of random association increases with the maxima density. The fitted linear regression line describes the dependence of the background association from the maxima density. As a result, the background association (y) can be calculated for any maxima density (x) in original images with the equation y = 2.04x. Please note that the CytD/0 min may be overcorrected as we subtract background association with reference to the CD151 maxima density of the entire ROI (for an example ROI see Supplementary Figure 6A), although the local maxima density at distal PsVs is lower. On the other hand, PsVs at the cell border may have a larger local CD151 maxima density and consequently are undercorrected.’
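
      The background correction described in this legend can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names are hypothetical, the fit is assumed to pass through the origin (consistent with the reported y = 2.04x, which has no intercept), and x and y carry whatever units the legend's plot uses.

```python
def fit_slope_through_origin(densities, fractions):
    """Least-squares slope of y = slope * x, fitted to flipped-image data points.

    densities: maxima density per flipped-image data point (x)
    fractions: fraction of closely associated PsVs in the same images (y)
    """
    sxy = sum(x * y for x, y in zip(densities, fractions))
    sxx = sum(x * x for x in densities)
    return sxy / sxx


def background_corrected(observed_fraction, maxima_density, slope=2.04):
    """Subtract the expected chance association for the given maxima density.

    The default slope is the value quoted in the legend (y = 2.04x).
    """
    return observed_fraction - slope * maxima_density


# Toy numbers for illustration only, not data from the study.
slope = fit_slope_through_origin([1.0, 2.0, 3.0], [2.04, 4.08, 6.12])
corrected = background_corrected(15.0, 2.0, slope)
```

      Because the correction uses the maxima density of the whole ROI, a ROI whose local density differs from its average is over- or undercorrected, which is the caveat the legend itself raises.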

      For instance, if one flips an image with half of the image showing the cell body, and half of the image ECM, it is clear that association with cell membrane structures will only be significant in the original.

      We are aware of this problem. For instance, it would produce ‘artificially’ low PCCs after flipping images of PsV/HS stainings (please see the negative PCC value after flipping in Supplementary Figure 8). In this case, we do not use the lower PCC in flipped images as an argument. Instead, we argue that the PCC changes over time in the original images. We still provide the PCC values of flipped images as additional information, showing that in most cases we obtain a PCC of zero after flipping, as expected.

      Hence, we fully agree that careful controls in image analysis are required, and we used the above-described method to correct for background association when the fraction of closely associated PsVs is analyzed. We do not use a lower PCC value in flipped images as an argument if this is not appropriate.
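
      The flipping control discussed here can be illustrated with a small sketch. It is an assumption-laden example rather than the authors' pipeline: images are plain nested lists, the flip is assumed to be a horizontal mirror of one channel, and all function names are hypothetical.

```python
from math import sqrt


def pearson(img_a, img_b):
    """Pearson correlation coefficient (PCC) between two equally sized 2D images."""
    xs = [v for row in img_a for v in row]
    ys = [v for row in img_b for v in row]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def flip_horizontal(img):
    """Mirror every row; used to estimate the correlation expected by chance."""
    return [row[::-1] for row in img]


def pcc_original_and_flipped(channel_1, channel_2):
    """Return (PCC of the original pair, PCC after flipping channel_2)."""
    original = pearson(channel_1, channel_2)
    flipped = pearson(channel_1, flip_horizontal(channel_2))
    return original, flipped


# Toy 2x3 "images" for illustration only.
toy = [[1, 2, 3], [4, 5, 6]]
pcc_orig, pcc_flip = pcc_original_and_flipped(toy, toy)
```

      In an image that is half cell body and half ECM, the flipped PCC drops simply because the two halves are swapped. For that reason the reply argues from the change in PCC over time rather than from the flipped value in such cases.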

      I am rather convinced that using randomisation only on the plasma membrane ROIs will not establish any clear significance of the correlating signals.

      Figures 6D and 8D show the PCC specifically of the cell body (only of plasma membrane ROIs). In flipped images (not shown in the previous version for clarity), we obtain significantly lower PCCs (Supplementary Figure 8F/G and Supplementary Figure 10C/D). We propose that in this case it would be appropriate to use a lower PCC of flipped images as an argument for specific association. Still, in this experiment we also argue with a change in the PCC over time, and not with a PCC of zero after flipping. As above, we still provide the PCC values of flipped images as additional information.

      Also, there should be a higher n for the measurements.

      One replicate is based on the average of 14-15 cells for each condition (more for figure 4). Hence, in a typical experiment (Control and CytD with 4 time points) about 120 cells are analyzed, which is a broad basis for the averages of one replicate.

      We realize that with three biological replicates we find significant effects only if we have strong effects or moderate effects with very low variance.

      Recommendations for the authors:

      Reviewing Editor:

      The focus on the events of HPV infection between ECM binding and keratinocyte-specific receptor binding is unique and interesting. However, I agree with the reviewers that some of the conclusions could use more experimental support, as detailed in their comments. The failure to detect direct binding of the PsV to HSPGs on the cell surface in in vitro assays contradicts much of the published literature. For example, others have found that HPV capsids bind cultured cell lines in suspension, i.e., in the absence of ECM. Do EDTA-suspended HaCaT cells bind PsV? Is the binding HSPG dependent? If the authors think that failure to detect direct cell binding of HaCaTs is an unusual feature of these cell lines or culture conditions, then it would be helpful to provide an explanation. However, it is worth noting that an in vitro system where the cells do not directly bind capsids through HSPG interactions would be a much better model for studying the stages of HPV infection that are the focus of this study, since there is no direct binding of keratinocytes in vivo.

      We are thankful for this comment that had a strong influence on the revision. The suggested experiment has been incorporated as new Supplementary Figure 1. It shows that many more PsVs bind to the cell surface of cells in suspension than to adhered cells. As suggested by the reviewing editor, we explain now that HaCaT cells are a suitable model system for studying the in vivo transport from the ECM to the cell body that in these cells, due to their polarization, cannot be bypassed (for more details please see our replies above addressing these issues).

      Because conclusions drawn regarding HS interactions are largely based on experiments using a single HS mAb, it is important that the specificity of this mAb is described in more detail, either based on the literature or further experimentation.

      We provide now detailed information about the HS antibodies used in the study. We state on line 282 ‘Using an antibody that reacts with an epitope in native heparan sulfate chains…’ and on line 286 ‘we use an antibody that reacts with a HS neo-epitope generated by heparitinase-treated heparan sulfate chains…’ and in the methods section ‘For Heparan sulfate (HS) a mouse IgM monoclonal antibody (1:200) (amsbio, cat# 370255-S) was used that reacts with an epitope in native heparan sulfate chains and not with hyaluronate, chondroitin or DNA, and poorly with heparin (mAb 10E4 (David et al., 1992)). For HS neo-epitope (Yokoyama et al., 1999) detection, a mouse monoclonal antibody (1:200) (amsbio, cat#370260-S) was used that reacts only with heparitinase-treated heparan sulfate chains, proteoglycans, or tissue sections, and not with heparinase treated HSPGs. The antibody recognizes desaturated uronic acid residues (mAb 3G10 (David et al., 1992)).’

      Reviewer #1 (Recommendations for the authors):

      (1) The phrase "tight association" or similar is repeatedly used and is not acceptable for microscopic studies; use "close association", which has no affinity connotations.

      Has been changed as suggested by the referee.

      (2) Why are lysine-coated coverslips used for microscopy? HaCaT cells adhere tightly to untreated glass, and this coating could affect the distribution of ECM and extracellular PsV.

      We believe a tight association of the basal cell membrane to its substrate, as in vivo, where the basal membrane is tightly adhered to other cells, is important in these experiments. In weakly adherent cells more PsVs may bind to the cell surface, bypassing the transport step. Hence, although HaCaT cells may not require the coat and would be able to adhere to glass, the association may not be tight enough to mimic in vivo conditions.

      (3) What is the reason to use detection of the pseudogenome for some of the experiments instead of L1 detection throughout? The process of EdU detection is sufficiently denaturing to affect some protein epitopes. The introduction of this potential artifact doesn't seem warranted for capsid detection experiments.

      The L1 and the Itgα6 antibodies are from the same species, which is why we used click-labeling of the reporter plasmid in Figures 4 and 6. We do not disagree with the referee's notion that EdU detection may denature the epitopes of some proteins. For instance, we have observed a different staining pattern for CD151; for Itgα6 and HS we saw no obvious difference in the staining patterns. In double staining experiments using L1 antibody and click-labeling, both staining patterns overlapped very well, indicating that click-labeling is suitable to visualize PsVs.

      (4) What concentration of TMA-DPH was used?

      TMA-DPH is a poorly water-soluble dye that becomes strongly fluorescent upon insertion into a membrane. Because of its poor water solubility, a precise concentration cannot be given. We added 50 µl of a saturated TMA-DPH solution in PBS to 1 ml of PBS in the imaging chamber. We state this now in the methods section.

      (5) Line 419: This statement is misleading. Although PsV interaction with HSPG on the ECM is crucial for infectious transfer to cells, the majority of the PsV binding on the ECM has been attributed to interaction with laminin 332. Treatment of PsV with heparin causes sequestration to the ECM.

      We are sorry for the confusion and have removed the misleading statement.

      (6) Some reference choices are poor:

      Line 54: Ozbun and Campos, this is not the correct reference

      The introduction of the review we cited states that PsVs establish infection via a break in the epithelial barrier. Nevertheless, we have replaced this reference with a review that focuses more on epithelial wounding: ‘Ozbun, Michelle A. (2019): Extracellular events impacting human papillomavirus infections: Epithelial wounding to cell signaling involved in virus entry. In Papillomavirus research (Amsterdam, Netherlands) 7, pp. 188–192. DOI: 10.1016/j.pvr.2019.04.009.’

      Line 2012: Doorbar et al., this is not the correct reference.

      Thank you for pointing this out (we assume the referee refers to line 104 and not line 2012). We noticed this error during revision. As it is difficult to find a specialized review on this topic, we now cite Ozbun and Campos, 2021, which states that PsVs are ‘structurally and immunologically indistinguishable from lesion- and tissue-derived HPVs.’

      Minor issues:

      (1) It is difficult to appreciate the ECM and cell surface binding pattern from the provided images, which do not even contain an entire cell. We need to see a few representative field views with the ECM delineated with laminin 332 staining, as HS antibodies stain both the ECM and cell surface.

      We now provide overview images in Supplementary Figure 4. The only experiment requiring a clear delineation between ECM and cell surface is the experiment of Figure 4. Here, we do not use the HS as a reference staining because it stains both the ECM and the cell surface.

      (2) For Figure 1E, the cells were only infected for 24 hours. The half-time for infectious internalization of HaCaT cells was shown to be 8 hours for cell-associated PsV and closer to 20 hours for PsV that was associated with the ECM prior to cell association (Becker et al., 2018). Why was such a short infection time chosen?

      During assay establishment it has been observed that after 24 h the luciferase activity is optimal.

      (3) Figure 5, the staining of uninfected cells +/- cyto treatment needs to be included.

      Now visible in new Figure 3.

      I am confused by lines 54-57. It seems as if the authors are claiming that HSPGs are not present on the ECM. This sentence, as written, is misleading.

      We agree, and state now on line 58 ‘Here, virions bind to the linear polysaccharide heparan sulfate (HS) that is present in the extracellular matrix (ECM) but as well on the plasma membrane surface. HS is attached to proteins forming so called heparan sulfate proteoglycans (HSPGs).’

      Reviewer #2 (Recommendations for the authors):

      There are further issues that are not pertaining to the study design that I find important.

      (1) It remains speculative whether the virions that are transferred from the ECM are actually structurally modified.

      The newly added Figure 2, showing that leupeptin blocks infection in our assay, suggests that virions indeed are primed.

      (2) The origin of HS correlated with virions on the cell body after transfer is also not clear: does the virus associate with cell surface HS, or does it bring HS from the ECM? Simply staining HS against N-sulfated moieties does not allow such conclusions.

      This issue has been already raised in the public review to which we replied above. In brief, we agree that the transient increase of the PCC between PsVs and HS in the cell body region can be also explained by PsVs coming from the ECM without HS and binding to cell surface HS, or from PsVs binding directly (not via the ECM) to cell surface HSPGs. However, there are two more arguments indicating that PsVs are coated with HS. Please see our detailed reply above.

      (3) Figure 1: There are few, if any, filopodia in untreated cells. It would be good to quantify their abundance to substantiate that resting HaCaT cells are indeed a good model for filopodial transport vs. membrane retraction / spreading. In HaCaT ECM, the virus also binds to laminin-332 for a good part. Would this not also confound the analysis?

      At first glance, the number of filopodia appears to be too low to account for such an efficient transport. However, please note that the formation of filopodia is very dynamic, and that they can form and disappear within minutes (see below). We also often observe many PsVs aligned at one filopodium. Moreover, not every cell periphery exhibits large accumulations of PsVs. Therefore, we believe it is in principle possible that filopodia are largely responsible for the transport. We cannot exclude that we overestimate the transport rate due to partial cell spreading after CytD removal. However, we consider this rather unlikely, as in Figure 2 we observe no increase in the PCC when leupeptin was present during the CytD incubation. Under these conditions, PsVs do not translocate but cells could spread, and this would increase the PCC between PsVs and F-actin if cells spread into the area of accumulated PsVs.

      We now state on line 304: ‘This suggests that the half-time of PsV translocation from the periphery to the cell body is about 15 min. In fact, the half-time maybe longer, as we cannot exclude that cell spreading after CytD removal contributes to less PsVs measured in the cell periphery.’ and on line 477 ‘As mentioned above, the half-time could be longer if cell spreading is in part responsible for the translocation of PsVs onto the cell body. However, we assume that this is rather unlikely, as cell spreading would increase the PCC between PsVs and F-actin under a condition where filopodia mediated transport is blocked but not cell spreading, which is not the case (Figure 2B and D, CytD/leupeptin).’

      (4) Figure 2: This would benefit from live cell analysis. There are considerable amounts of virions on the cell body, which partially contradicts statements from Figure 1.

      Does the referee refer to the images shown in Figure 4 (old Figure 2)? Please note that at CytD/0 min there are hardly any PsVs in the cell body region, the fluorescence (magenta LUT) is autofluorescence (this is explained in the results section). Only at later time points PsVs are in the cell body region.

      The fast transfer to the cell body after cyto D washout is based on the assumption that filopodia formation and transport along them (and not membrane extension) occur quickly. Is this reasonable?

      We are not experts on filopodia, but one finds references suggesting that they grow at rates of several µm per minute and have lifetimes between a few seconds and several minutes. Hence, within the 15 min we determine for the transport, cells may need a few minutes to recover from CytD, a few minutes to form filopodia that reach out into the ECM, and a few minutes for the transport itself. However, we agree that we cannot exclude that membrane extension contributes to the observed transport, although we consider this rather unlikely (see above).

      (5) Figure 3: The rationale of claiming the existence of 'endocytic structures' needs to be better explained and quantified in the according supplementary figure.

      We now state in the legend ‘We propose that the agglomerated CD151 maxima close to PsVs feature the characteristics of endocytic structures, as CD151 has been shown to co-internalize with PsVs (Scheffer et al. 2013), and as these structures invaginate into the cell, like PsV filled tubular organelles previously described by electron microscopy (Schelhaas et al. 2012).’ For a proper quantification of these highly variable structures a much larger sample would be required.

      The formation of virus-filled tubules upon cytoD treatment has been previously reported. Are these viruses that come from the cell body or from the ECM?

      With the new data and explanations that have been added to the manuscript, it should be clear that it is reasonable to assume that they come largely from the ECM.

      (6) Figure 4: How are the subcellular ROIs chosen? Is there not a bias by not studying a full cell?

      We now explain better how we chose cells for analysis. We state on line 138 ‘Instead, we focus on isolated HaCaT cells or cells at the periphery of cell patches. In these cells, we find more PsVs per cell than one would expect from the employed 50 viral genome equivalents (vge) per cell, as PsVs are unequally distributed between the cells. Moreover, these PsVs usually are not homogenously distributed around the cell but concentrate at one region. We investigate the translocation of PsVs from these regions, defining ROIs for analysis that cover PsVs at the periphery and the cell body (see Supplementary Figures 6A and 8A).’

      (7) Figure 5/6: The data needs a better analysis on correlation by using randomisation as explained above.

      Please see our reply to the same point of the public review raised by the same referee.

      (8) Figure 7: This model involves CD151 being a mediator in transfer, but this has not been functionally shown. There are HaCaT CD151 KO cells available (from the Sonnenberg lab), it would be good to use those to test the model and whether transfer indeed involves CD151.

      As already stated above, we are sorry for having raised the impression that PsVs bind directly to CD151. The model Figure has been modified. Please see our reply above.

      (9) The manuscript would benefit from a number of experiments addressing the most crucial issues:

      (a) As mentioned before, the use of blebbistatin, which blocks myosin II function and arrests actin retrograde flow within seconds of addition, would be a good inhibitor to control for transfer in at least some of the most crucial experiments.

      In Figure 8 we have tested blebbistatin. Please see our reply above.

      (b) Live cell analysis would allow for monitoring of whether membrane retraction upon cytoD treatment would have to be taken into account for the analysis of the data. The same is true for the cytoD washouts, upon which most cells exhibit pronounced membrane spreading. The latter is important to support filopodial transport rather than membrane ruffling and spreading, leading to the clearance of extracellular virions from the ECM.

      We agree that this would be desirable. As replied above, we now discuss the issue of possible membrane spreading and reason why we consider it as rather unlikely.

      (c) To rid oneself of the issue of plasma membrane-bound virions as a confounding factor, one could use cells treated by sodium chlorate, which leads to undersulfation of HS on the cell surface, and seed them onto ECM with functional HSPGs. This would then indeed establish that the HS and virus are transferred together.

      We agree that this would be a smart experiment. As the main focus of our study is not clarifying whether PsVs are coated with HS or not, we gave other experiments priority.

      (10) The manuscript is, while carefully and thoughtfully worded on the issue of microscopy analysis, for a good part, extrapolating too strongly from the authors' data and unsubstantiated assumptions to conclude on their model. It would be good if the authors would support their claims with previous or their own experimental work. Just two examples of several: the assumption that cell-bound virions are negligible should be substantiated, as the literature would indicate otherwise.

      We determined the PsV density in adhered, CytD treated cells, and find around 0.14 per µm² (Supplementary Figure 1B), which is 4 to 5-fold less when compared to the PsV density quantified in an area covering the cell body and the periphery (Figure 1B, see line 174 for PsVs/µm² values). Quantifying the PsV density only in the periphery would yield a severalfold larger difference. However, due to the limited resolution of the microscope we would strongly underestimate the PsV density in the accumulations. We prefer not to discuss this in detail, as exact numbers are difficult to obtain.

      Line 129: Cyto D should not inhibit the enzymes modifying HS or proteins (including virions). This is true, but cytoD may limit their secretion and abundance.

      We show in Figure 3 that CytD does not reduce HS staining, as would be expected if it limited HS secretion as suggested by the referee. This indicates that CytD is unlikely to limit secretion.

      We thank the referees and the reviewing editor for their helpful comments!

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      __Reviewer #1__

      *This study "Interpreting the Effects of DNA Polymerase Variants at the Structural Level" comprises an in-depth analysis of protein sequence variants in two DNA polymerase enzymes with particular emphasis on deducing the mechanistic impact in the context of cancer. The authors identify numerous variants for prioritisation in further studies, and showcase the effectiveness of integrating various data sources for inferring the mechanistic impact of variants.*

      *All the comments below are minor, I think the manuscript is exceptionally well written.*

      *> The main body of the manuscript has almost as much emphasis on usage of the MAVISp tool as analysis of the polymerase variants. I don't think this is an issue, as an illustrated example of proper usage is very handy. I do, however, think that the title and abstract should better reflect this emphasis. E.g. "Interpreting the Effects of DNA Polymerase Variants at the Structural Level with MAVISp". This would make the paper more discoverable to people interested in learning about the tool.*

      We have changed the manuscript title according to the reviewer’s suggestions, and the current title is “Interpreting the Effects of DNA Polymerase Variants at the Structural Level using MAVISp and molecular dynamics simulations.”


      *> Figure 1. I don't believe there is much value in showing the intersection between the datasets (especially since the in-silico saturation dataset intersects perfectly with all the others). As an alternative, I suggest a flow-chart or similar visual overview of the analysis pipeline.*


      We moved the former Figure 1 to SI. We decided to keep it at least in SI because it provides guidance on the number of variants relative to the total reported across the different disease-related datasets annotated with the MAVISp toolkit. On the other hand, the suggestion of a visual scheme for the pipeline followed in the analyses is a great idea. We have thus added Figure 1, which illustrates the pipeline workflows for analysis of known pathogenic variants and for discovery of VUS and other unknown variants, as suggested by the reviewer.

      *> Please note in the MAVISp dot-plot figure legends that the second key refers to the colour of the X-axis labels rather than the dots.*

      We have revised the code that produces the dot plot so that the second key is placed closer to the x-axis and is easier to read.

      Missing figure reference (Figure XXX) at the bottom of page 16

      We apologize for this mistake. Figures, contents, and the order have changed significantly to address all reviewers’ comments; this statement is no longer included. Also, we have carefully proofread the final version of the manuscript before resubmitting it.


      __Reviewer #2__


      This manuscript reports a comprehensive study of POLE and POLD1 annotated clinical variants using a recently developed framework, MAVISp, that leverages scores and classifications from evolutionary-based variant effect predictors. The resource can be useful for the community. However, I have a number of major concerns regarding the methodology and the presentation of the results.

      **On the choice of tools in MAVISp and interpretation of their outputs**

      *- Based on the ProteinGym benchmark: https://proteingym.org/benchmarks, GEMME outperforms EVE for predicting the pathogenicity of ClinVar mutations, with an AUC of 0.919 for GEMME compared to 0.914 for EVE. Thus, it is not clear to me why the authors chose to put more emphasis on EVE for predicting mutation pathogenicity. It seems that GEMME can better predict this property, without any adaptation or training on clinical labels.*


      We appreciate this comment, but we should not exclude EVE entirely from our data collection or from VEP coverage under MAVISp, based on a difference in AUC of 0.005. It was not our intention to place more emphasis on EVE predictions, and we have revised it accordingly. We would like to clarify the workflow we use for applications of the MAVISp framework in “discovery mode,” i.e., for variants not reported as pathogenic in ClinVar. This relies on AlphaMissense to prioritize the pathogenic variants and then retain further only the ones that also have an impact according to DeMaSk, which provides further indication for loss/gain-of-fitness. DeMaSk nicely fits the MAVISp framework, as it was trained on data from experimental deep mutational scans, which we generally import in the EXPERIMENTAL_DATA module. We have revised the text to make this clearer. GEMME and EVE (or REVEL) can be used for complementary analysis in the discovery workflow. Other users of MAVISp data might want to combine them with a different design, and they have access to all the original scores in the MAVISp database CSV file and the code for downstream analysis to do so. The choice for our MAVISp discovery workflow is mainly dictated by the fact that we have noticed we do not always have full coverage of all variants in many protein instances for EVE, GEMME, and REVEL. In particular, since the reviewer highlights GEMME over EVE, GEMME is currently unavailable for a few cases in the MAVISp database. This is because we need to rely on an external web server to collect the data, which slows down data collection on our end.

      Additionally, we have encountered instances where GEMME was unable to provide an output for inclusion in the MAVISp entries. When we designed the workflow for variant characterization in focused studies, we also made practical considerations. We are also exploring the possibility of using pre-calculated GEMME scores from https://datadryad.org/dataset/doi:10.5061/dryad.vdncjsz1s, but we have encountered some challenges that deserve further investigation. For example, MAVISp annotations rely on the canonical isoform as reported in UniProt, which can lead to mismatches with the pre-computed GEMME scores. So far, we have identified a couple of entries whose canonical isoforms no longer match the one in the pre-computed GEMME score dataset. Another limitation is the absence of the original MSA files in the dataset, which we would need for a more in-depth comparison with the ones we used for our calculations. We are also facing some challenges in reproducing the MSA output from the MMseqs2-based ColabFold protocol in this context, which need to be solved first. Overall, the dataset shows potential for integration into MAVISp, but we need to define the inclusion criteria and compare it with the existing results in more detail.

      Additionally, since the principle behind MAVISp is to provide a framework rooted in protein structure, AlphaMissense was the most reasonable choice for us as the primary indicator among the VEPs for our discovery workflow, and it has performed reasonably well in this case study and others.

      Of course, our discovery design is one of the many applications and designs that could be envisioned using the data provided and collected by MAVISp. We also include all raw scores in the database's final CSV files, allowing other end users to decide how to use them in their own computational design. The design choice we made for the discovery phase of focused studies, using MAVISp to identify variants of interest for further studies, has been applied in other publications (see https://elelab.gitbook.io/mavisp/overview/publications-that-used-mavisp-data) in some cases together with experiments. It is also a fair choice for the application, as the ultimate goal is to provide a catalog of variants for further studies that may have a potentially damaging impact, along with a corresponding structural mechanism.

      We have now revised the results section text where Table 1 is cited to clarify this. We also revised the terminology because we are using the VEPs' capability to predict damaging variants, rather than the pathogenic variants themselves. Experiments on disease models should validate our predictions before concluding whether a variant is pathogenic in a disease context, and we want to avoid misunderstandings among readers regarding our stance on this matter.

      - Which of the predictors, among AM, EVE, GEMME, and DeMaSk, provide a classification of variants and which ones provide continuous scores? This should be clarified in the text. If some predictors do not output a classification, then evaluating their performance on a classification task is unfair. The MAVISp framework sets thresholds on the predicted scores to perform the classification, and it is unclear from reading the manuscript whether these thresholds are optimal or whether using universal cutoff values is pertinent. For instance, for GEMME, a recent study shows that fitting a Gaussian mixture to the predicted score distribution yields higher accuracy than setting a universal threshold (https://doi.org/10.1101/2025.02.09.637326). Along this line, for predictors that do not provide a classification, I am not convinced of the benefit for the users of having access to only binary labels, instead of the continuous scores. The users currently do not have any idea of whether each variant is borderline (close to threshold) or confident (far from threshold).

      We agree with the reviewer; we were not sufficiently clear in the manuscript. We have now revised the first part of the results to clarify this and to explain how we use the MAVISp data for application to focused studies, where the goal is to identify the most interesting variants that are potentially damaging and have a linked structural mechanism. Of course, there are other applications for leveraging the data in the database. We do provide scores for the variants, not just classification labels, in the MAVISp csv file. They can be accessed, together with the full dataset, through the MAVISp website and reused for any application.

      Additionally, we used the scores in the revised manuscript for the VUS variant ranking (Figure 5), applying a strategy recently designed as an addition to the downstream analysis toolkit of MAVISp (https://github.com/ELELAB/MAVISp_downstream_analysis), thereby allowing the scores themselves to be taken into account. Also, in the final part of the manuscript, the VEP scores have been used to introduce the ACMG-like classification of the variants in response to reviewer 3 (Figure 9 and Tables S3-S4). We absolutely agree that it is informative to keep the continuous scores, and we have never overlooked this aspect. However, we also need a strategy with a simpler classification to highlight the most interesting variants among thousands or more to start an exploration. This is why we included the support with dotplots and lolliplots, for example. Our purpose here is to identify, among many cases, those with a potentially damaging signature (and thus we need a binary classification for simplicity). Next, we evaluate whether this signature entails a fitness effect (with DeMaSk), and finally, we retain only the cases for which we can identify a structural mechanism to study further.

      The thresholds we set as the default for data analysis of dotplots in GEMME and DeMaSk are discussed in __Supplementary Text S3__ of the original MAVISp article. In brief, we carried out an ROC analysis against the scores for known pathogenic and benign variants in ClinVar with review status higher than 2. For application purposes, one could design other strategies to analyze the MAVISp data too; it is not limited to the workflow we decided to set as the primary one for our focused studies, as already mentioned above.
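      As an illustration of how a cutoff can be derived from an ROC-style analysis, the following is a minimal sketch and not the procedure used in Supplementary Text S3. The response does not state how the cutoff is picked from the ROC curve, so the sketch assumes Youden's J statistic (sensitivity + specificity - 1) as one common choice. The function name, the toy scores, and the convention that higher scores are more damaging are all assumptions; for a predictor where lower scores are more damaging, the scores would need to be negated first.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising sensitivity + specificity - 1.

    scores: floats, assumed here to increase with predicted damage.
    labels: 1 for known pathogenic, 0 for known benign.
    A variant is called damaging when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one pathogenic and one benign variant")
    best_threshold, best_j = None, -1.0
    # Every observed score is a candidate cutoff.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Invented toy data, for illustration only.
toy_scores = [0.10, 0.20, 0.35, 0.40, 0.70, 0.80, 0.90]
toy_labels = [0, 0, 0, 1, 1, 0, 1]
threshold, j = youden_threshold(toy_scores, toy_labels)
```

      A gene-specific recalibration would apply the same function to the subset of ClinVar variants for a single protein, provided enough benign and pathogenic variants are available.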

      We have now also included a classification based on a Gaussian mixture model (GMM) applied to GEMME scores for POLE and POLD1, so that it can be evaluated against other designs for our proteins of interest (see Table 1 in the revised version). The Methods section has been revised to include this part, and the ProteoCast pre-print is cited as a reference. We have not yet officially included this classification in the MAVISp database because we must first follow internal protocols to meet the inclusion criteria for new methods or analyses. We will do so by performing a similar comparison on the entire MAVISp dataset, focusing on variants with high-quality ClinVar annotations, as we did to set the current thresholds for GEMME in Supplementary Table S3 of the original MAVISp article. We need to allocate time and resources to this pilot, which is scheduled for Q1 2026.

      **On the presentation and impact of the results**

      • While reading the manuscript, it is difficult to grasp the main messages. The text contains abundant discussion about the potential caveats of the framework, the care that should be taken in interpreting the results, and the dependency on the clinical context. Although these aspects are certainly important, this extensive discussion (spread throughout the manuscript) obscures the results. Moreover, the way variants are catalogued throughout the text makes it difficult to grasp key highlights. The reader is left unsure about whether the framework can actually help the clinical practitioners.

      We have revised the text to make it easier to read, including additional MD simulations of three variants of interest and more downstream analyses to clarify the mechanisms of action. We also added a recap of the most interesting variants and their associated mechanisms, along with the ranking of the variants using the different features available in the MAVISp csv file for the VUS. We hope that this makes it more accessible and valuable. In the original publication, Table 2 aimed to provide a summary of the interesting variants, and we have revised it now in light of the ranking results and the additional analyses that allow us to clarify the mechanisms of action further. We have also introduced __Figure 9 and Tables S3 and S4__, which present data on ACMG-like classification for VUS that can fall into the likely pathogenic or benign categories.

      • In many cases, the authors state that experimental validation is required to validate the results. Could they be more explicit on the experimental design and the expected outcome?

      We have added a section on the point above at pages 21 and 30, where, alongside the summary of mechanisms per variant, we propose the experimental readouts to use based on known MAVE assays or assays that could be designed.

      • AlphaMissense seems to tend to over-predict pathogenicity. Could the authors comment on that?

      We are unsure whether this comment relates to our specific case or to a general feature of AlphaMissense.

      In the latest iteration of our small benchmarking dataset for POLE and POLD1 (as shown in the paper), we achieve a sensitivity of 1 and a balanced specificity of 0.96 for AlphaMissense, which suggests that AlphaMissense does not substantially over-predict pathogenicity in these proteins and identifies true negative (i.e., non-pathogenic) mutations quite accurately. As performance was sufficient in our case, we deemed recalibrating the classification threshold for AlphaMissense unnecessary.
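      For readers less familiar with these metrics, the sketch below shows how sensitivity and specificity follow from a confusion matrix. It is only an illustration: the function name and the toy labels are assumptions and do not correspond to the POLE and POLD1 benchmarking set.

```python
def sensitivity_specificity(predicted_damaging, known_pathogenic):
    """Compute (sensitivity, specificity) from paired boolean labels."""
    tp = fn = tn = fp = 0
    for predicted, truth in zip(predicted_damaging, known_pathogenic):
        if truth and predicted:
            tp += 1  # pathogenic variant correctly flagged
        elif truth:
            fn += 1  # pathogenic variant missed
        elif predicted:
            fp += 1  # benign variant over-called as damaging
        else:
            tn += 1  # benign variant correctly left unflagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one pathogenic and one benign variant")
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy labels: 3 pathogenic variants followed by 5 benign ones.
truth = [True, True, True, False, False, False, False, False]
calls = [True, True, True, True, False, False, False, False]
sens, spec = sensitivity_specificity(calls, truth)
```

      A predictor that over-calls pathogenicity shows up as a drop in specificity while sensitivity stays high, which is why both values are reported together.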

      We are aware that this is not necessarily the case for every gene, e.g., it has been shown that AlphaMissense shows lower specificity in some cases (see e.g. 10.3389/fgene.2024.1487608, 10.1038/s41375-023-02116-3). This is also why we found it essential to evaluate its performance with its recommended classification on a gene-specific basis, as done here. In the future, we will keep a critical eye on our predictors to understand whether they are suitable for the specific case of study, or whether they require threshold recalibration or the use of a different predictor.

      **On specific variants**

      • The mention of H1066R, H1068, and D1068Y is very confusing. There seems to be a confusion between residue numbers and amino acid types.

      We have revised the text for typos and errors. This part of the text changed, so these specific variants are no longer mentioned.

      • A major limitation of the 3D modeling is the impossibility of including Zn2+ coordination by cysteine residues. This limitation holds for both POLE and POLD1. Could the authors comment on the implication of this limitation for interpreting the mechanistic impact of variants? In particular, there are several variants reported in the study that consist of a gain of cysteines. The authors discuss the potential impact of some of these mutations on the structural stability but not on Zn coordination or the formation of disulphide bridges.

      This is a great suggestion. We had long planned to include a module to tackle changes in cysteines. We have now used this occasion to include a new module that identifies mutations that 1) are likely to disrupt native disulfide bridges, annotated as damaging, or 2) could lead to de novo formation of disulfide bridges upon mutation of a residue to a cysteine, also annotated as damaging with respect to the original functionality. We also included a step that evaluates whether the protein target is eligible for the analysis based on its cellular localization, since in specific compartments (such as the nucleus) the redox conditions would not favour disulfide bridges. The module has been added to MAVISp, and we are collecting data with it for the existing entries in the database so that we can release them in one of the upcoming updates. More details are on the website in the Documentation section (https://services.healthtech.dtu.dk/services/MAVISp-1.0/). We could not apply the module to POLE and POLD1 since they are nuclear proteins, and it would not be meaningful to look into this structural aspect either in connection with loss of native cysteines or de novo disulfide bridge formation upon mutations that change a wild-type residue to a cysteine.

      We would like to clarify that the structures we use here have been modelled with zinc at the correct position, as this is a focused study rather than high-throughput data collection for the first inclusion in the MAVISp database. Only the first layer of high-throughput collection with MAVISp uses models without cofactors, unless the biocurator attempts to model them or we move on to collect further data for research studies (as done here). Prompted by this confusion, we have now added a field to the metadata of a MAVISp entry indicating the cofactor state. Nevertheless, the RaSP stability prediction does not account for the cofactor's presence, even when it is bound in the model. This is discussed in the Methods section. We thus did not further analyze the variants at sites directly coordinating the metal groups due to these limitations.

      • MAVISp does not identify any mechanistic effect for a substantial portion of variants labelled as pathogenic. Could the authors comment on this point?

      We are not sure how to interpret this question, as it can be read in two ways: the reviewer is asking either about the known pathogenic ClinVar variants without mechanistic indicators or, more generally, about the variants that we label “pathogenic” in discovery mode (which we more usually refer to as “damaging” in the dotplots) and for which we cannot associate a mechanism.

      Overall, as a general consideration, it would be challenging to envision a mechanism for each variant predicted to be functionally damaging. For example, in the case of POLE and POLD1, we still lack models of complexes that did not meet the quality-control and inclusion criteria for the binding-free-energy scheme used by the LOCAL INTERACTION module. Also, when it comes to effects on catalysis or to analyzing effects in more detail at the cofactor sites, we could miss effects that would require QM/MM calculations. Other points we have not yet covered include cases related to changes in protein abundance due to degron exposure for degradation, which is one of the mechanistic indicators we are currently developing. Moreover, we used only unbiased molecular simulations of the free protein, and we would need future studies with enhanced sampling approaches and longer timescales to better address conformational changes and changes in the population of different protein conformational states induced by the mutation (including DNA). This can be handled formally by the MAVISp framework using metadynamics approaches, but it would be outside the scope of this work and is a direction for future studies on a subset of variants to investigate in even greater detail.

      Furthermore, we do not yet cover PTMs other than phosphorylation. In any case, our scope is to use the platform to provide structure-based characterization of either known pathogenic variants or potentially damaging ones predicted by VEPs, and to focus on more detailed analyses of those. As we develop MAVISp further and design new modules, we will also be able to tackle other mechanistic aspects. This discussion, however, is more relevant to the MAVISp method paper itself.

      Moreover, none of the variants discussed are associated with allosteric effect. Is this expected?


      In general, allosteric mutations are rare. Nevertheless, in these case studies, the size of the proteins under investigation also poses some challenges for the underlying coarse-grained model used in the simple mode to generate the allosteric signalling map, as we have found it performs best on protein structures below 1000 residues.

      __Reviewer #3 (Evidence, reproducibility and clarity (Required)):__

      The manuscript utilized the MAVISp framework to characterize 64,429 missense variants (43,415 in POLE and 21,014 in POLD1) through computational saturation mutagenesis. The authors integrate protein stability predictions with pathogenicity predictors to provide mechanistic insights into DNA polymerase variants relevant to cancer predisposition and immunotherapy response. There are discussions of known PPAP-associated variants and somatic cancer mutations in the context of known data and some proposed variants of interest (which are not validated).

      Major comments:

      I was unaware of the MAVISp framework. It concerns me that, albeit this paper has a lot of technical details about the framework, it is not the paper about the framework. I did look into the paper https://www.biorxiv.org/content/10.1101/2022.10.22.513328v5, which keeps being updated (version five now) for three years, but I do not see a peer-reviewed version. It would be unfair of me to peer review the underlying framework of the work, but together with the previous comments, I am a bit concerned.

      We have intentionally left the MAVISp resource paper as a living pre-print until we have sufficient data in the database that could be useful to the rest of the community. We have been actively revising the manuscript, thanks to comments from users on previous versions, to ensure it provides a solid resource. Approximately one and a half years ago, we attempted a submission to a high-impact journal and even addressed the reviewers’ comments there. Still, we did not receive feedback for a long time, and ultimately, the manuscript was not sent to the reviewers again despite more than six months of work on our side. After that, we realized that we would benefit from collecting a larger dataset, and we invested time and effort in that and submitted again for review, this time through Review Commons in the summer of 2025. The paper has now been peer-reviewed by three reviewers through Review Commons. We submitted the revised version and response to reviewers, and it is now under revision with Protein Science. The reviewers’ comments and our responses can be found in the “Latest Refereed Preprints” on the Review Commons website, dated 17 October 2025.

      We would also like to clarify another point on this. In our experience, it is common practice to keep software on bioRxiv, even for a long time, and to bring it to a more complete form in parallel with the community already applying it. This allows broad feedback from peers. We had similar experiences with MoonlightR, where the first publications with applications within the TCGA-PanCancer papers came before the publication of the tool itself, and the same has been true for our main workflows, such as MutateX or RosettaDDGPrediction, which are widely used by the community. Finally, the MAVISp framework has already been used in different published peer-reviewed studies (since 2023), attesting to its integrity and potential. Here, the reviewer can read more about the studies that used MAVISp data or modules: https://elelab.gitbook.io/mavisp/overview/publications-that-used-mavisp-data

      For example, the authors are using AlphaFold models to predict DDG values. Delgado et al. (2025, Bioinformatics) explicitly tested FoldX on such models and concluded that "AlphaFold2 models are not suitable for point mutation ΔΔG estimation" after observing a correlation of 0.06 between experimental and calculated values. AlphaFold's own documentation states it "has not been validated for predicting the effect of mutations". Pak et al. (2023, PLOS ONE) showed correlation between AlphaFold confidence metrics and experimental ΔΔG of -0.17. Needless to say that these concerns seriously undermine the validity of a major part of the study.

      We appreciate the reviewer’s comments and would like to clarify a point regarding the MAVISp STABILITY module, which we believe may have been misunderstood. Based on the studies cited by the reviewer, which critique the use of AF-generated mutant structures for assessing stability effects, we understand that this assumption may have led to the concern.

      The STABILITY module utilises three in silico tools (FoldX, Rosetta, and RaSP) to assess changes in protein stability resulting from missense mutations. Importantly, the input to these assessments consists of AF models of the WT protein structures, not of AF-generated mutant structures. The mutants are generated using the FoldX and Rosetta protocols, along with estimates of the changes in free energy. For further details and clarification, we kindly refer the reviewer to the MAVISp original publication.

      Also, one should consider the goal of our use of free energy calculations: not to identify the exact ΔΔG values, but to correlate with data from in vitro biophysical experiments or from cellular experiments such as MAVE. We and other researchers have shown good agreement (see the case study on PTEN in the original MAVISp publication, https://pmc.ncbi.nlm.nih.gov/articles/PMC5980760/, https://pubmed.ncbi.nlm.nih.gov/28422960/, and 10.7554/eLife.49138). Also, before even designing the STABILITY module for MAVISp, we had verified that we can use WT structures from AlphaFold (upon proper trimming and quality control with PROCHECK) instead of experimental structures without compromising accuracy, in the publications of the two main protocols of the STABILITY module (MutateX and RosettaDDGPrediction) and in a case study on p53 (https://doi.org/10.1093/bib/bbac074, https://doi.org/10.1002/pro.4527). In the focused studies, we also carefully consider whether the prediction is at a site with a low pLDDT score or surrounded by other sites with a low pLDDT score before reaching any conclusions. The pLDDT score is reported in the MAVISp csv file exactly to be used for flagging variants or looking closer at them, as we discuss in this study (see, for example, Figure 2). Additionally, it should be noted that we employ a consensus approach across the two classes of methods in MAVISp to account for the limitations arising from their empirical energy functions or backbone stiffness. Furthermore, in the focused studies, we also collected molecular dynamics simulations for the ensemble mode and reassessed the stability on different conformations from the trajectory to compensate for the issues with backbone stiffness of the FoldX, RaSP, and Rosetta ΔΔG protocols.

      I have to add that this is also true for the technical choices: Several integrated predictors (DeMaSk, GEMME) are outperformed by newer methods according to benchmarking studies (https://www.embopress.org/doi/full/10.15252/msb.202211474). AlphaMissense, while state-of-the-art, shows substantial overcalling of pathogenic variants. Could ensemble meta-predictors (REVEL, BayesDel) improve accuracy?

      The MAVISp framework includes REVEL as one of the VEPs available for data analysis, so ensemble meta-predictors are already represented. This is explained in the original MAVISp paper. We were not aware of BayesDel, which we will consider for one of the next pilot projects to assess new tools for the framework (see more details below on how we generally proceed). Currently, we cannot use REVEL for all variants because we do not necessarily have genomic coordinates for them. We retrieve genomic-level variants corresponding to our protein variants from mutation databases, where available (e.g., ClinVar, COSMIC, or cBioPortal). However, as we strive to cover every possible mutation, several of the variants in MAVISp are not in these databases, which means we do not have the corresponding genomic variation for those, limiting our ability to annotate them with VEPs. In the future (see GitHub issue https://github.com/ELELAB/cancermuts/issues/235), we will revise the code to identify the genomic variants that could give rise to each protein mutation of interest, thereby increasing the coverage of VEP annotations.
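      The idea of identifying the genomic variants that could give rise to a protein mutation can be sketched at the codon level. This is a minimal illustration under stated assumptions and not the cancermuts implementation: it uses the standard genetic code, considers only single-nucleotide substitutions within one coding-strand codon, and ignores transcript coordinates, strand, and splicing. The function name and the example codon are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_nucleotide_routes(wt_codon, target_aa):
    """List single-base changes of wt_codon that encode target_aa.

    Returns tuples (position_in_codon, ref_base, alt_base, new_codon),
    with positions numbered 1 to 3 on the coding strand.
    """
    routes = []
    for pos, ref in enumerate(wt_codon):
        for alt in BASES:
            if alt == ref:
                continue
            new_codon = wt_codon[:pos] + alt + wt_codon[pos + 1:]
            if CODON_TABLE[new_codon] == target_aa:
                routes.append((pos + 1, ref, alt, new_codon))
    return routes


# Illustrative codon only: a phenylalanine codon mutated to serine.
phe_to_ser = single_nucleotide_routes("TTC", "S")
# Phenylalanine to tryptophan needs two base changes, so no route is found.
phe_to_trp = single_nucleotide_routes("TTC", "W")
```

      Protein mutations with no single-nucleotide route would need multi-nucleotide changes and would therefore remain without genomic-level VEP annotations under this simplified scheme.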

      We can see from the work cited by the reviewer that ESM-1v, EVE, and DeepSequence are among the top performers, whereas reviewer 2 cited another work in which GEMME outperforms EVE. We have been covering all of them, except ESM-1v, in our framework. We are planning to evaluate some of the new top-performing predictors, including ESM-1v, for inclusion in MAVISp in Q2 2026 (according to the protocol described later in this answer), which is why it is not available yet.

      In our discovery protocol (i.e., when we work on VUS or variants not classified in ClinVar), we generally use AlphaMissense as the first indicator of potentially damaging variants. EVE, REVEL, or GEMME could be used when AlphaMissense data are missing, or as a second layer of evidence when we want, for example, to select a smaller pool of variants for experimental validation in a protein target with too many uncharacterized variants and too many that pass the evaluation with our discovery workflow. Finally, we rely on DeMaSk, as it also provides information on possible loss- or gain-of-fitness signatures, to further filter the variants of interest for the search of mechanistic indicators. Since the MAVISp framework is modular, other users may want to use the data differently and design a different workflow. They have access to the data (scores and classifications) through the web portal. Combining AlphaMissense with DeMaSk adds a further filtering step and could mitigate the risk that AlphaMissense over-predicts pathogenicity.
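      The successive filters of the discovery protocol can be sketched as follows. This is a simplified illustration rather than the MAVISp downstream analysis code: the column names, the label values, and the toy rows are hypothetical, and the real MAVISp CSV headers may differ.

```python
# Hypothetical column names for structure-based mechanistic indicators.
MECHANISM_COLUMNS = ("stability", "local_interactions", "long_range", "functional_sites")


def discovery_filter(rows):
    """Keep variants that pass the three successive discovery filters."""
    selected = []
    for row in rows:
        # Step 1: AlphaMissense flags the variant as potentially damaging.
        if row.get("alphamissense") != "pathogenic":
            continue
        # Step 2: DeMaSk indicates a loss- or gain-of-fitness signature.
        if row.get("demask") not in ("loss_of_fitness", "gain_of_fitness"):
            continue
        # Step 3: at least one structural module reports a damaging effect.
        mechanisms = [c for c in MECHANISM_COLUMNS if row.get(c) == "damaging"]
        if mechanisms:
            selected.append((row["mutation"], mechanisms))
    return selected


# Invented toy rows, for illustration only.
toy_rows = [
    {"mutation": "A10V", "alphamissense": "pathogenic", "demask": "loss_of_fitness", "stability": "damaging"},
    {"mutation": "C20Y", "alphamissense": "pathogenic", "demask": "neutral", "stability": "damaging"},
    {"mutation": "E30K", "alphamissense": "benign", "demask": "loss_of_fitness", "long_range": "damaging"},
    {"mutation": "G40R", "alphamissense": "pathogenic", "demask": "gain_of_fitness"},
]
shortlist = discovery_filter(toy_rows)
```

      Swapping the first step for EVE, REVEL, or GEMME when AlphaMissense data are missing would only require changing the column tested in step 1.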

      In general, we work to keep MAVISp up-to-date, and we have developed a protocol for the inclusion of new methodologies in the available modules before generating and releasing data with new tools in the database. In particular, we perform comparative studies using data already available in the database to evaluate the performance of new approaches against that of the tools already included. Depending on the module, we use different gold standards, which we are also curating in parallel and which are chosen to be appropriate for that specific module. For example, if the question is to evaluate a VEP, we would compare it against known ClinVar variants with good review status. If the VEP performs better than the currently included ones, we can include it as an additional source of annotations and evaluate whether we could change the protocol for the discovery/characterization of variants. We operate similarly for the structural modules. For example, for stability, we are importing experimental data from MAVE assays on protein abundance and use them as a gold standard against which we evaluate new approaches and the current FoldX and Rosetta-based consensus for changes in folding free energies. If we find evidence that suggests switching to a new method or integrating it would be beneficial, we will do so as a result of these investigations. An example of our working mode for evaluating tools for inclusion in the framework is illustrated by how we handled the comparison between RaSP and Rosetta in the original MAVISp article (Supplementary file S2) before officially switching to RaSP for high-throughput data collection. We still maintain Rosetta, especially in focused studies, to validate further variants classified as uncertain.

      *Further, I found the web site of the framework, where I looked for the data on these models, rather user unfriendly. Selecting POLD1, POLD2, or POLE tells me I am viewing entries A2ML1, ABCB11, ABCB6 respectively, when I search for POL and then click: these are the first three entries of the table, not what I click on. Displaying the whole table and clicking on POLD1 gets me to POLD1. However, when I selected "Damaging mutations on structure" I get "Could not fetch protein structure model from the AlphaFold Protein Structure Database". Many other features are not working (Safari or Chrome, on a Mac). That is a concern for the usability of the dataset.*


      We have been able to reproduce the bugs identified by the reviewer and have fixed them. The second was connected to recent updates to the AlphaFold Protein Structure Database. We are not sure how to act on the “other features that are not working” because the comment lacks specifics. Still, we have worked to make the website more robust: the coauthors of this work and other colleagues in the MAVISp team have extensively tested it across different proteins and with various browsers and operating systems, and we have fixed all identified issues. We also have a GitHub repository where users can open issues to share problems they have been experiencing with the website, which we will fix as promptly as we can (https://www.github.com/ELELAB/MAVISp), as we do for any of the tools we develop and maintain. If the reviewer were to come across other specific problems with the website, we recommend (anonymously) opening issues on the MAVISp repository so that they can be described in more detail and dealt with appropriately.

      This comment seems more related to the MAVISp paper itself than to the POLE and POLD1 entries. We have been making several revisions to the web app to improve it over time. We are afraid that the reviewer consulted it during one of these changes, and we hope it works better now. For POLE and POLD1, the CSV files were, in any case, also available through the MAVISp website itself (https://services.healthtech.dtu.dk/services/MAVISp-1.0/), as well as in the OSF repository connected to this paper (https://osf.io/z8x4j/overview), in case the reader needed to consult them or as a reference for the analyses reported in this paper.

      Albeit this is a thorough analysis with the existing tools, and the authors make some sparse attempts to put the mutant classification in context with examples, the work stays descriptive for known effects in literature, or points out that e.g. "further functional and in vitro assays are required". The examples are not presented in a systematic way, or in an appealing manner. Thus, what this manuscript adds to the web site is unclear. It is a description of content, which could be at least more appealing if examples would be more clearly outlined in a conceptual framework, and illustrated more consistently. For example, I read in the middle of page 16 "One such example is the F931S (p.Phe931Ser) variant (Figure 5A)" and then I see "F931 forms contacts with D626, a critical residue for the coordination of Mg2+ which is essential for the correct orientation of the incoming nucleotide (Figure XXX)". Figure 5B is not XXX as this has just many mutations labeled. These issues are very discouraging. I would recommend putting much more effort into examples, putting them in clearer paragraphs, and describing results rather than the methodology. Doing both in an intermingled way clearly does not work for me.

      We have revised the storyline to make it more straightforward for the reader, focusing on the essential messages and avoiding excessive description in the results section, instead conveying the key points directly. We also included new simulation data on three variants and downstream analyses of other variants. We revised the section to focus less on methodologies and more on the actual biological results. We have also added a ranking approach for the VUS and an ACMG-like classification to facilitate the identification of the most important results.

      Additionally, we included a summary Table (Table 2) and Figure 9 that present the main findings on the VUS, and we discussed in the text the possible associated experimental validation.

      We also do not fully understand the reviewer’s comment “the work stays descriptive for known effects in literature”. We agree that we should make a better effort to write the results in a logical and easy-to-follow manner, without risking the reader getting lost in too many details, and with more dedicated subsections. However, the paper does not describe just known effects in the literature. In the previous version, we did have a section aimed at identifying mechanistic indicators for ClinVar-reported variants that are also (in some cases) functionally characterized. However, this is only the very first part of the results, and it still adds structure-based knowledge about these variants. After this, we also reported predicted results with mechanisms for VUS and variants in other databases. We took the opportunity in this revised version to elaborate more on the results of the variants reported in COSMIC and cBioPortal.

      We are afraid that we also do not fully understand the reviewer's comment that “Thus, what this manuscript adds to the website is unclear.” We have generated POLE and POLD1 data with the MAVISp toolkit in both ensemble and simple mode, and the whole pool of local interactions with other proteins and DNA, specifically for this publication. It should be acknowledged that we have generated new data in ensemble mode, which relies on all-atom microsecond molecular dynamics simulations, and additional modules for the simple mode, including calculations with the flexddg protocol of Rosetta, which is also computationally demanding, to provide a comprehensive overview of the effects of variants in POLE and POLD1. The two proteins were available in the database only in simple mode with the basic default modules, and the remaining data were collected for this research article. This can also be inferred from the references in the CSV file of the ensemble mode, which refer only to the DOI of the pre-print of this article. This entails a substantial effort in computing and analysis. The website is the repository for data that researchers collect using the MAVISp protocols or modules; in our opinion, it cannot replace a research project. We designed the database to store the data generated by the framework for others to consult and use for various purposes (e.g., biological studies, preparing datasets for benchmarking approaches against existing ones, or using features for machine learning applications). The entry point in the database is the simple mode, along with some compulsory modules (VEPs, STABILITY, PTM, EFOLDMINE, SASA). After this initial entry point, a biocurator or a team of researchers can decide to expand data coverage by moving into the other modules. Still, at some point, one would need to design focused studies to have a comprehensive overview of the effects on specific targets, as we did here, or, for example, in the publication https://doi.org/10.1016/j.bbadis.2024.167260.

      Furthermore, there are analyses here, especially in the simulations, that are not directly available from consulting the database; in these cases, one needs to use other resources beyond MAVISp to investigate further the mechanisms underlying the predicted mechanistic indicators. We also included simulations of mutant variants to validate the hypothesis further. Another example is the analysis of effects on splice sites, which is not covered by a structure-based framework such as MAVISp but is still an essential aspect of analysing variant effects.

      Will the community find this analysis useful?

      The analysis provided here will be helpful, especially for researchers interested in experimental studies of these enzymes, because the study gives them an extensive portfolio of structural data to consult, including a list of variants ranked by class of effect. We originally started designing MAVISp because we realized it was needed by our experimental collaborators, both in cellular biology and in more clinical research, whenever they needed to predict or simulate variants, and we expanded the concept into a robust, versatile framework for broader use. Especially for those genes where extensive MAVE data are not available (as in this case), having a set of variants to test experimentally is crucial support, as it provides the potential mechanism behind each predicted damaging variant.

      How many ClinVar VUS could be reclassified using MAVISp data under current ACMG/AMP guidelines?


      The ACMG/AMP variant classification guidelines, to the best of our knowledge, include computational evidence (PP3/BP4) and well-established functional studies (PS3/BS3). Because MAVISp provides multi-level mechanistic predictions derived from structural modelling, these data formally fall within the PP3/BP4 computational category. They cannot be used to reclassify ClinVar VUS independently under ACMG/AMP rules. Reclassification is not really the goal of MAVISp, which is to provide a structure-based framework for investigating potentially damaging variants predicted by VEPs. However, the reviewer's suggestion is something we had wanted to explore with MAVISp data in general but had not managed to for lack of time. We checked the requirements for PP3, BP4, and PM1 and developed a classifier for VUS reported in ClinVar, using MAVISp features in accordance with the ACMG/AMP guidelines. Using ClinVar pathogenic and benign variants with at least a review status of 1 for calibration, we obtained thresholds for all MAVISp-supported VEPs (REVEL, AlphaMissense, EVE, GEMME, and DeMaSk). These thresholds were then applied to all ClinVar VUS to determine PP3 (pathogenic-supporting) and BP4 (benign-supporting) evidence. In parallel, we constructed a PM1-like mechanistic evidence category that integrates MAVISp structural stability, protein–protein interactions, DNA interactions, long-range allosteric paths, functional sites, and PTM-mediated regulatory effects. Variants classified as damaging in MAVISp according to such criteria were assigned PM1-like support. These evidence tags provide mechanistic insight to support VUS classification for polymerase proofreading genes. The workflow and complete annotated VUS table are now included in the revised manuscript and in the OSF repository. Although these findings cannot formally reclassify variants under ACMG/AMP criteria, they provide prioritization for PS3/BS3 experimental validation and highlight variants that are likely to be reclassified once supporting functional evidence becomes available.

      How do MAVISp predictions meet calibrated thresholds, as in https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-023-01234-y, for the exonuclease domain of POLE and POLD1?


      Mur et al. (Genome Medicine 2023) restricted their ACMG/AMP recommendations to the exonuclease domain (ED) because (i) nearly all known pathogenic germline variants in POLE/POLD1 cluster within the ED, (ii) the ED has a well-characterised structure–function architecture, and (iii) sufficient pathogenic and benign variants exist only within the ED to support empirical calibration. To mirror this approach, we performed the calibration workflow exclusively on ED variants (POLE residues 268–471; POLD1 residues 304–533). For these ED-restricted variants, we recalibrated all MAVISp-derived computational predictors (REVEL, AlphaMissense, EVE, GEMME, DeMaSk) using ClinVar P/LP and B/LB variants. We applied the resulting POLE/POLD1-specific thresholds to all ClinVar VUS within the ED. We also applied our PM1-like structural/functional evidence exclusively to ED variants. The results of this ED-specific analysis are now reported in the revised manuscript (Figure 9; Supplementary Tables S3 and S4), as also explained in the response to the previous question. This ensures that MAVISp predictions are applied in a manner that is consistent with the principles of Mur et al. and ACMG/AMP variant interpretation.
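
The ED restriction described above amounts to a residue-range filter. The following is a minimal Python sketch, not the authors' workflow: the function name, the `(gene, position)` input format, and the example variants are hypothetical, and treating the boundaries as inclusive is an assumption. Only the residue ranges themselves come from the response.

```python
# Exonuclease-domain (ED) boundaries quoted in the response.
# Assumption: both boundaries are inclusive.
ED_RANGES = {"POLE": (268, 471), "POLD1": (304, 533)}


def in_exonuclease_domain(gene: str, position: int) -> bool:
    """Return True if the residue position lies inside the ED of the gene."""
    if gene not in ED_RANGES:
        return False
    start, end = ED_RANGES[gene]
    return start <= position <= end


# Hypothetical example variants as (gene, residue position) pairs.
variants = [("POLE", 286), ("POLE", 900), ("POLD1", 304), ("POLD1", 534)]

# Keep only the variants that fall inside the exonuclease domain.
ed_variants = [v for v in variants if in_exonuclease_domain(*v)]
```

Any calibration or thresholding step would then operate on `ed_variants` only.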

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      MPRAs are a high-throughput and powerful tool for assaying the regulatory potential of genomic sequences. However, linking MPRA-nominated regulatory sequences to their endogenous target genes and identifying the more specific functional regions within these sequences can be challenging. MPRAs that tile a genomic region, and saturation mutagenesis-based MPRAs, can help to address these challenges. In this work, Tulloch et al. describe a streamlined MPRA system for the identification and investigation of the regulatory elements surrounding a gene of interest with high resolution. The use of BACs covering a locus of interest to generate MPRA libraries allows for an unbiased and high-coverage assessment of a particular region. Follow-up degenerate MPRAs, where each nucleotide in the nominated sequences is systematically mutated, can then point to key motifs driving their regulatory activity. The authors present this MPRA platform as straightforward, easily customizable, and less time- and resource-intensive than traditional MPRA designs. They demonstrate the utility of their design in the context of the developing mouse retina, where they first use the LS-MPRA to identify active regulatory elements for select retinal genes, followed by d-MPRA, which allowed them to dissect the functional regions within those elements and nominate important regulatory motifs. These assays were able to recapitulate some previously known cis-regulatory modules (CRMs), as well as identify some new potential regulatory regions. Follow-up experiments assessing co-localization of the gene of interest with the CRM-linked GFP reporter in the target cells, and CUT&RUN assays to confirm transcription factor binding to nominated motifs, provided support linking these CRMs to the genes of interest. Overall, this method appears flexible and could be an easy-to-implement tool for other investigators aiming to study their locus of interest with high resolution.

      Strengths:

      (1) The method of fragmenting BACs allows for high, overlapping coverage of the region of interest.

      (2) The d-MPRA method was an efficient way to identify key functional transcription factor motifs and nominate specific transcription factor-driven regulatory pathways that could be studied further.

      (3) Additional assays like co-expression analyses using the endogenous gene promoter, and use of the Notch inhibitor in the case of Olig2, helped correlate the activity of the CRMs to the expression of the gene of interest, and distinguish false positives from the initial MPRA.

      (4) The use of these assays across different time points, tissues, and even species demonstrated that they can be used across many contexts to identify both common and divergent regulatory mechanisms for the same gene.

      Weaknesses:

      The LS-MPRA assay most strongly identified promoters, which are not usually novel regulatory elements you would try to discover, and the signal-to-noise ratio for more TSS-distal, non-promoter regulatory elements was usually low, making it difficult to discriminate lower activity CRMs, like enhancers, from the background. For example, NR2 and NR3 in Figure 3 have very minimal activity peaks (NR3 seems non-existent). The ex vivo data in Figure 2 are similarly noisy. Is there a particular metric or calculation that was or could be used to quantitatively or statistically call a peak above the background? The authors mention in the discussion some adjustments that could reduce the noise, such as increased sequencing depth, which I think is needed to make these initial LS-MPRA results and the benchmarking of this assay more convincing and impactful.

      Much of the statistical and quantitative data asked for by the Reviewers have been provided in the Revision. However, it is important to note that the types of statistics using peak callers asked for regarding candidate choice will be of limited value. If one is testing a library in a single cell type in vitro, and/or running genome-wide assays, these statistics could aid in the choice of candidates. However, here we are electroporating a complex and dynamic set of cells, with each cell type present at what can be a very different frequency (e.g. Olig2-expressing cells are <2.4% of cells). This fact alone will give different apparent signal-to-noise values. In addition, at least for Olig2 and Ngn2, their expression is very transient, suggesting dynamic regulation by what is likely multiple positive and negative CRMs. An additional confound is that the level of expression of each gene that one might test is variable. All of these variables render a statistical prediction of candidates less valuable than one might hope, and might lead one to miss those CRMs of interest, particularly those in a small subset of cells. Instead, we suggest that one use one’s own level of interest and knowledge in choosing CRM candidates. We provide several examples of experimental, rather than purely statistical, approaches that might help in one’s choice of candidates. We used a functional read-out of CRM activity (Notch perturbation), carried out in the context of the entire LS-MPRA library, as one method. Co-expression in single cells of candidate regulators identified by the d-MPRA is another. One can of course use chromatin structure and sequence conservation, as used in many studies of regulatory regions, as other ways to narrow down candidates. The d-MPRA predictions also can be viewed in light of previous genetic studies, i.e. mutations in TFs that affect the cell type of interest or the regulation of the gene of interest, as we were able to do here for CRMs predicted to be regulated by Otx2.

      Reviewer #2 (Public review):

      Summary:

      In this study, Tulloch et al. developed two modified massively parallel reporter assays (MPRAs) and applied them to identify cis-regulatory modules (CRMs; genomic regions that activate gene expression) controlling retinal gene expression. These CRMs usually function at specific developmental stages and in distinct cell types to orchestrate retinal development. Studying them provides insights into how retinal progenitor cells give rise to various retinal cell types.

      The first assay, named locus-specific MPRA (LS-MPRA), tests all genomic regions within 150-300 kb of the gene of interest, rather than relying on previously predicted candidate regulatory elements. This approach reduces potential bias introduced during candidate selection, lowers the cost of synthesizing a library of candidate sequences, and simplifies library preparation. The LS-MPRA libraries were electroporated into mouse retinas in vivo or ex vivo. To benchmark the method, the authors first applied LS-MPRA near stably expressed retinal genes (e.g., Rho, Cabp5, Grm6, and Vsx2), and successfully identified both known and novel CRMs. They then used LS-MPRA to identify CRMs in embryonic mouse retinas, near Olig2 and Ngn2, genes expressed in subsets of retinal progenitor cells. Similar experiments were conducted in chick retinas and postnatal mouse retinas, revealing some CRMs with conserved activity across species and developmental stages.

      Although the study identified CRMs with robust reporter activity in Olig2+ or Ngn2+ cells, the data do not provide sufficient evidence to support the claims that these CRMs regulate Olig2 or Ngn2, rather than other nearby genes, in a cell-type-specific manner. For example, the authors propose that three regions (NR1/2/3) regulate Olig2 specifically in retinal progenitor cells based on: (1) the three regions are close to Olig2, (2) increased Olig2 expression and NR1/2/3 activity upon Notch inhibition, and (3) reporter activity observed in Olig2+ cells (though also present in many Olig2- cells). While these are promising findings, they do not directly support the claims.

      The second assay, called degenerate MPRA (d-MPRA), introduces random point mutations into CRMs via error-prone PCR to assess the impact of sequence variations on regulatory activity. This approach was used on NR1/2/3 to identify mutations that alter CRM activity, potentially by influencing transcription factor binding. The authors inferred candidate transcription factors, such as Mybl1 and Otx2, through motif analysis, co-expression with Olig2 (based on single-cell RNA-seq), and CUT&RUN profiling. While some transcription factors identified in this way overlapped with the d-MPRA results, others did not. This raises questions about how well d-MPRA complements other methods for identifying transcriptional regulators.

      Strengths:

      (1) The study introduces two technically robust MPRA protocols that offer advantages over standard methods, such as avoiding reliance on predefined candidate regions, reducing cost and labor, and minimizing selection bias.

      (2) The identified regulatory elements and transcription factors contribute to our understanding of gene regulation in retinal development and may have translational potential for cell-type-specific gene delivery into developing retinas.

      Weaknesses:

      (1) The claims for gene-specific and cell type-specific CRMs would benefit from further validation using complementary approaches, such as CRISPR interference or prime editing.

      The methods that we developed were meant to provide candidates for regulatory elements for a gene of interest. These candidates could be used to further understand the regulation of a gene, a complex and difficult task, especially for dynamically regulated genes in the context of development. These candidates could also, or instead, be used to drive gene expression specifically in a target cell of interest for applications such as gene therapy or perturbations that need this type of specificity. In the first case, to use the candidates to understand the regulation of a gene, one would need to validate the candidates using the types of methods typically employed for this purpose, most rigorously in the in vivo genomic context. We did not pursue this level of validation as it would encompass a great deal of work outside the scope of the current study. However, by initially testing loci which have been studied by several groups (as cited in the manuscript, Rho, Grm6, Vsx2, and Cabp5), we were able to show that LS-MPRA can identify known CRMs. In the cases of Rho and Vsx2, previous data have shown the CRMs to be relevant in the genomic context in vivo. In addition, two Vsx2 CRMs identified by LS-MPRA are located at -37 kb and -17 kb, and the Grm6 CRM identified by LS-MPRA is at -8 kb. These are the same CRM locations identified previously using classical methods. These data show that the method is capable of identifying distal elements. When one has only one or a few loci of interest, i.e. one does not need to use genome-wide approaches, LS-MPRA is accurate enough to be worth the relatively small effort to identify potential CRMs, even those at some distance from the TSS. However, it is apparent that our methods are not perfect and that the LS-MPRA does not pick up all CRMs. We do not know of a method that has been shown to do so.

      Reviewer #3 (Public review):

      Summary:

      Use of reporter assays to understand the regulatory mechanisms controlling gene expression moves beyond simple correlations of cis-regulatory sequence accessibility, evolutionary sequence conservation, and epigenetic status with gene expression, instead quantifying regulatory sequence activity for individual elements. Tulloch et al. provide a systematic characterization of two new reporter assay techniques (LS-MPRA and d-MPRA) to comprehensively identify cis-regulatory sequences contained within genomic loci of interest during retinal development. The authors then apply LS-MPRA and d-MPRA to identify putative cis-regulatory sequences controlling Olig2 and Ngn2 expression, including potential regulatory motifs that known retinal transcription factors may bind. Transcription factor binding to regulatory sequences is then assessed via CUT&RUN. The broader utility of the techniques is then highlighted by performing the assays across development, across species, and across tissues.

      Strengths:

      (1) The authors validate the reporter assays on retinal loci for which the regulatory sequences are known (Rho, Vsx2, Grm6, Cabp5), mostly confirming known regulatory sequence activity but highlighting either limitations of the current technology or discrepancies between previous reporter assays and known biology. The techniques are then applied to loci of interest (Olig2 and Ngn2) to better understand the regulatory sequences driving expression of these transcription factors across retinal development within subsets of retinal progenitor cells, identifying novel regulatory sequences through comprehensive profiling of the region.

      (2) LS-MPRA provides broad coverage of loci of interest.

      (3) d-MPRA identifies sequence features that are important for cis-regulatory sequence activity.

      (4) The authors take into account transcript and protein stability when determining the correlation of putative enhancer sequence activity with target gene expression.

      Weaknesses:

      (1) In its current form, many important controls that are standard for other MPRA experiments are not shown or not performed, limiting interpretation of the utility of the techniques. This includes limited controls for basal-promoter activity, limited information about sequence saturation and reproducibility of individual fragments across different barcode sequences, limitations in cloning and assay delivery, and sequencing requirements. Additional quantitative metrics, including locus coverage and number of barcodes/fragments, would be beneficial throughout the manuscript.

      We thank the reviewer for these comments and have provided detailed responses to the additional analyses in the subsequent Recommendations section.

      (2) There are no statistical metrics for calling a region/sequence 'active'. This is especially important given that NR3 for Olig2 seems to have a small 'peak' and has non-significant activity in Figure 4.

      See comments about peak calling in our response to Reviewer #1.

      (3) The authors present correlational data for identified cis-regulatory sequences with target gene expression. Additionally, the significance of transcription factor binding to the putative regulatory sequences is not currently tested, only correlated based on previous single-cell RNA-sequencing data. While putative regulatory sequences with potential mechanisms of regulation are identified/proposed, the lack of validation (and discrepancies with previous literature) makes it hard to decipher the utility of the techniques.

      See comments about further validation in our response to Reviewer #2.

      (4) While the interpretations that Olig2 mRNA/protein expression is dynamically regulated improved the proportions of cells that co-expressed CRM-regulated GFP and Olig2, alternate explanations (some noted) are just as likely. First, the electroporation isn't specific to Olig2+ progenitors. Also, the tested, short CRM fragments may have activating signals outside of Olig2 neurogenic cells because chromatin conformation, histone modifications, and DNA methylation are not present on plasmids to precisely control plasmid activity. Alternatively, repressive elements that control Olig2 expression are not contained in the reporter vectors.

      The electroporation of Olig2-negative and Olig2-positive cells is an excellent way to determine if a CRM is active in all cells, or only a specific subset, and we therefore consider this the best way to answer the question of specificity. We agree that we were unable to show that all CRM-active cells were indeed Olig2-expressing cells. As noted by the Reviewer, we went to some lengths to quantify RNA and protein co-expression, including of endogenous Olig2 protein and RNA. Even with the endogenous RNA and protein, there was a mismatch wherein one infrequently saw the two together in the same cell, which could be predicted from the short half-lives of these molecules. Regarding chromatin, etc., we are intrigued by the proper regulation that we have observed for CRMs that we have previously discovered by plasmid electroporation (e.g. Kim et al. 2008, Matsuda and Cepko 2004, Wang et al. 2014, Emerson et al. 2013). It is indeed interesting that plasmids can recapitulate proper regulation without the proper genomic context or chromatin modifications. We have expanded our treatment of these points in the Discussion.

      (5) It is unclear as to why the d-MPRA uses a different barcoding strategy, placing a second copy of the cis-regulatory sequence in the 3' UTR. As acknowledged by the author, this will change the transcript stability by changing the 3' UTR sequence. Because of this, comparisons of sequence activity between the LS-MPRA and d-MPRA should not be performed as the experiments are not equivalent.

      We had provided a rationale for the different strategies of barcoding in the original submission, and believe it is at the discretion of the experimenter to utilize either strategy for their specific purposes. We agree that comparing activity between different techniques would not be appropriate. The analysis of mutated CRMs using d-MPRA does not utilize data from the LS-MPRA, but is an analysis of relative activity among all mutated d-MPRA constructs.

      (6) Furthermore, details of the mutational burden in d-MPRA experiments are not provided, limiting the interpretations of these results.

      We have provided detailed responses to the additional analyses in the subsequent Recommendations section and included details of the mutational burden in Supplemental Document A.

      (7) Many figures are IGV screenshots that suffer from low resolution. Many figures could be consolidated.

      We have increased the resolution of all IGV genome tracks, but believe the content within all figures remains appropriate.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Suggestions for improving the clarity of the results in the figures:

      (1) The pie charts used to show the percentage of overlapping cells in the colocalization analyses were not especially intuitive to read, and although the percentages and any statistical significance were often written in the text, it would've been helpful to have them written in the figures. I would suggest displaying the results in stacked bar plots, possibly like the one shown in Figure 6A, to demonstrate the data more clearly.

      We thank the reviewer for the suggestions. Though adding the percentages directly to the pie charts would make the relevant panels too confusing to interpret, we added supplemental tables (Tables S5-S9) with the percentages displayed in all pie charts for readers interested in the precise quantifications.

      (2) In the scRNA-seq UMAPs showing co-expression of Olig2 with the TFs of interest, it is very hard to see the cells that co-express. I would recommend either having a window zoomed in on the Olig2-expressing cell population to be able to see the co-expression more clearly visually, and/or including a graph demonstrating the percentages of co-expressing cells. These numbers were written in the text, but would be useful to see in the figure.

      The resolution of the scRNA-Seq plot has been improved for the visualization of co-expressing cells, which were also brought forward in all UMAP plots to improve clarity. Because of the higher quality images, insets should no longer be necessary. We have also included percentages of co-expression in the figures (Figs. 8 and 8S) and thank the reviewer for the suggestion.

      Other minor suggestions/corrections:

      (3) Figures 6B and 10S are missing the overlap quantification (in bar or pie charts) like in the other figures.

      The quantification for the image in 6B (i.e., GFP fluorescence and GFP RNA) is displayed in 6D for the four Olig2 CRM plasmid constructs. In Fig. 10S, the experiments in early chick ventral neural tube delivered constructs to a very limited number of cells, and quantification of cells would not necessarily represent an accurate number of cells with CRM activity. We therefore decided to show only representative images of CRM activity in this population of cells rather than present a biased count or increase the number of experiments/samples to obtain a robust quantification.

      (4) On the second-to-last line of page 10, in the sentence "The d-MPRA approach provided a robust, high resolution method for functionally relevant TF binding sites....", I think you're missing a word between "for" and "functionally". For example, it might be "for identifying..." or "for nominating...".

      We have revised the sentence accordingly.

      Reviewer #2 (Recommendations for the authors):

      Minor suggestions:

      (1) Please indicate which mouse reference genome (e.g., mm10) was used in plots such as Figure 2.

      We have added text to the relevant sections in the Results (the reference genome was already mentioned in Methods).

      (2) In Figures 2 and 2S, the CRMs discussed in the text are not labeled or highlighted, making it unclear which regions are being referenced.

      We have labeled peaks with Roman numerals in the figures, legends, and text for clarity and thank the reviewer for the suggestion.

      (3) Consider listing the genomic coordinates for the CRMs mentioned in the text, as this information would be especially useful for readers interested in exploring these regions further.

      This information was included in Table 2S in the original submission, with all relevant coordinates provided therein.

      (4) The d-MPRA plots (e.g., Figure 7C-E) do not clearly show the effects of different nucleotide substitutions. A more informative visualization style can be found in Kircher et al (PMID: 31395865, Fig. 1D) or Deng et al (PMID: 38781390, Fig. 5F).

      Showing the precise nucleotide substitutions would be informative for visualizing the effects of specific changes. However, we were more interested in how any nucleotide substitution influenced the CRM activity, in order to home in on relevant TFBS. We therefore believe the current visualization is the most appropriate to accomplish this. However, for some types of future applications, a more informative visualization as noted would be a valuable addition.

      (5) It would be extremely helpful to the community if the LS-MPRA data were uploaded to the UCSC genome browser and made accessible via a link.

      We have uploaded all LS-MPRA genome tracks to a Track Hub in the UCSC genome browser and provided the appropriate link to access the Hub (https://github.com/cattapre/ALAS00) in the methods section.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors should address the following metrics to showcase the utility of the techniques:

      We thank the reviewer for requesting the detailed metrics outlined below. We have addressed all inquiries and included the majority of metrics in the resubmission.

      (a) Library size

      This should be shown for each library that is generated. It is acknowledged that the complete size of the library is limited by sequencing, and the comprehensiveness of the library will change every time the library is re-prepped. However, metrics of this are not currently provided in a robust manner for each library. "Libraries of at least 7x10^6 and as many as 9x10^7 fragments are made" - vague - how was library complexity established since this seems to be an estimation, how many reads were utilized to estimate library complexity?

      We created a new supplemental table (Table S3) that displays the complexity based on sequencing rather than the estimated complexity based on the serial dilutions prior to 3D culture (which was used for the estimates listed in the results). We updated the complexity range in the text as well and thank the reviewer for the suggestion.

      Does library size scale proportionally to the BACs of different sizes?

      The fragmentation of different BACs with differing sizes does not necessarily alter the size of the library. Library size is primarily determined by the library creation pipeline, with the size selection step of the fragmented BAC and the cloning step that inserts adapter-ligated fragments into the barcoded expression vector being the primary determinants of complexity of plasmid libraries.

      (b) Sequence saturation

      Can the authors please provide evidence that the libraries have been sequenced to saturation or estimates of the degree of under-sequencing? How many reads does it take to discover a new barcode associated with a new regulatory sequence?

      We have provided library characteristics for this in Table S3 and have also generated Sequence Saturation Curves for each association library in Supplemental Document A.
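
One common way to build a saturation curve of this kind is to subsample the sequenced reads at increasing depths and count the distinct barcodes recovered at each depth. The sketch below is illustrative only and is not taken from the authors' pipeline: it assumes reads are available as a list of barcode strings, and the function name, depths, and toy data are hypothetical.

```python
import random


def saturation_curve(barcodes, depths, seed=0):
    """Count distinct barcodes seen in nested random subsamples of the reads.

    Returns a list of (depth, number_of_distinct_barcodes) tuples.
    """
    rng = random.Random(seed)
    shuffled = list(barcodes)
    rng.shuffle(shuffled)  # one shuffle, so the subsamples are nested prefixes
    curve = []
    for depth in depths:
        depth = min(depth, len(shuffled))  # cannot sample more reads than exist
        curve.append((depth, len(set(shuffled[:depth]))))
    return curve


# Toy data: eight reads covering four distinct barcodes.
reads = ["AAAC", "AAAC", "GGTA", "GGTA", "GGTA", "TTCG", "CAGT", "AAAC"]
curve = saturation_curve(reads, depths=[2, 4, 8])
```

A curve that flattens as depth increases suggests the library is close to saturation; a curve that is still rising at full depth suggests under-sequencing.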

      (c) Barcode saturation

      How many barcodes are present for each fragment in the libraries? Are most fragments only covered by 1 barcode? The barcoding strategy doesn't prevent the same barcode from being assigned to multiple different fragments, as barcodes are random. What is the incidence of barcode collisions?

      We have provided library characteristics for this in Table S3 and have also generated Barcode Saturation Curves for each association library in Supplemental Document A.

      Additionally, we tested whether the omission of barcode collisions would affect the output of our LS-MPRA. We reanalyzed one barcode abundance library (one replicate following 12h Notch inhibitor) and filtered the barcodes so that only unique barcodes were analyzed. We were able to replicate all previously identified peaks. Though it is not necessary to filter out barcode collisions, there may be an improvement in signal-to-noise if the sequencing depth of libraries was sufficient (see Supplemental Document B).
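
The collision filter described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes the association library is available as `(barcode, fragment)` pairs, and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def unique_barcodes(associations):
    """Keep only barcodes that map to exactly one fragment.

    `associations` is an iterable of (barcode, fragment_id) pairs.
    Returns a dict mapping each collision-free barcode to its fragment.
    """
    fragments_by_barcode = defaultdict(set)
    for barcode, fragment in associations:
        fragments_by_barcode[barcode].add(fragment)
    return {
        barcode: next(iter(fragments))
        for barcode, fragments in fragments_by_barcode.items()
        if len(fragments) == 1  # more than one fragment means a collision
    }


# Toy association data: barcode GGTA collides (it tags frag2 and frag3).
associations = [
    ("AAAC", "frag1"),
    ("AAAC", "frag1"),
    ("GGTA", "frag2"),
    ("GGTA", "frag3"),
    ("TTCG", "frag3"),
]
kept = unique_barcodes(associations)
```

Barcode abundance counts would then be restricted to the keys of `kept` before activity is computed.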

      (d) Normalization

      As performed, fragment activity is normalized by RNA expression compared to the presence of fragments in the library. While this is done for small libraries, for large libraries, this may not be appropriate. For large libraries, every sequence in the library will not be delivered to each cell, and many fragments contained in the library may not be electroporated at all. Ideally, the authors would have sequenced both the RNA and DNA from the electroporations to i) identify the fragment distribution of the library that was successfully electroporated and ii) provide an internal normalization factor across replicate samples. This is especially important if the libraries were ever re-prepped, as the jack-potting or asymmetries in fragment recovery can occur every time the library is re-derived.

      We agree with the reviewer’s comments about the variability in fragments delivered experimentally, though we also believe the normalization of the libraries is still appropriate. We never needed to re-prep the libraries as there was sufficient material for many more experiments than were performed. However, should one ever need to re-prep an LS-MPRA library, all experimental sequencing should be normalized to the respective sequenced association library to account for biased distributions, as the reviewer mentions.
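
      A minimal sketch of the kind of normalization discussed above, assuming per-fragment read counts from the RNA (abundance) library and from the sequenced association library. The log2 ratio of library-size-scaled fractions and the pseudocount are illustrative choices, not the authors' stated formula; `fraction_detected` is a hypothetical helper for the "percentage of the library profiled in the RNA" metric the reviewer requests.

```python
import math


def log2_activity(rna_counts, dna_counts, pseudocount=1.0):
    """log2 of (RNA fraction / library fraction) for each fragment in the library."""
    rna_total = sum(rna_counts.values())
    dna_total = sum(dna_counts.values())
    activity = {}
    for fragment, dna in dna_counts.items():
        rna = rna_counts.get(fragment, 0)
        # Scale each count by its library size so replicates are comparable.
        rna_frac = (rna + pseudocount) / rna_total
        dna_frac = (dna + pseudocount) / dna_total
        activity[fragment] = math.log2(rna_frac / dna_frac)
    return activity


def fraction_detected(rna_counts, dna_counts):
    """Share of library fragments with at least one RNA read."""
    detected = sum(1 for fragment in dna_counts if rna_counts.get(fragment, 0) > 0)
    return detected / len(dna_counts)


# Hypothetical counts for three fragments.
dna = {"frag1": 10, "frag2": 10, "frag3": 10}
rna = {"frag1": 30, "frag2": 10}
activity = log2_activity(rna, dna)
detected = fraction_detected(rna, dna)
```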

      In the absence of these metrics (this would likely require the authors to repeat all experiments and is acknowledged to be outside the scope of revisions), the authors should provide information on the percentage of the library that is profiled in the RNA for each library.

      We have provided RNA profiles of all abundance libraries in Table S4. The overall fraction of fragments represented in the RNA pools was lower than that observed in other published MPRAs. This difference is expected given that most MPRA studies preselect fragments based on chromatin accessibility, transcription factor binding, sequence conservation, or bioinformatically predicted CRMs, thereby enriching for regulatory elements with high activity potential. Our locus-specific MPRA libraries, by contrast, include all fragments across the targeted genomic region, many of which are likely to be inactive in the tested context. Consequently, only a smaller proportion of fragments show measurable RNA expression.

      (e) Fragment sizes

      Please provide a density plot or something similar showcasing the size distribution of the libraries generated. Is there any correlation between sequence activity and the size of fragments?

      We have generated size distribution plots and correlations between fragment size and activity of all libraries and have included them in Supplemental Document A.

      (2) Questions about the statistical validity of results:

      (a) What threshold is utilized for calling a sequence as active? This is important as NR3 does not seem to be an element that has significant activity.

      See comments about peak calling in prior responses.

      (b) A Fisher's exact test using cells from single-cell RNA-sequencing as replicate samples is inappropriate as the cells are i) not from replicate experiments and ii) potentially in different cell states. The proportions of cells across replicate scRNA-seq datasets would be more appropriate.

      We thank the reviewer for raising this important point. While we agree that individual cells do not substitute for biological replicates, we believe Fisher’s exact test remains appropriate for testing whether gene expression is associated with Olig2 expression within a single scRNA-seq dataset. The test assesses co-occurrence at the level of individual cells, which is valid under the assumption that each cell represents an independent sampling of transcriptional states, even when it is possible that cells are in different states. We use this method as an exploratory tool to identify candidate genes associated with Olig2 expression in this dataset, and in the future, this could also be further validated by comparing the proportions of cells across replicate datasets, as the reviewer mentions.
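
      For concreteness, a cell-level co-occurrence test of this kind operates on a 2x2 table of cell counts (gene detected or not, Olig2 detected or not). The sketch below is a plain-Python two-sided Fisher's exact test; the example counts are hypothetical, and nothing here is drawn from the authors' code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: gene detected / not detected; columns: Olig2 detected / not detected.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x cells in the top-left entry.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: 3 gene+/Olig2+, 1 gene+/Olig2-, 1 gene-/Olig2+, 3 gene-/Olig2-.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

      Because each cell is one observation in this table, the p-value reflects association within a single dataset only, which is the limitation the reviewer raises.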

      (3) Discussion of the reporter/Olig2/Ngn2 RNA/protein disconnect needs to be expanded. Some simpler explanations for the presence of GFP in Olig2- and Ngn2- cells, as well as the presence of Olig2 or Ngn2 in GFP- cells, is that (i) these putative CRMs are being introduced to cells in plasmids, taking them out of their native genomic context where they may be inaccessible or repressed and allowing them to drive reporter expression even if their candidate target gene is not endogenously expressed, (ii) these putative CRMs may regulate genes besides just Olig2 or Ngn2, and (iii) Olig2 and Ngn2 are regulated by far more regulatory elements than the 3 or 4 being tested in each reporter assay, so their expression likely does not rely solely on the activity of the few putative CRMs tested.

      We have added these points in an expanded discussion in the text.

      (4) Problems with figures: Low resolution of many IGV genome tracks, pink 'co-expression' dots are completely indiscernible. Numbers should be listed with the pie charts. BFP expression should be shown since this is being quantified, especially since electroporation efficiency can change across age and/or tissue samples.

      We have reconfigured the IGV tracks so that they are higher resolution and have included supplemental tables for the numbers pertaining to the pie charts. For electroporation controls (BFP and RFP), BFP expression is shown in Figs 5S, 6, and 10S and the RFP electroporation control is shown in Fig. 11. Though BFP is sometimes used as a qualifier in the denominator of some of the quantification, displaying its expression, particularly in combination with three other signals that are already included in most images, provides limited utility.

      (5) More information is required to understand the utility of the d-MPRA. Detailed quantification of the number of mutations/fragments needs to be ascertained. When multiple mutations are present, how are the authors controlling for which mutation is affecting activity? What is the coverage of the loci of interest for mutational burden (ie, is every base pair mutated in at least one fragment?). For mutations that increase the activity of the element, are there specific sequence features that increase activity (new motifs generated)?

      The d-MPRA platform is a high-throughput assay that seeks to identify putative sub-regions within CRMs nominated by the LS-MPRA, or any other assay. It relies on deep mutational coverage to determine positive and negative regulatory sub-regions of the CRMs. While many reads have multiple mutations, they are broadly co-occurring across the entire fragment (see Supplemental Document A) so as not to create a false linkage between the sites. Every individual site is mutated many times with roughly even coverage across each fragment (see Supplemental Document A), thus allowing us to assess the requirement of each base in contributing to a putative CRM’s activity. Comparing d-MPRA plots using bulk fragments or fragments with singleton mutations (Supplemental Document A) yielded almost identical plots for two libraries, and similar plots for the third library. Any differences between analyses of fragments with one or more mutations are likely a result of either sequencing depth or the requirement of multiple bases for binding or CRM activation. Follow-up experiments investigating intra-CRM interactions would elucidate such variability. Whether new motifs are generated for any specific substitution is an interesting question, which could be followed up for a CRM of interest. The d-MPRA data that we provide would serve as the starting point for such follow-up experiments.

      (6) Transcription factors as regulators of CRM-activity.

      It is appreciated that the authors validated the binding of transcription factors to NR2. However, this correlative analysis should be further tested in follow-up experiments to highlight novel biology using systems already in place. Potential experiments that could be performed include the following (reagents in hand, or performed in a manner similar to experiments performed by the lab in previous publications):

      (a) over-expression of TF using LS-MPRA library.

      (b) over-expression of TF using d-MPRA library, showing that mutations in the putative TF binding site disrupt activity compared to non-mutated sequences.

      (c) performing TF over-expression using target CRMs, including sequences where the TF binding site is mutated (similar to a small MPRA).

      (d) the quantification of target gene expression when i) TF is over-expressed, ii) CRM is activated using CRISPRa, or iii) CRM is inhibited using CRISPRi.

      These are all valid follow-up experiments. Please see prior responses we have provided regarding further validation.

      Minor points

      (1) Please acknowledge that some distal regulatory sequences may be contained outside of the BAC regions. Also, the authors should emphasize the point that the assay is NOT cell-type-specific or specific to regulatory sequences for the gene of interest, but ALL regulatory sequences contained within the locus. The discussion of this with respect to Ift122 and Rpl32 is somewhat confusing.

      We have added a sentence in the Discussion addressing possible CRMs outside the BAC coverage. We believe it is implicitly understood that the assay only screens regulatory activity in the BAC, and believe we have addressed this in the manuscript.

      If one wishes to use a candidate CRM to drive gene expression in a targeted cell type, one needs to establish specificity. In particular, specificity needs to be established in the context of the vector that is being used. Non-integrated vs integrated vectors, different types of viral vectors with their own confounding regulatory sequences, different types of plasmids and methods of delivery, and copy number can all affect specificity. We provided a double in situ hybridization method for the examination of specificity for some of the novel candidate CRMs. It was quite difficult in the case of Olig2 and Ngn2 as their RNAs and proteins are unstable. We would need to provide further evidence should we wish to use these candidate CRMs for directing expression specifically in Olig2- or Ngn2-expressing cells. We suggest that an investigator can choose the vector and method for establishing specificity depending upon the goals of the application.

      (2) I am curious as to why low-resolution, pseudo-bulked single-nucleus ATAC was utilized instead of the more comprehensive retina ATAC samples available at similar time-points (for example, Al Diri et al., 2017: E14, E17, P0, P3, P7, and P10).

      The use of pseudo-bulked single-nucleus ATAC-seq data provided a convenient and consistent comparison to our LS-MPRA results. We agree that incorporating higher-resolution datasets such as those from Al Diri et al. would be valuable for future analyses aimed at linking CRM activity with broader chromatin accessibility dynamics.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      The Reviewer structured their review such that their first two recommendations specifically concerned the two major weaknesses they viewed in the initial submission. For clarity and concision, we have copied their recommendations to be placed immediately following their corresponding points on weaknesses.

      Strengths:

      Studying prediction error from the lens of network connectivity provides new insights into predictive coding frameworks. The combination of various independent datasets to tackle the question adds strength, including two well-powered fMRI task datasets, resting-state fMRI interpreted in relation to behavioral measures, as well as EEG-fMRI.

      Weaknesses:

      Major:

      (R1.1) Lack of multiple comparisons correction for edge-wise contrast:

      The analysis of connectivity differences across three levels of prediction error was conducted separately for approximately 22,000 edges (derived from 210 regions), yet no correction for multiple comparisons appears to have been applied. Then, modularity was applied to the top 5% of these edges. I do not believe that this approach is viable without correction. It does not help that a completely separate approach using SVMs was FDR-corrected for 210 regions.

      [Later recommendation] Regarding the first major point: To address the issue of multiple comparisons in the edge-wise connectivity analysis, I recommend using the Network-Based Statistic (NBS; Zalesky et al., 2010). NBS is well-suited for identifying clusters (analogous to modules) of edges that show statistically significant differences across the three prediction error levels, while appropriately correcting for multiple comparisons.
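
      The scale of the multiple-comparisons problem raised here can be made explicit with a short calculation. The region count comes from the review; the per-edge alpha of 0.05 is an assumption used only for illustration, since the uncorrected threshold is not stated in this exchange.

```python
n_regions = 210
# Number of unique undirected edges among 210 regions.
n_edges = n_regions * (n_regions - 1) // 2

alpha = 0.05  # assumed per-edge threshold, for illustration only
# Expected number of false positives if every edge-wise null hypothesis were true.
expected_false_positives = alpha * n_edges
# Per-edge threshold under a Bonferroni correction, the most conservative option.
bonferroni_threshold = alpha / n_edges
```

      This gives 21,945 edges, consistent with the "approximately 22,000" cited by the reviewer.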

      Thank you for bringing this up. We acknowledge that our modularity analysis does not evaluate statistical significance. Originally, the modularity analysis was meant to provide a connectome-wide summary of the connectivity effects, whereas the classification-based analysis was meant to address the need for statistical significance testing. However, as the reviewer points out, it would be better if significance were tested in a manner more analogous to the reported modules. As they suggest, we updated the Supplemental Materials (SM) to include the results of Network-Based Statistic analysis (SM p. 1-2):

      “(2.1) Network-Based Statistic

      Here, we evaluate whether PE significantly impacts connectivity at the network level using the Network-Based Statistic (NBS) approach.[1] NBS relied on the same regression data generated for the main-text analysis, whereby a regression is performed examining the effect of PE (Low = –1, Medium = 0, High = +1) on connectivity for each edge. This was done across the connectome, and for each edge, a z-score was computed. For NBS, we thresholded edges to |Z| > 3.0, which yielded one large network cluster, shown in Figure S3. The size of the cluster – i.e., number of edges – was significant (p < .05) per a permutation test using 1,000 random shuffles of the condition data for each participant, as is standard.[1] These results demonstrate that the network-level effects of PE on connectivity are significant. The main-text modularity analysis converts this large cluster into four modules, which are more interpretable and open the door to further analyses”.
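
      The two core steps of the procedure quoted above (thresholding edge-wise z-scores and measuring the largest connected cluster, then comparing that size against a permutation null) can be sketched as follows. This is a simplified illustration, not the authors' implementation or the reference NBS toolbox: the function names and toy values are hypothetical, the `(exceed + 1) / (n + 1)` p-value is one common convention assumed here, and generating the null sizes (re-running the edge-wise regression on shuffled condition labels) is not shown.

```python
def largest_component_size(edge_z, threshold=3.0):
    """Number of edges in the largest connected cluster of suprathreshold edges."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    kept = [(u, v) for (u, v), z in edge_z.items() if abs(z) > threshold]
    for u, v in kept:
        parent[find(u)] = find(v)

    # Count surviving edges per connected component.
    sizes = {}
    for u, _ in kept:
        root = find(u)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)


def permutation_p_value(observed_size, null_sizes):
    """Share of permutations whose largest cluster is at least as large as observed."""
    exceed = sum(size >= observed_size for size in null_sizes)
    return (exceed + 1) / (len(null_sizes) + 1)


# Hypothetical z-scores keyed by (region, region).
edge_z = {(0, 1): 3.5, (1, 2): -4.0, (3, 4): 3.2, (5, 6): 1.0}
observed = largest_component_size(edge_z)
p_value = permutation_p_value(observed, null_sizes=[0, 1, 1, 3])
```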

      We updated the Results to mention these findings before describing the modularity analysis (p. 8-9):

      “After demonstrating that PE significantly influences brain-wide connectivity using Network-Based Statistic analysis (Supplemental Materials 2.1), we conducted a modularity analysis to study how specific groups of edges are all sensitive to high/low-PE information.”

      (R1.2) Lack of spatial information in EEG:

      The EEG data were not source-localized, and no connectivity analysis was performed. Instead, power fluctuations were averaged across a predefined set of electrodes based on a single prior study (reference 27), as well as across a broader set of electrodes. While the study correlates these EEG power fluctuations with fMRI network connectivity over time, such temporal correlations do not establish that the EEG oscillations originate from the corresponding network regions. For instance, the observed fronto-central theta power increases could plausibly originate from the dorsal anterior cingulate cortex (dACC), as consistently reported in the literature, rather than from a distributed network. The spatially agnostic nature of the EEG-fMRI correlation approach used here does not support interpretations tied to specific dorsal-ventral or anterior-posterior networks. Nonetheless, such interpretations are made throughout the manuscript, which overextends the conclusions that can be drawn from the data.

      [Later recommendation] Regarding the second major point: I suggest either adopting a source-localized EEG approach to assess electrophysiological connectivity or revising all related sections to avoid implying spatial specificity or direct correspondence with fMRI-derived networks. The current approach, which relies on electrode-level power fluctuations, does not support claims about the spatial origin of EEG signals or their alignment with specific connectivity networks.

      We thank the reviewer for this important point, which allows us to clarify the specific and distinct contributions of each imaging modality in our study. Our primary goal for Study 3 was to leverage the high temporal resolution of EEG to identify the characteristic frequency at which the fMRI-defined global connectivity states fluctuate. The study was not designed to infer the spatial origin of these EEG signals, a task for which fMRI is better suited and which we addressed in Studies 1 and 2.

      As the reviewer points out, fronto-central theta is generally associated with the dACC. We agree with this point entirely. We suspect that there is some process linking dACC activation to the identified network fluctuations – some type of relationship that does not manifest in our dynamic functional connectivity analyses – although this is only a hypothesis and one that is beyond the present scope.

      We updated the Discussion to mention these points and acknowledge the ambiguity regarding the correlation between network fluctuation amplitude (fMRI) and Delta/Theta power (EEG) (p. 24):

      “We specifically interpret the fMRI-EEG correlation as reflecting fluctuation speed because we correlated EEG oscillatory power with the fluctuation amplitude computed from fMRI data. Simply correlating EEG power with the average connectivity or the signed difference between posterior-anterior and ventral-dorsal connectivity yields null results (Supplemental Materials 6), suggesting that this is a very particular association, and viewing it as capturing fluctuation amplitude provides a parsimonious explanation. Yet, this correlation may be interpreted in other ways. For example, resting-state Theta is also a signature of drowsiness,[2] which may correlate with PE processing, but perhaps should be understood as some other mechanism. Additionally, Theta is widely seen as a sign of dorsal anterior cingulate cortex activity,[3] and it is unclear how to reconcile this with our claims about network fluctuations. Nonetheless, as we show with simulations (Supplemental Materials 5), a correlation between slow fMRI network fluctuations and fast EEG Delta/Theta oscillations is also consistent with a common global neural process oscillating rapidly and eliciting both measures.”

      Regarding source-localization, several papers have described known limitations of this strategy for drawing precise anatomical inferences,[4–6] and this seems unnecessary given that our fMRI analyses already provide more robust anatomical precision. We intentionally used EEG in our study for what it measures most robustly: millisecond-level temporal dynamics.

      (R1.2a) Examples of problematic language include:

      Line 134: "detection of network oscillations at fast speeds" - the current EEG approach does not measure networks.

      This is an important issue. We acknowledge that our EEG approach does not directly measure fMRI-defined networks. Our claim is inferential, designed to estimate the temporal dynamics of the large-scale fMRI patterns we identified. The correlation between our fMRI-derived fluctuation amplitude (|PA – VD|) and 3-6 Hz EEG power provides suggestive evidence that the transitions between these network states occur at this frequency, rather than being a direct measurement of network oscillations.
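
      To make the distinction between fluctuation amplitude and signed connectivity difference concrete, the sketch below correlates a toy EEG power series with both |PA – VD| and PA – VD. All values are invented for illustration and are constructed so that only the amplitude measure tracks EEG power; variable names are hypothetical and the authors' windowing and preprocessing are not represented.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy per-window connectivity for the posterior-anterior (PA) and
# ventral-dorsal (VD) patterns, plus toy 3-6 Hz EEG power per window.
pa = [0.6, 0.2, 0.5, 0.1]
vd = [0.1, 0.3, 0.4, 0.6]
eeg_power = [2.0, 1.0, 1.0, 2.0]

signed_difference = [p - v for p, v in zip(pa, vd)]
fluctuation_amplitude = [abs(diff) for diff in signed_difference]

r_amplitude = pearson_r(fluctuation_amplitude, eeg_power)
r_signed = pearson_r(signed_difference, eeg_power)
```

      In this constructed example the amplitude correlates with EEG power while the signed difference does not, which mirrors the logic of the specificity analysis without saying anything about the size of the real effect.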

      To support the validity of this inference, we performed two key analyses (now in Supplemental Materials). First, a simulation study provides a proof-of-concept, confirming our method can recover the frequency of a fast underlying oscillator from slow fMRI and fast EEG data. Second, a specificity analysis shows the EEG correlation is unique to our measure of fluctuation amplitude and not to simpler measures like overall connectivity strength. These analyses demonstrate that our interpretation is more plausible than alternative explanations.

      Overall, we have revised the manuscript to be more conservative in the language employed, such as presenting alternative explanations to the interpretations put forth based on correlative/observational evidence (e.g., our modifications above described in our response to comment R1.2). In addition, we have made changes throughout the report to state the issues related to reverse inference more explicitly and to better communicate that the evidence is suggestive – please see our numerous changes described in our response to comment R3.1. For the statement that the reviewer specifically mentioned here, we revised it to be more cautious (p. 7):

      “Although such speed outpaces the temporal resolution of fMRI, correlating fluctuations in dynamic connectivity measured from fMRI data with EEG oscillations can provide an estimate of the fluctuations’ speed. This interpretation of a correlation again runs up against issues related to reverse inference but would nonetheless serve as initial suggestive evidence that spontaneous transitions between network states occur rapidly.”

      (R1.2b) Line 148: "whether fluctuations between high- and low-PE networks occur sufficiently fast" - this implies spatial localization to networks that is not supported by the EEG analysis.

      Building on our changes described in our immediately prior response, we adjusted our text here to say our analyses searched for evidence consistent with the idea that the network fluctuations occur quickly rather than searching for decisive evidence favoring this idea (p. 7-8):

      “Finally, we examined rs-fMRI-EEG data to assess whether we find parallels consistent with the high/low-PE network fluctuations occurring at fast timescales suitable for the type of cognitive operations typically targeted by PE theories.”

      (R1.2c) Line 480: "how underlying neural oscillators can produce BOLD and EEG measurements" - no evidence is provided that the same neural sources underlie both modalities.

      As described above, these claims are based on the simulation study demonstrating that this is a possibility, and we have revised the manuscript overall to be clearer that this is our interpretation while providing alternative explanations.

      Reviewer #2 (Public review):

      Strengths:

      Clearly, a lot of work and data went into this paper, including 2 task-based fMRI experiments and the resting state data for the same participants, as well as a third EEG-fMRI dataset. Overall, well written with a couple of exceptions on clarity, as per below, and the methodology appears overall sound, with a couple of exceptions listed below that require further justification. It does a good job of acknowledging its own weakness.

      Weaknesses:

      (R2.1) The paper does a good job of acknowledging its greatest weakness, the fact that it relies heavily on reverse inference, but cannot quite resolve it. As the authors put it, "finding the same networks during a prediction error task and during rest does not mean that the networks' engagement during rest reflects prediction error processing". Again, the authors acknowledge the speculative nature of their claims in the discussion, but given that this is the key claim and essence of the paper, it is hard to see how the evidence is compelling to support that claim.

      We thank the reviewer for this comment. We agree that reverse inference is a fundamental challenge and that our central claim requires a particularly high bar of evidence. While no single analysis resolves this issue, our goal was to build a cumulative case that is compelling by converging on the same conclusion from multiple, independent lines of evidence.

      For our investigation, we initially established a task-general signature of prediction error (PE). By showing the same neural pattern represents PE in different contexts, we constrain the reverse inference, making it less likely that our findings are a task-specific artifact and more likely that they reflect the core, underlying process of PE. Building on this, our most compelling evidence comes from linking task and rest at the individual level. We didn't just find the same general network at rest; we showed that an individual’s unique anatomical pattern of PE-related connectivity during the task specifically predicts their own brain's fluctuation patterns at rest. This highly specific, person-by-person correspondence provides a direct bridge between an individual's task-evoked PE processing and their intrinsic, resting-state dynamics. Furthermore, these resting-state fluctuations correlate specifically with the 3-6 Hz theta rhythm—a well-established neural marker for PE.

      While reverse inference remains a fundamental limitation for many studies on resting-state cognition, the aspects mentioned above, we believe, provide suggestive evidence, favoring our PE interpretation. Nonetheless, we have made changes throughout the manuscript to be more conservative in the language we use to describe our results, to make it clear what claims are based on correlative/observational evidence, and to put forth alternative explanations for the identified effects. Please find our numerous changes detailed in our response to comment R3.1.

      (R2.2) Given how uncontrolled cognition is during "resting-state" experiments, the parallel made with prediction errors elicited during a task designed for that effect is a little difficult to make. How often are people really surprised when their brains are "at rest", likely replaying a previously experienced event or planning future actions under their control? It seems to be more likely a very low prediction error scenario, if at all surprising.

      We (and some others) take a broad interpretation of PE and believe it is often more intuitive to think about PE minimization in terms of uncertainty rather than “surprise”; the word “surprise” usually implies a sudden emotive reaction from the violation of expectations, which is not useful here.

      When planning future actions, each step of the plan is spurred by the uncertainty of what is the appropriate action given the scenario set up by prior steps. Each planned step erases some of that uncertainty. For example, you may be mentally simulating a conversation, what you will say, and what another person will say. Each step of this creates uncertainty of “what is the appropriate response?” Each reasoning step addresses contingencies. While planning, you may also uncover more obvious forms of uncertainty, sparking memory retrieval to finish it. A resting-state participant may think to cook a frozen pizza when they arrive home, but be uncertain about whether they have any frozen pizzas left, prompting episodic memory retrieval to address this uncertainty. We argue that every planning step or memory retrieval can be productively understood as being sparked by uncertainty/surprise (PE), and the subsequent cognitive response minimizes this uncertainty.

      We updated the Introduction to include a paragraph near the start providing this explanation (p. 3-4):

      “PE minimization may broadly coordinate brain functions of all sorts, including abstract cognitive functions. This includes the types of cognitive processes at play even in the absence of stimuli (e.g., while daydreaming). While it may seem counterintuitive to associate this type of cognition with PE – a concept often tied to external surprises – it has been proposed that the brain's internal generative model is continuously active.[12–14] Spontaneous thought, such as planning a future event or replaying a memory, is not a passive, low-PE process. Rather, it can be seen as a dynamic cycle of generating and resolving internal uncertainty. While daydreaming, you may be reminded of a past conversation, where you wish you had said something different. This situation contains uncertainty about what would have been the best thing to say. Wondering about what you wish you said can be viewed as resolving this uncertainty, in principle, forming a plan if the same situation ever arises again in the future. Each iteration of the simulated conversation repeatedly sparks and then resolves this type of uncertainty.”

      (R2.3) The quantitative comparison between networks under task and rest was done on a small subset of the ROIs rather than on the full network - why? Noting how small the correlation between task and rest is (r=0.021) and that's only for part of the networks, the evidence is a little tenuous. Running the analysis for the full networks could strengthen the argument.

      We thank the reviewer for this opportunity to clarify our method. A single correlation between the full, aggregated networks would be conceptually misaligned with what we aimed to assess. To test for a person-specific anatomical correspondence, it is necessary to examine the link between task and rest at a granular level. We therefore asked whether the specific parts of an individual's network most responsive to PE during the task are the same parts that show the strongest fluctuations at rest. Our analysis, performed iteratively across all 3,432 possible ROI subsets, was designed specifically to answer this question, which would be obscured by an aggregated network measure.

      We appreciate the reviewer's concern about the modest effect size (r = .021). However, this must be contextualized, as the short task scan has very low reliability (.08), which imposes a severe statistical ceiling on any possible task-rest correlation. Finding a highly significant effect (p < .001) in the face of such noisy data, therefore, provides robust evidence for a genuine task-rest correspondence.
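
      The "statistical ceiling" argument can be illustrated with the classical attenuation relation from test theory, in which an observed correlation is bounded by the square root of the product of the two measures' reliabilities. The response does not state that this formula was used, and the resting-state reliability is not reported here, so the value of 1.0 below is a best-case placeholder rather than an estimate.

```python
from math import sqrt


def max_observable_r(reliability_x, reliability_y):
    """Upper bound on an observed correlation given the two reliabilities."""
    return sqrt(reliability_x * reliability_y)


def disattenuated_r(r_observed, reliability_x, reliability_y):
    """Observed correlation corrected for attenuation due to unreliability."""
    return r_observed / sqrt(reliability_x * reliability_y)


task_reliability = 0.08  # split-half reliability reported in the response
rest_reliability = 1.0   # placeholder assumption; not reported in this exchange

ceiling = max_observable_r(task_reliability, rest_reliability)
corrected = disattenuated_r(0.021, task_reliability, rest_reliability)
```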

      We updated the Discussion to discuss this point (p. 22-23):

      “A key finding supporting our interpretation is the significant link between individual differences in task-evoked PE responses and resting-state fluctuations. One might initially view the effect size of this correspondence (r = .021) as modest. However, this interpretation must be contextualized by the considerable measurement noise inherent in short task-fMRI scans; the split-half reliability of the task contrast was only .08. This low reliability imposes a severe statistical ceiling on any possible task-rest correlation. Therefore, detecting a highly significant (p < .001) relationship despite this constraint provides robust evidence for a genuine link. Furthermore, our analytical approach, which iteratively examined thousands of ROI subsets rather than one aggregated network, was intentionally granular. The goal was not simply to correlate two global measures, but to test for a person-specific anatomical correspondence – that is, whether the specific parts of an individual's network most sensitive to PE during the task are the same parts that fluctuate most strongly at rest. An aggregate analysis would obscure this critical spatial specificity. Taken together, this granular analysis provides compelling evidence for an anatomically consistent fingerprint of PE processing that bridges task-evoked activity and spontaneous resting-state dynamics, strengthening our central claim.”

      (R2.4) Looking at the results in Figure 2C, the four-quadrant description of the networks labelled for low and high PE appears a little simplistic. The authors state that this four-quadrant description omits some ROIs as motivated by prior knowledge. This would benefit from a more comprehensive justification. Which ROIs are excluded, and what is the evidence for exclusion?

      Our four-quadrant model is a principled simplification designed to distill the dominant, large-scale connectivity patterns from the complex modularity results. This approach focuses on coherent, well-documented anatomical streams while setting aside a few anatomically distant and disjoint ROIs that were less central to the main modules. This heuristic additionally unlocks more robust and novel analyses.

      The two low-PE posterior-anterior (PA) pathways are grounded in canonical processing streams. (i) The OC-ATL connection mirrors the ventral visual stream (the “what” pathway), which is fundamental for object recognition and is upregulated during the smooth processing of expected stimuli. (ii) The IPL-LPFC connection represents a core axis of the dorsal attention stream and the Fronto-Parietal Control Network (FPCN), reflecting the maintenance of top-down cognitive control when information is predictable; the IPL-LPFC module excludes ROIs in the middle temporal gyrus, which are often associated with the FPCN but are not covered here.

      In contrast, the two high-PE ventral-dorsal (VD) pathways reflect processes for resolving surprise and conflict. (i) The OC-IPL connection is a classic signature of attentional reorienting, where unexpected sensory input (high PE) triggers a necessary shift in attention; the OC-IPL module excludes some ROIs that are anterior to the occipital lobe and enter the fusiform gyrus and inferior temporal lobe. (ii) The ATL-LPFC connection aligns with mechanisms for semantic re-evaluation, engaging prefrontal control regions to update a mental model in the face of incongruent information.

      Beyond its functional/anatomical grounding, this simplification provides powerful methodological and statistical advantages. It establishes a symmetrical framework that makes our dynamic connectivity analyses tractable, such as our “cube” analysis of state transitions, which required overlapping modules. Critically, this model also offers a statistical safeguard. By ensuring each quadrant contributes to both low- and high-PE connectivity patterns, we eliminate confounds like region-specific signal variance or global connectivity. This design choice isolates the phenomenon to the pattern of connectivity itself (posterior-anterior vs. ventral-dorsal), making our interpretation more robust.
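The symmetry argument above can be made concrete with a small sketch. It encodes only the quadrant labels and pairings named in this response (OC, ATL, IPL, LPFC); the dictionary layout and the helper name `quadrant_counts` are illustrative assumptions, not the authors' code.

```python
from collections import Counter

# Pathway pairings as described in the response; the data layout is hypothetical.
PATHWAYS = {
    "posterior-anterior (low PE)": [("OC", "ATL"), ("IPL", "LPFC")],
    "ventral-dorsal (high PE)": [("OC", "IPL"), ("ATL", "LPFC")],
}


def quadrant_counts(pathways):
    """Count how often each quadrant appears within each pathway family."""
    return {
        family: Counter(quadrant for pair in pairs for quadrant in pair)
        for family, pairs in pathways.items()
    }


counts = quadrant_counts(PATHWAYS)
for family, counter in counts.items():
    print(family, dict(counter))
```

Every quadrant appears exactly once in each family, which is the property the response relies on: a change confined to one quadrant (for example, its signal variance) would enter the PA and VD measures equally.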

      We updated the end of the Study 1A results (p. 10-11):

      “Some ROIs appear in Figure 2C but are excluded from the four targeted quadrants (Figures 2C & 2D) – e.g., posterior inferior temporal lobe and fusiform ROIs are excluded from the OC-IPL module, and middle temporal gyrus ROIs are excluded from the IPL-LPFC modules. These exclusions, in favor of a four-quadrant interpretation, are motivated by existing knowledge of prominent structural pathways among these quadrants. This interpretation is also supported by classifier-based analyses showing connectivity within each quadrant is significantly influenced by PE (Supplemental Materials 2.2), along with analyses of single-region activity showing that these areas also respond to PE independently (Supplemental Materials 3). Hence, we proceeded with further analyses of these quadrants’ connections, which summarize PE’s global brain effects.

      “This four-quadrant setup also imparts analytical benefits. First, this simplified structure may better generalize across PE tasks, and Study 1B would aim to replicate these results with a different design. Second, the four quadrants mean that each ROI contributes to both the posterior-anterior and ventral-dorsal modules, which would benefit later analyses and rules out confounds such as PE eliciting increased/decreased connectivity between an ROI and the rest of the brain. An additional, less key benefit is that this setup allows more easily evaluating whether the same phenomena arise using a different atlas (Supplemental Materials Y).”

      (R2.5) The EEG-fMRI analysis claiming 3-6Hz fluctuations for PE is hard to reconcile with the fact that fMRI captures activity that is a lot slower, while some PEs are as fast as 150 ms. The discussion acknowledges this but doesn't seem to resolve it - would benefit from a more comprehensive argument.

      We thank the reviewer for raising this important point, which allows us to clarify the logic of our multimodal analysis. Our analysis does not claim that the fMRI BOLD signal itself oscillates at 3-6 Hz. Instead, it is based on the principle that the intensity of a fast neural process can be reflected in the magnitude of the slow BOLD response. It’s akin to using a long-exposure photograph to capture a fast-moving object; while the individual movements are blurred, the intensity of the blur in the photo serves as a proxy for the intensity of the underlying motion. In our case, the magnitude of the fMRI network difference (|PA – VD|) acts as the "blur," reflecting the intensity of the rapid fluctuations between states within that time window.

      Following this logic, we correlated this slow-moving fMRI metric with the power of the fast EEG rhythms, which reflects their amplitude. To bridge the different timescales, we averaged the EEG power over each fMRI time window and convolved it with the standard hemodynamic response function (HRF) – a crucial step to align the timing of the neural and metabolic signals. The resulting significant correlation specifically in the 3-6 Hz band demonstrates that when this rhythm is stronger, the fMRI data shows a greater divergence between network states. This allows us to infer the characteristic frequency of the underlying neural fluctuations without directly measuring them at that speed with fMRI, thus reconciling the two timescales.
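The alignment step described above (average EEG power within each fMRI window, then convolve with an HRF) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the double-gamma shape parameters (6 and 16, with a 1/6 undershoot ratio), the 1 s TR, and all function names are assumptions chosen for the example.

```python
import math


def double_gamma_hrf(tr, duration=32.0, peak_shape=6.0, under_shape=16.0, ratio=1 / 6):
    """Sample an illustrative double-gamma HRF every `tr` seconds (assumed parameters)."""
    def gamma_pdf(t, shape):
        return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)

    n_samples = int(duration / tr)
    return [
        gamma_pdf(i * tr, peak_shape) - ratio * gamma_pdf(i * tr, under_shape)
        for i in range(n_samples)
    ]


def window_mean(samples, per_window):
    """Average consecutive, non-overlapping blocks of `per_window` samples."""
    return [
        sum(samples[i:i + per_window]) / per_window
        for i in range(0, len(samples) - per_window + 1, per_window)
    ]


def convolve_causal(signal, kernel):
    """Causal convolution, truncated to the length of `signal`."""
    return [
        sum(signal[t - k] * kernel[k] for k in range(min(t + 1, len(kernel))))
        for t in range(len(signal))
    ]


# Toy EEG band-power trace: a brief burst, 4 power samples per 1 s TR, 20 TRs.
eeg_power = [1.0] * 4 + [0.0] * 76
power_per_tr = window_mean(eeg_power, per_window=4)
predicted_bold = convolve_causal(power_per_tr, double_gamma_hrf(tr=1.0))
peak_tr = predicted_bold.index(max(predicted_bold))
```

With these assumed parameters, the brief burst of power in the first window produces a predicted response that peaks several TRs later. This is the lag the convolution is meant to absorb before the EEG regressor is correlated with the |PA – VD| series.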

      Reviewer #3 (Public review):

      Bogdan et al. present an intriguing and timely investigation into the intrinsic dynamics of prediction error (PE)-related brain states. The manuscript is grounded in an intuitive and compelling theoretical idea: that the brain alternates between high and low PE states even at rest, potentially reflecting an intrinsic drive toward predictive minimization. The authors employ a creative analytic framework combining different prediction tasks and imaging modalities. They shared open code, which will be valuable for future work.

      (R3.1) Consistency in Theoretical Framing

      The title, abstract, and introduction suggest inconsistent theoretical goals of the study.

      The title suggests that the goal is to test whether there are intrinsic fluctuations in high and low PE states at rest. The abstract and introduction suggest that the goal is to test whether the brain intrinsically minimizes PE and whether this minimization recruits global brain networks. My comments here are that a) these are fundamentally different claims, and b) both are challenging to falsify. For one, task-like recurrence of PE states during resting might reflect the wiring and geometry of the functional organization of the brain emerging from neurobiological constraints or developmental processes (e.g., experience), but showing that mirroring exists because of the need to minimize PE requires establishing a robust relationship with behavior or showing a causal effect (e.g., that interrupting intrinsic PE state fluctuations affects prediction).

      The global PE hypothesis-"PE minimization is a principle that broadly coordinates brain functions of all sorts, including abstract cognitive functions"-is more suitable for discussion rather than the main claim in the abstract, introduction, and all throughout the paper.

      Given the above, I recommend that the authors clarify and align their core theoretical goals across the title, abstract, introduction, and results. If the focus is on identifying fluctuations that resemble task-defined PE states at rest, the language should reflect that more narrowly, and save broader claims about global PE minimization for the discussion. This hypothesis also needs to be contextualized within prior work. I'd like to see if there is similar evidence in the literature using animal models.

      Thank you for bringing up this issue. We have made changes throughout the paper to address these points. First, we have omitted reference to a “global PE hypothesis” from the Abstract and Introduction, in favor of structuring the Introduction in terms of a falsifiable question (p. 4):

      “We pursued this goal using three studies (Figure 1) that collectively targeted a specific question: Do the task-defined connectivity signatures of high vs. low PE also recur during rest, and if so, how does the brain transition between exhibiting high/low signatures?”

      We made changes later in the Introduction to clarify that the investigation is based on correlative evidence and requires interpretations that may be debated (p. 5-7):

      “Although this does not entirely address the reverse inference dilemma and can only produce correlative evidence, the present research nonetheless investigates these widely speculated upon PE ideas more directly than any prior work.

      Although such speed outpaces the temporal resolution of fMRI, correlating fluctuations in dynamic connectivity measured from fMRI data with EEG oscillations can provide an estimate of the fluctuations’ speed. This interpretation of a correlation again runs up against issues related to reverse inference but would nonetheless serve as initial suggestive evidence that spontaneous transitions between network states occur rapidly.

      Second, we examined the recruitment of these networks during rs-fMRI, and although the problems related to reverse inference are impossible to overcome fully, we engage with this issue by linking rs-fMRI data directly to task-fMRI data of the same participants, which can provide suggestive evidence that the same neural mechanisms are at play in both.”

      We made changes throughout the Results now better describing the results as consistent with a hypothesis rather than demonstrating it (p. 12-19):

      “In other words, we essentially asked whether resting-state participants are sometimes in low PE states and sometimes in high PE states, which would be consistent with spontaneous PE processing in the absence of stimuli.

      These emerging states overlap strikingly with the previous task effects of PE, suggesting that rs-fMRI scans exhibit fluctuations that resemble the signatures of low- and high-PE states. 

      To be clear, this does not entirely dissuade concerns about reverse inference, which would require a type of causal manipulation that is difficult (if not impossible) to perform in a resting state scan. Nonetheless, these results provide further evidence consistent with our interpretation that the resting brain spontaneously fluctuates between high/low PE network states.

      These patterns are most consistent with a characteristic timescale near 3–6 Hz for the amplitude of the putative high/low-PE fluctuations. This is notably consistent with established links between PE and Delta/Theta and is further consistent with an interpretation in which these fluctuations relate to PE-related processing during rest.”

      We have also made targeted edits to the Discussion to present the findings in a more cautious way, more clearly state what is our interpretation, and provide alternative explanations (p. 19-26):

      “The present research conducted task-fMRI, rs-fMRI, and rs-fMRI-EEG studies to clarify whether PE elicits global connectivity effects and whether the signatures of PE processing arise spontaneously during rest. This investigation carries implications for how PE minimization may characterize abstract task-general cognitive processes. […] Although there are different ways to interpret this correlation, it is consistent with high/low PE states generally fluctuating at 3-6 Hz during rest. Below, we discuss these three studies’ findings.

      Our rs-fMRI investigation examined whether resting dynamics resemble the task-defined connectivity signatures of high vs. low PE, independent of the type of stimulus encountered. The resting-state analyses indeed found that, even at rest, participants’ brains fluctuated between strong ventral-dorsal connectivity and strong posterior-anterior connectivity, consistent with shifts between states of high and low PE. This conclusion is based on correlative/observational evidence and so may be controversial as it relies on reverse inference.

      These patterns resemble global connectivity signatures seen in resting-state participants, and correlations between fMRI and EEG data yield associations, consistent with participants fluctuating between high-PE (ventral-dorsal) and low-PE (posterior-anterior) states at 3-6 Hz. Although definitively testing these ideas is challenging, given that rs-fMRI is defined by the absence of any causal manipulations, our results provide evidence consistent with PE minimization playing a role beyond stimulus processing.”

      (R3.2) Interpretation of PE-Related Fluctuations at Rest and Its Functional Relevance. It would strengthen the paper to clarify what is meant by "intrinsic" state fluctuations. Intrinsic might mean task-independent, trait-like, or spontaneously generated. Which do the authors mean here? Is the key prediction that these fluctuations will persist in the absence of a prediction task?

      Of the three terms the reviewer mentioned, “spontaneous” and “task-independent” are the most accurate descriptors. We conceptualize these fluctuations as a continuous background process that persists across all facets of cognition, without requiring a task explicitly designed to elicit prediction error – although we, along with other predictive coding papers, would argue that all cognitive tasks are fundamentally rooted in PE mechanisms and thus anything can be seen as a “prediction task” (see our response to comment R2.2 for our changes to the Introduction that provide more intuition for this point). The proposed interactions can be seen as analogous to cortico-basal-thalamic loops, which are engaged across a vast and diverse array of cognitive processes.

      The prior submission only used the word “intrinsic” in the title. We have since revised it to “spontaneous,” which is more specific than “intrinsic,” and we believe clearer for a title than “task-independent” (p. 1): “Spontaneous fluctuations in global connectivity reflect transitions between states of high and low prediction error”

      We have also made tweaks throughout the manuscript to now use “spontaneously” throughout (it now appears 8 times in the paper).

      Regardless of the intrinsic argument, I find it challenging to interpret the results as evidence of PE fluctuations at rest. What the authors show directly is that the degree to which a subset of regions within a PE network discriminates high vs. low PE during task correlates with the magnitude of separation between high and low PE states during rest. While this is an interesting relationship, it does not establish that the resting-state brain spontaneously alternates between high and low PE states, nor that it does so in a functionally meaningful way that is related to behavior. How can we rule out brain dynamics of other processes, such as arousal, that also rise and fall with PE? I understand the authors' intention to address the reverse inference concern by testing whether "a participant's unique connectivity response to PE in the reward-processing task should match their specific patterns of resting-state fluctuation". However, I'm not fully convinced that this analysis establishes the functional role of the identified modules to PE because of the following:

      Theoretically, relating the activities of the identified modules directly to behavior would demonstrate a stronger functional role.

      (R3.2a) Across participants: Do individuals who exhibit stronger or more distinct PE-related fluctuations at rest also perform better on tasks that require prediction or inference? This could be assessed using the HCP prediction task, though if individual variability is limited (e.g., due to ceiling effects), I would suggest exploring a dataset with a prediction task that has greater behavioral variance.

      This is a good idea, but unfortunately difficult to test with our present data. The HCP gambling task used in our study was not designed to measure individual differences in prediction or inference and likely suffers from ceiling effects. Because the task outcomes are predetermined and not linked to participants' choices, there is very little meaningful behavioral variance in performance to correlate with our resting-state fluctuation measure.

      While we agree that exploring a different dataset with a more suitable task would be ideal, given the scope of the existing manuscript, this seems like it would be too much. Although these results would be informative, they would ultimately still not be a panacea for the reverse inference issues.

      Or even more broadly, does this variability in resting state PE state fluctuations predict general cognitive abilities like WM and attention (which the HCP dataset also provides)? I appreciate the inclusion of the win-loss control, and I can see the intention to address specificity. This would test whether PE state fluctuations reflect something about general cognition, but also above and beyond these attentional or WM processes that we know are fluctuating.

      This is a helpful suggestion, motivating new analyses: We measured the degree of resting-state fluctuation amplitude across participants and correlated it with the different individual differences measures provided with the HCP data (e.g., measures of WM performance). We computed each participant’s fluctuation amplitude measure as the average absolute difference between posterior-anterior and ventral-dorsal connectivity; this is the average of the TR-by-TR fMRI amplitude measure from Study 3. We then computed the Spearman correlation between this mean fluctuation amplitude and each of the ~200 individual difference measures provided with the HCP dataset (e.g., measures of intelligence or personality), correcting for multiple hypotheses using the False Discovery Rate approach.[18]

      We found a robust negative association with age, where older participants tend to display weaker fluctuations (r = -.16, p < .001). We additionally found a positive association with the age-adjusted score on the picture sequence task (r = .12, p<sub>corrected</sub> = .03) and a negative association with performance in the card sort task (r = -.12, p<sub>corrected</sub> = .046). It is unclear how to interpret these associations without being speculative, given that fluctuation amplitude shows one positive association with performance and one negative association, albeit across entirely different tasks. We have added these correlation results as Supplemental Materials 8 (SM p. 11):

      “(8) Behavioral differences related to fluctuation amplitude 

      To investigate whether individual differences in the magnitude of resting-state PE-state fluctuations predict general cognitive abilities, we correlated our resting-state fluctuation measure with the cognitive and demographic variables provided in the HCP dataset.

      (8.1) Methods

      For each of the 1,000 participants, we calculated a single fluctuation amplitude score. This score was defined as the average absolute difference between the time-varying posterior-anterior (PA) and ventral-dorsal (VD) connectivity during the resting-state fMRI scan (the average of the TR-by-TR measure used for Study 3). We then computed the Spearman correlation between this score and each of the approximately 200 individual difference measures provided in the HCP dataset. We corrected for multiple comparisons using the False Discovery Rate (FDR) approach.

      (8.2) Results

      The correlations revealed a robust negative association between fluctuation amplitude and age, indicating that older participants tended to display weaker fluctuations (r = -.16, p<sub>corrected</sub> < .001). After correction, two significant correlations with cognitive performance emerged: (i) a positive association with the age-adjusted score on the Picture Sequence Memory Test (r = .12, p<sub>corrected</sub> = .03), (ii) a negative association with performance on the Card Sort Task (r = -.12, p<sub>corrected</sub> = .046). As greater fluctuation amplitude is linked to better performance on one task but worse performance on another, it is unclear how to interpret these findings.”
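The multiple-comparison step in the Methods above can be illustrated with a short sketch. The text says only "False Discovery Rate (FDR) approach"; the Benjamini-Hochberg procedure shown here is one common choice and is an assumption, as are the function name and the toy p-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example with four hypothetical uncorrected p-values (not study data).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

In the analysis described, the input would be the roughly 200 uncorrected p-values from the Spearman correlations, and a measure would be reported when its adjusted value falls below .05.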

      We updated the main text Methods to direct readers to this content (p. 39-40):

      “(4.4.3) Links between network fluctuations and behavior

      We considered whether the extent of PE-related network expression states during resting-state is behaviorally relevant. We specifically investigated whether individual differences in the overall magnitude of resting-state fluctuations could predict individual difference measures, provided with the HCP dataset. This yielded a significant association with age, whereby older participants tended to display weaker fluctuations. However, associations with cognitive measures were limited. A full description of these analyses is provided in Supplemental Materials 8.”

      (R3.2b) Within participants: Do momentary increases in PE-network expression during tasks relate to better or faster prediction? In other words, is there evidence that stronger expression of PE-related states is associated with better behavioral outcomes?

      This is a good question that probes the direct behavioral relevance of these network states on a trial-by-trial basis. We agree with the reviewer's intuition; in principle, one would expect a stronger expression of the low-PE network state on trials where a participant correctly and quickly gives a high likelihood rating to a predictable stimulus.

      Following this suggestion, we performed a new analysis in Study 1A to test this. We found that network expression was indeed linked to participants’ likelihood ratings: higher likelihood ratings correspond to stronger posterior-anterior connectivity, whereas lower ratings correspond to stronger ventral-dorsal connectivity (Connectivity-Direction × likelihood, β [standardized] = .28, p = .02). Yet, this is not a strong test of the reviewer’s hypothesis, and different exploratory analyses of response time yield null results (p > .05). We suspect that the effect is too subtle for our analyses to detect with the available statistical power. A comparable analysis was not feasible for Study 1B, as its design does not provide an analogous behavioral measure of trial-by-trial prediction success.

      (R3.3) A priori Hypothesis for EEG Frequency Analysis.

      It's unclear how to interpret the finding that fMRI fluctuations in the defined modules correlate with frontal Delta/Theta power, specifically in the 3-6 Hz range. However, in the EEG literature, this frequency band is most commonly associated with low arousal, drowsiness, and mind wandering in resting, awake adults, not uniquely with prediction error processing. An a priori hypothesis is lacking here: what specific frequency band would we expect to track spontaneous PE signals at rest, and why? Without this, it is difficult to separate a PE-based interpretation from more general arousal or vigilance fluctuations.

      This point gets to the heart of the challenge with reverse inference in resting-state fMRI. We agree that an interpretation based on general arousal or drowsiness is a potential alternative that must be considered. However, what makes a simple arousal interpretation challenging is the highly specific nature of our fMRI-EEG association. As shown in our confirmatory analyses (Supplemental Materials 6), the correlation with 3-6 Hz power was found exclusively with the absolute difference between our two PE-related network states (|PA – VD|)—a measure of fluctuation amplitude. We found no significant relationship with the signed difference (a bias toward one state) or the sum (the overall level of connectivity). This specificity presents a puzzle for a simple drowsiness account; it seems less plausible that drowsiness would manifest specifically as the intensity of fluctuation between two complex cognitive networks, rather than as a more straightforward change in overall connectivity. While we cannot definitively rule out contributions from arousal, the specificity of our finding provides stronger evidence for a structured cognitive process, like PE, than for a general, undifferentiated state. 

      We updated the Discussion to make the argument above and also to remind readers that alternative explanations, such as ones based on drowsiness, are possible (p. 24):

      “We specifically interpret the fMRI-EEG correlation as reflecting fluctuation speed because we correlated EEG oscillatory power with the fluctuation amplitude computed from fMRI data. Simply correlating EEG power with the average connectivity or the signed difference between posterior-anterior and ventral-dorsal connectivity yields null results (Supplemental Materials 6), suggesting that this is a very particular association, and viewing it as capturing fluctuation amplitude provides a parsimonious explanation. Yet, this correlation may be interpreted in other ways. For example, resting-state Theta is also a signature of drowsiness,[2] which may correlate with PE processing, but perhaps should be understood as some other mechanism.”

      (R3.4) Significance Assessment

      The significance of the correlation above and all other correlation analyses should be assessed through a permutation test rather than a single parametric t-test against zero. There are a few reasons: a) EEG and fMRI time series are autocorrelated, violating the independence assumption of parametric tests;

      Standard t-tests can underestimate the true null distribution's variance, because EEG-fMRI correlations often involve shared slow drifts or noise sources, which can yield spurious correlations and inflate false positives unless tested against an appropriate null.

      Building a null distribution that preserves the slow drifts, for example, would help us understand how likely it is for the two time series to be correlated when the slow drifts are still present, and how much better the current correlation is, compared to this more conservative null. You can perform this by phase randomizing one of the two time courses N times (e.g., N=1000), which maintains the autocorrelation structure while breaking any true co-occurrence in patterns between the two time series, and compute a non-parametric p-value. I suggest using this approach in all correlation analyses between two time series.

      This is an important statistical point to clarify, and the suggested analysis is valuable. The reviewer is correct that the raw fMRI and EEG time series are autocorrelated. However, because our statistical approach is a two-level analysis, we reasoned that non-independence at the correlation-level would not invalidate the higher-level t-test. The t-test’s assumption of independence applies to the individual participants' coefficients, which are independent across participants. Thus, we believe that our initial approach is broadly appropriate, and its simplicity allows it to be easily communicated.

      Nonetheless, the permutation-testing procedure that the Reviewer describes seems like an important analysis to test, given that permutation-testing is the gold standard for evaluating statistical significance, and it could guarantee that our above logic is correct. We thus computed the analysis as the reviewer described. For each participant, we phase-randomized the fMRI fluctuation amplitude time series. Specifically, we randomized the Fourier phases of the |PA–VD| series (within run), while retaining the original amplitude spectrum; inverse transforms yielded real surrogates with the same power spectrum. This was done for each participant once per permutation. Each participant’s phase-randomized data was submitted to the analysis of each oscillatory power band as originally, generating one mean correlation for each band. This was done 1,000 times.
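The phase-randomization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes NumPy is available, the function name `phase_randomize` is hypothetical, and the toy input merely stands in for one run's |PA – VD| series.

```python
import numpy as np


def phase_randomize(series, rng):
    """Return a real surrogate with the same amplitude spectrum as `series`."""
    x = np.asarray(series, dtype=float)
    n = x.size
    spectrum = np.fft.rfft(x)
    # Rotate every Fourier coefficient by an independent random angle.
    angles = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    angles[0] = 0.0            # leave the DC term (the mean) untouched
    if n % 2 == 0:
        angles[-1] = 0.0       # the Nyquist bin must stay real for even lengths
    return np.fft.irfft(spectrum * np.exp(1j * angles), n=n)


rng = np.random.default_rng(0)
# Toy autocorrelated series standing in for one run's |PA - VD| time course.
toy_series = np.abs(np.cumsum(rng.standard_normal(128)))
surrogate = phase_randomize(toy_series, rng)
```

Because only the phases change, the surrogate keeps the power spectrum (and therefore the autocorrelation) of the original series while losing its moment-to-moment alignment with the EEG regressor. That is what makes it a conservative null for two slowly drifting signals.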

      Across the five bands, we find that the grand mean correlation of the null distribution is near zero (M<sub>r</sub> = .0006) and its 97.5<sup>th</sup> percentile critical value is r ≈ .025. This 97.5<sup>th</sup> percentile corresponds to the upper end of a 95% confidence interval for a band’s correlation, and the threshold differs minimally across bands (.024 < rs < .026). Our original correlation coefficients for Delta (M<sub>r</sub> = .042) and Theta (M<sub>r</sub> = .041), which our conclusions focused on, remained significant (p ≤ .002). We can also perform family-wise error-rate correction by taking the highest correlation across any band for a given permutation, as Reviewer comment R1.4c previously requested; the Delta and Theta effects remain significant under this correction (p<sub>FWE-corrected</sub> ≤ .003).

      These correlations were previously reported in Table 1, and we updated the caption to note what effects remain significant when evaluated using permutation-testing and with family-wise error correction (p. 19):

      “The effects for Delta, Theta, Beta, and Gamma remain significant if significance testing is instead performed using permutation-testing and with family-wise error rate correction (p<sub>corrected</sub> < .05).”

      We updated the Methods to describe the permutation-testing analysis (p. 43):

      “To confirm the significance of our fMRI-EEG correlations with a non-parametric approach, we performed a group-level permutation-test. For each of 1,000 permutations, we phase-randomized the fMRI fluctuation amplitude time series. Specifically, we randomized the Fourier phases of the |PA–VD| series (within run), while retaining the original amplitude spectrum; inverse transforms yielded real surrogates with the same power spectrum. This procedure breaks the true temporal relationship between the fMRI and EEG data while preserving its structure. We then re-computed the mean Spearman correlation for each frequency band using this phase-randomized data. We evaluated significance using a family-wise error correction approach that accounts for us analyzing five oscillatory power bands. We thus create a null distribution composed of the maximum correlation value observed across all frequency bands from each permutation. Our observed correlations were then tested for significance against this distribution of maximums.”
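The max-statistic correction described in the quoted Methods can be sketched as follows. The function name, the dictionary layout, and the (1 + count) / (1 + N) p-value convention are assumptions made for illustration; the toy numbers are not study data.

```python
def max_stat_p_values(observed, null_by_permutation):
    """Family-wise corrected p-values from a max-statistic null distribution.

    observed: {band: observed mean correlation}
    null_by_permutation: list of {band: surrogate mean correlation},
        one dictionary per permutation.
    """
    # For each permutation keep only the largest correlation across bands.
    max_null = [max(perm.values()) for perm in null_by_permutation]
    n_perm = len(max_null)
    return {
        band: (1 + sum(m >= r for m in max_null)) / (1 + n_perm)
        for band, r in observed.items()
    }


# Toy example with two bands and three permutations (hypothetical values).
observed = {"theta": 0.04, "alpha": 0.01}
null = [
    {"theta": 0.00, "alpha": 0.02},
    {"theta": 0.01, "alpha": -0.01},
    {"theta": -0.02, "alpha": 0.03},
]
p_fwe = max_stat_p_values(observed, null)
```

Comparing every band against the same distribution of per-permutation maxima controls the family-wise error rate, because a band is declared significant only if it exceeds what the best-performing band achieves by chance.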

      (R3.5) Analysis choices

      If I'm understanding correctly, the algorithm used to identify modules does so by assigning nodes to communities, but it does not itself restrict what edges can be formed from these modules. This makes me wonder whether the decision to focus only on connections between adjacent modules, rather than considering the full connectivity, was an analytic choice by the authors. If so, could you clarify the rationale? In particular, what justifies assuming that the gradient of PE states should be captured by edges formed only between nearby modules (as shown in Figure 2E and Figure 4), rather than by the full connectivity matrix? If this restriction is instead a by-product of the algorithm, please explain why this outcome is appropriate for detecting a global signature of PE states in both task and rest.

      We discuss this matter in our response to comment R2.4.

      When assessing the correspondence across task-fMRI and rs-fMRI in section 2.2.2, why was the pattern during task calculated from selecting a pair of bilateral ROIs (resulting in a group of eight ROIs), and the resting state pattern calculated from posterior-anterior/ventral-dorsal fluctuation modules? Doesn't it make more sense to align the two measures? For example, calculating task effects on these same modules during task and rest?

      We thank the reviewer for this question, as it highlights a point in our methods that we could have explained more clearly. The reviewer is correct that the two measures must be aligned, and we can confirm that they were indeed perfectly matched.

      For the analysis in Section 2.2.2, both the task and resting-state measures were calculated on the exact same anatomical substrate for each comparison. The analysis iteratively selected a symmetrical subset of eight ROIs from our larger four quadrants. For each of these 3,432 iterations, we computed the task-fMRI PE effect (the Connectivity Direction × PE interaction) and the resting-state fluctuation amplitude (E[|PA – VD|]) using the identical set of eight ROIs. The goal of this analysis was precisely to test if the fine-grained anatomical pattern of these effects correlated within an individual across the task and rest states. We will revise the text in Section 2.2.2 to make this direct alignment of the two measures more explicit.

      Recommendations for authors:

      Reviewer #1 (Recommendations for authors):

      (R1.3) Several prior studies have described co-activation or connectivity "templates" that spontaneously alternate during rest and task states, and are linked to behavioral variability. While they are interpreted differently in terms of cognitive function (e.g., in terms of sustained attention: Monica Rosenberg; alertness: Catie Chang), the relationship between these previously reported templates and those identified in the current study warrants discussion. Are the current templates spatially compatible with prior findings while offering new functional interpretations beyond those already proposed in the literature? Or do they represent spatially novel patterns?

      Thank you for this suggestion. Broadly, we do not mean to propose spatially novel patterns but rather focus on how these are repurposed for PE processing. In the Discussion, we link our identified connectivity states to established networks (e.g., the FPCN). We updated this paragraph to mention that these patterns are largely not spatially novel (p. 20):

      “The connectivity patterns put forth are, for the most part, not spatially novel and instead overlap heavily with prior functional and anatomical findings.”

      Regarding the specific networks covered in the prior work by Rosenberg and Chang that the reviewer seems to be referring to,[7,8] this research has emphasized networks anchored heavily in sensorimotor, subcortical–cerebellar, and medial frontal circuits, which mostly do not overlap with the connectivity effects we put forth.

      (R1.4) Additional points:

      (R1.4a) I do not think that the logic for taking the absolute difference of fMRI connectivity is convincing. What happens if the sign of the difference is maintained ?

      Thank you for pointing out this area that requires clarification. Our analysis targets the amplitude of the fluctuation between brain states, not the direction. We define high fluctuation amplitude as moments when the brain is strongly in either the PA state (PA > VD) or the VD state (VD > PA). The absolute difference |PA – VD| correctly quantifies this intensity, whereas a signed difference would conflate these two distinct high-amplitude moments. Our simulation study (Supplemental Materials, Section 5) provides the theoretical validation for this logic, showing how this absolute difference measure in slow fMRI data can track the amplitude of a fast underlying neural oscillator.
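The distinction between the signed and absolute measures can be illustrated with a toy example. This is only a sketch, not the simulation reported in Supplemental Materials Section 5: the antiphase sine-wave series, the baseline of 1.0, and the function names are all assumptions made for illustration. When PA and VD alternate around a common baseline, the mean signed difference stays near zero whatever the amplitude, whereas the mean absolute difference scales with it.

```python
import math

def signed_and_absolute(pa, vd):
    """Return (mean signed difference, mean absolute difference) of PA - VD."""
    diffs = [p - v for p, v in zip(pa, vd)]
    n = len(diffs)
    return sum(diffs) / n, sum(abs(d) for d in diffs) / n

def oscillating_states(amplitude, n_samples=1000, cycles=10):
    """Toy series: PA and VD alternate in antiphase around a baseline of 1.0."""
    phase = [2 * math.pi * cycles * t / n_samples for t in range(n_samples)]
    pa = [1.0 + amplitude * math.sin(p) for p in phase]
    vd = [1.0 - amplitude * math.sin(p) for p in phase]
    return pa, vd

# Hypothetical low- and high-amplitude regimes.
low_signed, low_abs = signed_and_absolute(*oscillating_states(0.1))
high_signed, high_abs = signed_and_absolute(*oscillating_states(0.5))

print(f"low amplitude:  signed={low_signed:.4f}, absolute={low_abs:.4f}")
print(f"high amplitude: signed={high_signed:.4f}, absolute={high_abs:.4f}")
```

In this toy setting the signed mean cancels across PA-dominant and VD-dominant moments, which is the conflation described above.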

      When the analysis is repeated using the signed difference, as suggested by the Reviewer, the association between the fMRI data and EEG power is not significant for any power band (ps<sub>uncorrected</sub> ≥ .47). We updated Supplemental Materials 6 to include these results. Previously, this section included the fluctuation amplitude (fMRI) × EEG power results while controlling for: (i) the signed difference between posterior-anterior and ventral-dorsal connectivity, (ii) the sum of posterior-anterior and ventral-dorsal connectivity, and (iii) the absolute value of the sum of posterior-anterior and ventral-dorsal connectivity. For completeness, we also now report the correlation between each EEG power band and each of those other three measures (SM, p. 9):

      “We additionally tested the relationship between each of those three measures and the five EEG oscillation bands. Across the 15 tests, there were no associations (ps<sub>uncorrected</sub>  ≥ .04); one uncorrected p-value was at p = .044, although this was expected given that there were 15 tests. Thus, the association between EEG oscillations and the fMRI measure is specific to the absolute difference (i.e., amplitude) measure.”
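The remark that one uncorrected p-value near .04 is unsurprising across 15 tests can be checked with simple arithmetic. The sketch below assumes the 15 tests are independent, which is an idealization (the EEG bands and the three connectivity measures are likely correlated), so the numbers are a rough guide rather than an exact family-wise error rate.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more p < alpha among n independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def expected_false_positives(n_tests, alpha=0.05):
    """Expected count of p < alpha among n null tests."""
    return n_tests * alpha

n_tests = 15  # three control measures x five EEG bands
fwer = prob_at_least_one_false_positive(n_tests)
expected = expected_false_positives(n_tests)

print(f"P(at least one p < .05 in {n_tests} null tests) = {fwer:.3f}")
print(f"Expected number of such p-values = {expected:.2f}")
```

Under these assumptions, a single nominally significant result among 15 null tests occurs slightly more than half the time.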

      (R1.4b) Reasoning of focus on frontal and theta band is weak, and described as "typical" (line 359) based on a single study.

      Sorry about this. There is a rich literature on the link between frontal theta and prediction error,[3,9–11] and we updated the Introduction to include more references to this work (p. 18): “The analysis was first done using power averaged across frontal electrodes, as these are the typical focus of PE research on oscillations.[3,9–11]”

      We have also updated the Methods to cite more studies that motivate our electrode choice (p. 41): “The analyses first targeted five midline frontal electrodes (F3, F1, Fz, F2, F4; BioSemi64 layout), given that this frontal row is typically the focus of executive-function PE research on oscillations.[9–11]”

      (R1.4c) No correction appears to have been applied for the association between EEG power and fMRI connectivity. Given that 100 frequency bins were collapsed into 5 canonical bands, a correction for 5 comparisons seems appropriate. Notably, the strongest effects in the delta and theta bands (particularly at fronto-central electrodes) may still survive correction, but this should be explicitly tested and reported.

      Thanks for this suggestion. We updated the Table 1 caption to mention which results survive family-wise error rate correction. As the reviewer suggests, the Delta/Theta effects would survive Bonferroni correction for five tests. However, per a later comment suggesting that we evaluate statistical significance with a permutation-testing approach (comment R3.4), we instead report family-wise error correction based on that approach. The revised caption is as follows (p. 19):

      “The effects for Delta, Theta, Beta, and Gamma remain significant if significance testing is instead performed using permutation-testing and with family-wise error rate correction (p<sub>corrected</sub> < .05).”
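One common way to combine permutation testing with family-wise error correction is the max-statistic approach, sketched below. This is not necessarily the procedure the authors used: the toy data, the band names, the use of Pearson correlation, and the function names are all assumptions for illustration. For each permutation the fMRI measure is shuffled across participants, the largest absolute correlation across bands is recorded, and each observed correlation is compared against that null distribution of maxima.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_stat_permutation(x, bands, n_perm=1000, seed=0):
    """FWER-corrected p-values for |r| between x and each band (max-statistic)."""
    rng = random.Random(seed)
    observed = {name: abs(pearson_r(x, y)) for name, y in bands.items()}
    shuffled = list(x)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between x and every band
        null_max.append(max(abs(pearson_r(shuffled, y)) for y in bands.values()))
    return {
        name: (1 + sum(m >= r for m in null_max)) / (n_perm + 1)
        for name, r in observed.items()
    }

# Hypothetical data: 40 participants, one coupled band and one unrelated band.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(40)]
bands = {
    "coupled": [xi + rng.gauss(0, 0.5) for xi in x],
    "unrelated": [rng.gauss(0, 1) for _ in range(40)],
}
p_corrected = max_stat_permutation(x, bands)
print(p_corrected)
```

Because every band is compared with the same distribution of maxima, the correction accounts for the number of bands without assuming they are independent.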

      (R1.4d) Line 135. Not sure I understand what you mean by "moods". What is the overall point here?

      The overall argument is that the fluctuations occur rapidly rather than slowly. By slow “moods” we refer to how a participant could enter a high anxiety state of >10 seconds, linked to high PE fluctuations, and then shift into a low anxiety state, linked to low PE fluctuations. We argue that this is not occurring. Regardless, we recognize that referring to lengths of time as short as 10 seconds or so is not a typical use of the word “mood” and is potentially ambiguous, so we have omitted this statement, which was originally on page 6: “Identifying subsecond fluctuations would broaden the relevance of the present results, as they rule out that the PE states derive from various moods.”

      (R1.4e) Line 100. "Few prior PE studies have targeted PE, contrasting the hundreds that have targeted BOLD". I don't understand this sentence. It's presumably about connectivity vs activity?

      Yes, sorry about this typo. The reviewer is correct, and that sentence was meant to mention connectivity. We corrected (p. 5): “Few prior PE studies have targeted connectivity, contrasting the hundreds that have targeted BOLD.”

      (R1.4f) Line 373: "0-0.5Hz" in the caption is probably "0-50Hz".

      Yes, this was another typo, thank you. We have corrected it (p. 19): “… every 0.5 Hz interval from 0-50 Hz.”

      Reviewer #2 (Recommendations for authors):

      (R2.6) (Page 3) When referring to the "limited" hypothesis of local PE, please clarify in what sense is it limited. That statement is unclear.

      Thank you for pointing out this text, which we now see is ambiguous. We originally use "limited" to refer to the hypothesis's constrained scope – namely, that PE is relevant to various low-level operations (e.g., sensory processing or rewards) but the minimization of PE does not guide more abstract cognitive processes. We edited this part of the Introduction to be clearer (p. 3)

      “It is generally agreed that the brain uses PE mechanisms at neuronal or regional levels,[15,16] and this idea has been useful in various low-level functional domains, including early vision [15] and dopaminergic reward processing.[17] Some theorists have further argued that PE propagates through perceptual pathways and can elicit downstream cognitive processes to minimize PE.”

      (R2.7) (Page 5) "Few prior PE have targeted PE"... this statement appears contradictory. Please clarify.

      Sorry about this typo, which we have corrected (p. 5):

      “Few prior PE studies have targeted connectivity, contrasting the hundreds that have targeted BOLD.”

      (R2.8) What happened to the data of the medium PE condition in Study 1A?

      The medium PE condition data were not excluded. We modeled the effect of prediction error on connectivity using a linear regression across the three conditions, coding them as a continuous variable (Low = -1, Medium = 0, High = +1). This approach allowed us to identify brain connections that showed a linear increase or decrease in strength as a function of increasing PE. This linear contrast is a more specific and powerful way to isolate PE-related effects than a High vs. Low contrast. We updated the Results slightly to make this clearer (p. 8-9):

      “In the fMRI data, we compared the three PE conditions’ beta-series functional connectivity, aiming to identify network-level signatures of PE processing, from low to high. […] For the modularity analysis, we first defined a connectome matrix of beta values, wherein each edge’s value was the slope of a regression predicting that edge’s strength from PE (coded as Low = -1, Medium = 0, High = +1; Figure 2A).”
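The linear coding described above can be sketched for a single edge as follows. The connectivity values are invented for illustration and the helper name `ols_slope` is hypothetical; only the coding (Low = -1, Medium = 0, High = +1) comes from the text. With equal numbers of observations per condition, the fitted slope equals half the difference between the High and Low means, and the Medium observations enter the model through the intercept and residual variance.

```python
def ols_slope(codes, values):
    """Ordinary least-squares slope of values regressed on codes."""
    n = len(codes)
    mean_c = sum(codes) / n
    mean_v = sum(values) / n
    num = sum((c - mean_c) * (v - mean_v) for c, v in zip(codes, values))
    den = sum((c - mean_c) ** 2 for c in codes)
    return num / den

PE_CODES = {"low": -1, "medium": 0, "high": 1}

# Hypothetical beta-series connectivity for one edge, three trials per condition.
edge_strength = {
    "low": [0.10, 0.14, 0.12],
    "medium": [0.20, 0.18, 0.22],
    "high": [0.31, 0.29, 0.30],
}

codes, values = [], []
for condition, strengths in edge_strength.items():
    for s in strengths:
        codes.append(PE_CODES[condition])
        values.append(s)

edge_slope = ols_slope(codes, values)
print(f"slope of edge strength on PE code: {edge_slope:.3f}")
```

Repeating this for every edge would yield a matrix of slopes of the kind the quoted passage describes as the input to the modularity analysis.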

      (R2.9) (Page 15) The point about how the dots in 6H follow those in 6J better than those in 6I is a little subjective - can the authors provide an objective measure?

      Thank you for pointing out this issue. The visual comparison using Figure 6 was not meant as a formal analysis but rather to provide intuition. However, as the reviewer describes, this is difficult to convey. Our formal analysis is provided in Supplemental Materials 5, where we report correlation coefficients between a very large number of simulated fMRI data points and EEG data points corresponding to different frequencies. We updated this part of the Results to convey this (p. 16-17):

      “Notice how the dots in Figure 6H follow the dots in Figure 6J (3 Hz) better than the dots in Figure 6I (0.5 Hz) or Figure 6K (10 Hz); this visual comparison is intended for illustrative purposes only, and quantitative analyses are provided in Supplemental Materials 5.”

      References

      (1) Zalesky, A., Fornito, A. & Bullmore, E. T. Network-based statistic: identifying differences in brain networks. Neuroimage 53, 1197–1207 (2010).

      (2) Strijkstra, A. M., Beersma, D. G., Drayer, B., Halbesma, N. & Daan, S. Subjective sleepiness correlates negatively with global alpha (8–12 Hz) and positively with central frontal theta (4–8 Hz) frequencies in the human resting awake electroencephalogram. Neuroscience letters 340, 17–20 (2003).

      (3) Cavanagh, J. F. & Frank, M. J. Frontal theta as a mechanism for cognitive control. Trends in cognitive sciences 18, 414–421 (2014).

      (4) Grech, R. et al. Review on solving the inverse problem in EEG source analysis. Journal of neuroengineering and rehabilitation 5, 25 (2008).

      (5) Palva, J. M. et al. Ghost interactions in MEG/EEG source space: A note of caution on inter-areal coupling measures. Neuroimage 173, 632–643 (2018).

      (6) Koles, Z. J. Trends in EEG source localization. Electroencephalography and clinical Neurophysiology 106, 127–137 (1998).

      (7) Rosenberg, M. D. et al. A neuromarker of sustained attention from whole-brain functional connectivity. Nature neuroscience 19, 165–171 (2016).

      (8) Goodale, S. E. et al. fMRI-based detection of alertness predicts behavioral response variability. elife 10, e62376 (2021).

      (9) Cavanagh, J. F. Cortical delta activity reflects reward prediction error and related behavioral adjustments, but at different times. NeuroImage 110, 205–216 (2015).

      (10) Hoy, C. W., Steiner, S. C. & Knight, R. T. Single-trial modeling separates multiple overlapping prediction errors during reward processing in human EEG. Communications Biology 4, 910 (2021).

      (11) Neo, P. S.-H., Shadli, S. M., McNaughton, N. & Sellbom, M. Midfrontal theta reactivity to conflict and error are linked to externalizing and internalizing respectively. Personality neuroscience 7, e8 (2024).

      (12) Friston, K. J. The free-energy principle: a unified brain theory? Nature reviews neuroscience 11, 127–138 (2010).

      (13) Feldman, H. & Friston, K. J. Attention, uncertainty, and free-energy. Frontiers in human neuroscience 4, 215 (2010).

      (14) Friston, K. J. et al. Active inference and epistemic value. Cognitive neuroscience 6, 187–214 (2015).

      (15) Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extraclassical receptive-field effects. Nature neuroscience 2, 79–87 (1999).

      (16) Walsh, K. S., McGovern, D. P., Clark, A. & O’Connell, R. G. Evaluating the neurophysiological evidence for predictive processing as a model of perception. Annals of the New York Academy of Sciences 1464, 242–268 (2020).

      (17) Niv, Y. & Schoenbaum, G. Dialogues on prediction errors. Trends in cognitive sciences 12, 265–272 (2008).

      (18) Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological) 57, 289–300 (1995).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      Summary

      We thank the reviewer for the constructive and thoughtful evaluation of our work. We appreciate the recognition of the novelty and potential implications of our findings regarding UPR activation and proteasome activity in germ cells.

      (1) The microscopy images look saturated, for example, Figure 1a, b, etc. Is this a normal way to present fluorescent microscopy?

      The apparent saturation was not present in the original images; it likely arose from image compression during PDF generation, although the EMA granule remained apparent. In the revised submission, we will provide high-resolution TIFF files to ensure accurate representation of fluorescence intensity and will carefully optimize image display settings to avoid any saturation artifacts.

      (2) The authors should ensure that all claims regarding enrichment/lower vs. lower values have indicated statistical tests.

      We fully agree. In the revised version, we will correct any quantitative comparisons where statistical tests were not already indicated, with a clear statement of the statistical tests used, including p-values in figure legends and text.

      (a) In Figure 2f, the authors should indicate which comparison is made for this test. Is it comparing 2 vs. 6 cyst numbers?

      We acknowledge that the description was not sufficiently detailed. Indeed, the test was not between 2 vs 6 cyst numbers, but between all possible ways 8-cell cysts or the larger cysts studied could fragment randomly into two pieces, and produce by chance 6-cell cysts in 13 of 15 observed examples. We will expand the legend and main text to clarify that a binomial test was used to determine that the proportion of cysts producing 6-cell fragments differed very significantly from chance.

      Revised text:

      “A binomial test was used to assess whether the observed frequency of 6-cell cyst products differed from random cyst breakage. Production of 6-cell cysts was strongly preferred (13/15 cysts; ****p < 0.0001).”
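The form of such a binomial test can be sketched as below. The null probability of producing a 6-cell product under random breakage is not stated in this response, so `P0_HYPOTHETICAL` is a placeholder chosen purely for illustration; the one-sided upper tail is also a simplifying assumption. Only the counts (13 of 15 cysts) come from the text.

```python
from math import comb

def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the one-sided binomial p-value."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

# Placeholder null probability of a 6-cell product under random breakage.
# This value is NOT from the manuscript; substitute the value implied by
# enumerating all ways a cyst could fragment into two pieces.
P0_HYPOTHETICAL = 0.25

observed_six_cell = 13  # cysts yielding a 6-cell product
total_cysts = 15        # cysts examined

p_value = binomial_upper_tail(observed_six_cell, total_cysts, P0_HYPOTHETICAL)
print(f"one-sided binomial p-value under p0={P0_HYPOTHETICAL}: {p_value:.2e}")
```

The resulting p-value depends strongly on the null probability, which is why stating how random breakage was enumerated matters for interpreting the test.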

      (b) Figures 4d and 4e do not have a statistical test indicated.

      We will include the specific statistical test used and report the corresponding p-values directly in the figure legends.

      (3) Because the system is developmentally dynamic, the major conclusions of the work are somewhat unclear. Could the authors be more explicit about these and enumerate them more clearly in the abstract?

      We will revise the abstract to better clarify the findings of this study. We will also replace the term Visham with mouse fusome to reflect its functional and structural analogy to the Drosophila and Xenopus fusomes, making the narrative more coherent and conclusive.

      (4) The references for specific prior literature are mostly missing (lines 184-195, for example).

      We appreciate this observation of a problem that occurred inadvertently when shortening an earlier version.  We will add 3–4 relevant references to appropriately support this section.

      (5) The authors should define all acronyms when they are first used in the text (UPR, EGAD, etc).

      We will ensure that all acronyms are spelled out at first mention (e.g., Unfolded Protein Response (UPR), Endosome and Golgi-Associated Degradation (EGAD)).

      (6) The jumping between topics (EMA, into microtubule fragmentation, polarization proteins, UPR/ERAD/EGAD, GCNA, ER, balbiani body, etc) makes the narrative of the paper very difficult to follow.

      We are not jumping between topics, but following a narrative relevant to the central question of whether female mouse germ cells develop using a fusome. EMA, microtubule fragmentation, polarization proteins, ER, and Balbiani body are all topics with a known connection to fusomes. This is explained in the general introduction and in relevant subsections. We appreciate the feedback that further explanation of these connections would be helpful. In the revised manuscript, use of the unified term mouse fusome will also help connect the narrative across sections. UPR/ERAD/EGAD are processes that have been studied in repair and maintenance of somatic cells and in yeast meiosis. We show that the major regulator Xbp1 is found in the fusome, and that the fusome and these rejuvenation pathway genes are expressed and maintained throughout oogenesis, rather than only during limited late stages as suggested in previous literature.

      (7) The heading title "Visham participates in organelle rejuvenation during meiosis" in line 241 is speculative and/or not supported. Drawing upon the extensive, highly rigorous Drosophila literature, it is safe to extrapolate, but the claim about regeneration is not adequately supported.

      We believe this statement is accurate given the broad scope of the term "participates." It is supported by localization of the UPR regulator Xbp1 to the fusome. Xbp1 is the ortholog of Hac1, a key gene mediating UPR-mediated rejuvenation during yeast meiosis. We also showed that rejuvenation pathway genes are expressed throughout most of meiosis (not previously known) and expanded cytological evidence of stage-specific organelle rejuvenation later in meiosis, such as mitochondrial-ER docking, in regions enriched in fusome antigens. However, we recognize the current limitations of this evidence in the mouse, and want to convey this appropriately, without going to what we believe would be the unjustified extreme of saying there is no evidence.

      Reviewer #2 (Public review):

      We thank the reviewer for the comprehensive summary and for highlighting both the technical achievement and biological relevance of our study. We greatly appreciate the thoughtful suggestions that have helped us refine our presentation and terminology.

      (1) Some titles contain strong terms that do not fully match the conclusions of the corresponding sections.

      (1a) Article title “Mouse germline cysts contain a fusome-like structure that mediates oocyte development”

      We will change the statement to: “Mouse germline cysts contain a fusome that supports germline cyst polarity and rejuvenation.”

      (1b) Result title “Visham overlaps centrosomes and moves on microtubules”

      We acknowledge that “moves” implies dynamics. We will include additional supplementary images showing small vesicular components of the mouse fusome on spindle-derived microtubule tracks.

      (1c) Result title “Visham associates with Golgi genes involved in UPR beginning at the onset of cyst formation”

      We will revise this title to: “The mouse fusome associates with the UPR regulatory protein Xbp1 beginning at the onset of cyst formation” to reflect the specific UPR protein that was immunolocalized.

      (1d) Result title “Visham participates in organelle rejuvenation during meiosis”

      We will revise this to: “The mouse fusome persists during organelle rejuvenation in meiosis.”

      (2) The authors aim to demonstrate that Visham is a fusome-like structure. I would suggest simply referring to it as a "fusome-like structure" rather than introducing a new term, which may confuse readers and does not necessarily help the authors' goal of showing the conservation of this structure in Drosophila and Xenopus germ cells. Interestingly, in a preprint from the same laboratory describing a similar structure in Xenopus germ cells, the authors refer to it as a "fusome-like structure (FLS)" (Davidian and Spradling, BioRxiv, 2025).

      We appreciate the reviewer’s insightful comment. To maintain conceptual clarity and align with existing literature, we will refer to the structure as the mouse fusome throughout the manuscript, avoiding introduction of a new term.

      Reviewer #3 (Public review):

      We thank the reviewer for emphasizing the importance of our study and for providing constructive feedback that will help us clarify and strengthen our conclusions.

      (1) Line 86 - the heading for this section is "PGCs contain a Golgi-rich structure known as the EMA granule"

      We agree that the enrichment of Golgi within the EMA PGCs was not shown until the next section. We will revise this heading to:

      “PGCs contain an asymmetric EMA granule.” 

      (2) Line 105-106, how do we know if what's seen by EM corresponds to the EMA1 granule?

      We will clarify that this identification is based on co-localization with Golgi markers (GM130 and GS28) and response to Brefeldin A treatment, which will be included as supplementary data. These findings support that the mouse fusome is Golgi-derived and can therefore be visualized by EM. The Golgi regions in E13.5 cyst cells move close together and associate with ring canals as visualized by EM (Figure 1E), the same as the mouse fusomes identified by EMA.

      (3) Line 106-107-states "Visham co-stained with the Golgi protein Gm130 and the recycling endosomal protein Rab11a1". This is not convincing as there is only one example of each image, and both appear to be distorted.

      Space is at a premium in these figures, but we have no shortage of data documenting this clear co-localization. We will replace the existing images with high-resolution, uncompressed versions for the final figures to clearly illustrate the co-staining patterns for GM130 and Rab11a1.

      (4) Line 132-133---while visham formation is disrupted when microtubules are disrupted, I am not convinced that visham moves on microtubules as stated in the heading of this section.

      We will include additional supplementary data showing small mouse fusome vesicles aligned along microtubules.

      (5) Line 156 - the heading for this section states that Visham associates with polarity and microtubule genes, including pard3, but only evidence for pard3 is presented.

      We agree and will revise the heading to: “Mouse fusome associates with the polarity protein Pard3.” We are adding data showing association of small fusome vesicles on microtubules.

      (6) Lines 196-210 - it's strange to say that UPR genes depend on DAZ, as they are upregulated in the mutants. I think there are important observations here, but it's unclear what is being concluded.

      UPR genes are not upregulated in Dazl mutants, in the sense that we have never documented them increasing. We show that UPR genes during this time behave like pluripotency genes and normally decline, but in Dazl mutants their decline is slowed. We will rephrase the paragraph to clarify that Dazl mutation partially decouples developmental processes that are normally linked, which alters UPR gene expression relative to cyst development.

      (7) Line 257-259 - wave 1 and 2 follicles need to be explained in the introduction, and how these fit with the observations here clarified.

      Follicle waves are too minor a focus of the current study to explain in the introduction, but we will refer readers to the relevant cited literature (Yin and Spradling, 2025) for further details.

      We sincerely thank all reviewers for their insightful and constructive feedback. We believe that the planned revisions—particularly the refined terminology, improved image quality, clarified statistics, and restructured abstract—will substantially strengthen the manuscript and enhance clarity for readers.

      Reviewer #1 (Recommendations for the authors):

      (1) Figure 1E: need to use some immuno-gold staining to identify the Visham. Just circling an area of cytoplasm that contains ER between germ cell pairs is not enough.

      We appreciate the reviewer’s insistence that the association between the mouse fusome and Golgi be clearly demonstrated. However, the EMA granule is a large structure discovered and defined by light microscopy, and presents no inherent challenge to documenting its Golgi association by immunofluorescence experiments, which we presented and have now further strengthened as described in the next paragraph. We believe that the suggested EM experiment would add little to the EM we already presented (Figure 1E, E'). Moreover, due to facility limitations, we are currently unable to perform immunogold staining.

      To strengthen previous immunolocalization experiments, we have now included additional immunostaining data showing the clear colocalization of the fusome region with the Golgi markers GM130 and GS28 (Figure S1H). We have also incorporated a new experiment using the Golgi-specific inhibitor Brefeldin A (BFA) (Figure S1I). Treatment of in vitro–cultured gonads with BFA disrupted EMA granule formation, demonstrating that EMA granules not only associate with Golgi, but require Golgi function to be maintained.

      Additionally, in Figure 2, we showed that the fusome overlaps with the peri-centriolar region, a characteristic locus for Golgi due to its movement on microtubules. We showed that the dynamic behavior of the fusome during the cell cycle parallels Golgi dispersal and reassembly, and all these facts provide further strong support for the Golgi association of the EMA granule and fusome.

      (2) Figure 1F: is this image compressed?

      We have now substituted the image in Figure 1F with a better image and have avoided the compression of the image. 

      (3) In the figure legends, are the sample sizes individual animals or individual sections? Please ensure that all figure legends for each figure panel consistently contain the sample size.

      We have now included the number of measurements (N) in every figure legend. Each experiment was performed using samples from at least three different animals, and in most cases from more than three. This information has also been added to the Methods section under Statistics. In addition, N values are now consistently provided for each graph throughout the figures.

      (4) Figure 2b/c: it seems likely, based on the snapshots of different stages of cytokinesis, that the "newly formed" visham is accurate, but without live imaging, this claim of "newly formed" is putative/speculative. It is OK if it is labeled as "putative" in the figure panel.

      The behavior of the Drosophila fusome during mitosis was deduced without live imaging (deCuevas et al. 1998). We clarified that the conversion of a single mouse germ cell with one round fusome to an interconnected pair of cells with two round fusomes of greater total volume following mitosis is the basis for deducing that new fusome formation occurs each cell cycle. However, we agree with the reviewer that the phrase "newly formed" in the original label on Figure 2c suggested a specific mechanism of fusome increase that was not intended and this phrase has been removed entirely.  

      (5) Figure 2e/e is extremely difficult to follow. In order to improve the readability of these figure panels, can individual panels with a single stain be shown? The 'gap' between YFP+ sister cells is not immediately obvious in panel e or e" with the current layout. Since this is a key aspect of the author's claim about cleavage of the cyst, it would be best to make this claim more robust by showing more convincing images. In Figure 2E, the staining pattern of EMA needs to be clarified and described more fully in the text.

      We mapped discontinuities in the microtubule connections, not the fusome or YFP. YFP is the lineage marker indicating that the cells of a single cyst are being studied. Consequently, no gap in YFP cytoplasmic expression is expected, because only in the last example (Figure 2E”) has fragmentation already occurred (and here there is a YFP gap). The acetylated tubulin gap precedes fragmentation. The mitotic spindle remnants labeled by AcTub link the cells into two groups separated by a gap, which is clearly shown in the data images and in the third column, where only the relevant AcTub from the cyst itself is shown. In response to the reviewer's question about the fusome, which is not directly relevant to fragmentation, we have now provided images of the separate fusome channel and corresponding measurements for all three Figure 2E-E'' cysts in the supplementary Figure S4H. We have improved the text regarding this important figure to make it easier to follow, and also added a new example of a 10-cell cyst in S2H (lower panels). We also added movies allowing full 3D study of one of the 8-cell cysts and the new 10-cell cyst. We also suggest that the reviewer examine how the deduced mechanism of fragmentation explains previously published but not fully understood data on cyst fragmentation going back to 1998, as described in the expanded Discussion on this topic.

      (6) It would be best to support the proposed model in Figure 2G (4+4+4) with microscopy images of a 12-cell or 16-cell cyst? Would these 12-cell or 16-cell cysts be too large to technically recover in a section?

      Unfortunately, the reviewer's suggestion that 12- or 16-cell cysts are too large to recover and present convincingly is correct. Because our analysis depends on capturing lineage-labeled cysts specifically at telophase with acetylated-tubulin connections, the likelihood of obtaining the correct stage is very low. In addition, the dense packing of germ cells in the mouse gonad further limits our ability to fully reconstruct all the cells in large cysts, with difficulty increasing as cyst size grows.

      However, as noted, we added a well-resolved 10-cell cyst—the largest size we could confidently analyze—in a 3D video in Supplementary Figure S2H (lower panel), which shows a 6 + 4 breakage pattern.

      (7) We did not find a reference in the text for Figure 2G.

      We have now provided a reference to Figure 2G in the text and in the Discussion section.

      (8) Line 189: ERAD is used as an acronym, but is not defined until the discussion.

      We have now provided the full form of the acronym at its first usage in the text.

      (9) Fig 3i/i': the increase of UPR pathway components, increasing expression during zygotene, is interesting to note, but is not commented enough in the text of the paper.

      We have discussed this issue in the discussion section with specific reference to figure 3I. Please find the detailed discussion under the heading “Germ cell rejuvenation is highly active during cyst formation.”

      (10) Please quantify DNMT3A expression levels in WT control vs Dazl KO germ cells in Figure 4a.

      We have now quantified DNMT3A expression levels in WT control vs Dazl KO germ cells and have added the data to Figure 4A.

      (11) Please introduce the rationale behind selecting DazL KO for studying cyst formation (text in line 197). This comes out of nowhere.

      True.  We significantly expanded our discussion of Dazl and citations of previous work, including evidence that it can affect cyst structures like ring canals, in the Introduction.  

      (12) It would be best to stain WT control vs DazL KO oogonia in Figure 4a with 5mC antibodies to support their claim that DNA methylation might be affected in the mutants.

      We respectfully disagree that this additional experiment is necessary within the scope of the current study. At the developmental stage examined (E12.5), germ cells in the Dazl mutant are clearly in an arrested and hypomethylated state, as supported by previous evidence (Haston et al. 2009). This initial experiment was designed to show that, in our hands, Dazl mutants show this known pluripotency delay. However, the effects of Dazl mutation on female germline cyst development as it relates to polarity or the fusome have not been studied before, and that is what the paper addresses, building on previous work.

      Because our study does not focus on germ-cell epigenetic modifications but rather on the consequences of Dazl loss on germ cell cyst development, adding 5mC immunostaining would not substantially advance the main conclusions. The existing data and previous published work already provide sufficient background.

      (13) Figure 4c: a very interesting figure, it would be best to quantify developmental pseudotime (perhaps using monocle3 analysis) and compare more rigorously the developmental stage of WT control vs DazL KO.

      Developmental pseudotime, such as through Monocle3 analysis, might sometimes be valuable but involves assumptions that, when possible, are better addressed by direct experimental examination. Our conclusions regarding cyst developmental stage are supported by straightforward evidence to which computational trajectory inference would add little. Specifically, we have analyzed germ-cell methylation state, ring canal formation, pluripotency markers, UPR pathway activity (Xbp1 and proteomic assays), Golgi stress and Pard3, which collectively document the developmental status of the WT and Dazl KO germ cells. These empirical data demonstrate the same developmental pattern reflected in Figure 4c, making the less reliable pseudotime-based computational method superfluous.

      (14) Figure 4d has two panels labeled as "d".

      We have now corrected the labelling of the figure

      (15) Color coding in 4d, d', d" is confusing; please harmonize some visual presentation here.

      We have now harmonized the visual representation of all the graph in figure 4

      (16) Fig 4e' is labeled as DazL +/- but is this really a typo?

      Thank you for pointing it out. We have now corrected the typo

      (17) Figure F': typo labeled as E3.5, which is E13.5?

      Thank you for pointing it out. We have now corrected the typo

      (18) Figure F': was DazL KO mutant but no WT control.

      The WT control was not provided to avoid the redundancy. Please refer to earlier figure 3A-B, Fig S3C and D and videos S3A and S3b to refer to WT control at every stage.

      (19) Figure G: unusual choice in punctuation marks for cartoon schematic. No key to guide the reader for color-coded structures would be helpful to have something similar to 4h.

      We have now provided the key to guide the readers in the mentioned figure 4G.

      (20) The authors use WGA and EMA as interchangeable markers (Figure 5a) without fully explaining why they have switched markers.

      Because it is germ cell specific, we used EMA as a fusome marker during the time when it is found up through E13.5.  After that point we used WGA which is still usable, but also labels somatic cells.  This rationale is explicitly described at the end of the section “Fusome is highly enriched in Golgi and vesicles”, where we state:

      “EMA staining disappears from germ cells at E14.5 (Figure 1I). However, very similar (but non–germ-cell-specific) staining continued with wheat germ agglutinin (WGA) at later stages (Figure 1G, G’; Figure S1G).”

      To ensure this is fully clear to readers, we have now added an additional statement in the start of the text section discussing the figure 5:

      “For the reasons explained previously (see text for Figure 1G), WGA was used as a fusome marker beyond stage E14.5.”

      (21) Figure 5b' is compressed.

      We have now decompressed the image

      (22) Line 267, Balbiani body is misspelled.  

      We have now corrected the spelling.

      (23) The explanation of why the authors switch focus from DazL KO to DazL +/- is not adequately described. The authors should also explain the phenotype of the DazL +/- animals or reference a paper citing the hets are sterile or subfertile.

      We have now added the explanation of why Dazl KO is used in our introduction section where we have mentioned the phenotype of Dazl homozygous and heterozygous mouse.

      (24) Is Figure 5i actually DazL +/-? It is not labeled clearly in the text, the figure legend, or the figure panel. 

      We have now labelled the figure correctly in figure and in the legend.

      (25) The paper ends abruptly at line 275 with no context or summary.

      The manuscript does not end at line 275; the apparent interruption is due to a page break occurring immediately before the beginning of the Discussion section. We hope that continuation is fully visible in the reviewer 1 (your) version of the PDF.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 93: Fig. 1B: DDX4 marks germ cells; do all the red and yellow cells in the NE inset originate from the same PGC? There are only 2 cells marked in yellow among the group of red cells. Is it a z-projection issue? Or do they come from different PGCs?

      This experiment used vasa staining to identify all germ cells, which are produced by multiple PGCs. Green labeling is a lineage marker derived from a single PGC (due to the low frequency of tamoxifen-activated labeling). Consequently, the two yellow cells observed in the NE inset of Fig. 1B represent YFP-labeled germ cells (YFP + DDX4 double-positive) that have arisen from a single, lineage-traced PGC. This approach, introduced in 2013, is described in the Methods, and represents the field's single largest technical advance that has made it possible to analyze mouse germ cell development at single cell resolution.

      To ensure clarity, we have added a brief explanatory note to the figure legend indicating that yellow cells represent the lineage-traced progeny of a single PGC, while the red staining marks all germ cells.

      (2) Line 96: Figure 1C vs 1C'. The difference between female and male Visham is not obvious, although quantification shows a clear difference. How was the quantification made? Manual or automatic thresholding? Would it be possible to show only the Visham channel?

      We thank the reviewer for pointing out this problem. We have now more clearly described in the text that the female fusome increases in some cells with close attachments to other cells (future oocytes) and decreases in distant nurse cells.  It branches due to rosette formation.  In males, the fusome remains much like the initial EMA granules present in early germ cells, with only fine and difficult-to-see connections.  The quantification shown in Figures 1C and 1C′ was performed manually, based on the presence of either (i) fused, branched EMA-positive fusome structures or (ii) dispersed, punctate EMA granules. This assessment was carried out across multiple E13.5 male and female gonad samples to ensure robustness.  To facilitate independent evaluation, we have already provided supplementary videos S3B1 and S3B2, which display the EMA-stained E13.5 male and female gonads in three dimensions. These videos allow the structural differences to be examined more clearly than in static images.

      In response to the reviewer’s request, we now additionally include the single-channel fusome image in Supplementary Figure S1E′. This presentation highlights the fusome signal alone and further clarifies the morphological differences underlying the quantification.

      (3) L118: Figure 2A, third row = 2-cell cyst? Please specify PCNT in the legend.

      We appreciate the reviewer’s observation. In Figure 2A (third row), the cells were not specifically labeled as a 2-cell cyst; rather, the intention was to illustrate the presence of two distinct centrosomes positioned on a fused fusome structure, a configuration we frequently observe.

      We have now updated the figure legend to explicitly define PCNT.

      (4) L169: Missing reference to S3B and video S3B1?

      We have now included the reference to S3B1 and S3B2 in the text and in the legend

      (5) L170: Please describe the graph in the Figure 3D legend.

      We have now described the Graph in the legend

      (6) L171: Would it be possible to have a close-up showing both Pard3 and Visham in a ringlike pattern related to RACGAP (RC) staining? The images are too small.

      It is difficult to capture this relationship perfectly in a two dimensional picture. The images represent the maximum close-up possible that still includes enough relevant area for the necessary conclusions. We have now provided additional three close-up images exclusively for ring-canal and Pard3 association in the supplementary Figure S3C for further clarity. However, we also note that the quality of the image permits the reader of a pdf to zoom and to visualize the images in great detail.

      (7) L181: Wrong reference, should be 3 then 3I.

      Thank you for pointing it out, we have now corrected the reference.

      (8) L199: In Figure S4B, was DNMT3 staining quantified? Red intensity differs globally between images; use the somatic red level as a reference? Note: EMA seems higher in Dazl- vs. WT?

      We have now performed quantification of DNMT3 staining, which is presented in Figure 4A. While the red intensity (DNMT3 or EMA) can appear to differ between images, this variation can result from biological differences between tissues or minor technical variability despite using consistent microscope settings. To account for this, we normalized the staining intensity using the somatic cell signal as an internal reference, ensuring that the quantification reflects genuine differences between WT and Dazl-/- samples rather than global intensity variation.

      (9) L229: Should be "proteasome."

      We have now corrected the spelling error.

      (10) L233: Quantify fragmentation of Gs28? EMA doesn't seem affected. Could you quantify both Gs28 and EMA? Images are too small.

      We thank the reviewer for this suggestion. While the current images are small, they can be examined in detail using zoom to visualize the structures clearly. We agree that EMA staining is not affected, as the cells are in an arrested state. This arrested state creates stress on the Golgi. The fragmentation of Gs28-labeled Golgi membranes is a classical indicator of Golgi stress, even though the fragmented membranes may remain functionally active. Our results show that Dazl deletion specifically affects the Golgi in germ cells, while the Golgi in neighboring somatic cells appears healthy. To quantify this effect, we have now included manual quantification of Golgi fragmentation in Figure 4F, assessing tissues for the presence of fragmented versus intact Golgi structures. This confirms that Golgi fragmentation is a germ cell–specific phenotype in Dazl– samples, while pre-formed EMA-positive fusomes remain unaffected but are probably in an arrested state.

      (11) L237: Figure 4F graph shows E3.5, not E13.5.

      We have now corrected the typo in the figure 

      (12) L257: Figure 5D: quantify as in 5A? overlap?

      Yes, it is an overlap, shown as two separate images with the ring canal for better clarity. We have now quantified the image and produced a combined graph for fusome and Pard3 in the Figure 5A graph.

      (13) L261: Figure 5E-E': black arrowhead not mentioned in legend.

      We have now mentioned the black arrowhead in the legend

      (14) L262: Figure 5C: arrowhead not mentioned in legend. Figure 5F: oocyte appears separated from nurse cells compared to 5C.

      Yes, that may happen as cysts undergo fragmentation; what matters is all cells are lineage labelled and hence are members of a single cyst derived from one PGC.

      (15) L263: Figure 5G has no legend reference; nurse cells are not outlined as in 5C.

      We have now outlined the nurse cells and have added the reference to the graph in the legend.

      (16) L279: "The fusome and Visham and both..." should be replaced with "Both fusome and Visham...".

      We have now replaced the term Visham with fusome as suggested by reviewers and editor.  We updated the statement to correct the grammatical error.

      (17) L1127: Video S3B1: It is unclear what to focus on.

      We have now added the Rectangle area and arrow to highlight what to focus on

      (18) L1128: Video "S3B1" should be "S3B2."

      We have now corrected the legend

      (19) Finally: curiosity question: have the authors tried to use known markers of the Drosophila fusome in mice, such as Spectrin or other markers described in Lighthouse, Buszczak and Spradling, Dev Bio, 2008? And conversely, do EMA and WGA label the fusome in Drosophila?

      Yes, we and others used the most specific markers of the Drosophila fusome, such as alpha-spectrin, adducin-like Hts, tropomodulin, etc., to search for fusomes in vertebrate species. This was unsuccessful in clarifying the situation, because Hts and alpha-spectrin in Drosophila and other insects generate a protein skeleton that stabilizes the fusome and is easily stained, but this structure is simply not conserved in vertebrates. The polarity behavior of the fusome, its core developmental property, is conserved, however. The mammalian fusome still acquires and maintains cyst polarity, and goes even farther and reflects both initial cyst formation and cyst cleavage, before marking oocyte vs nurse cell development in the smaller cysts.  Expression of the inner microtubule-rich portion of the fusome, its Par proteins, and many ER-related and lysosomal fusome proteins is mostly conserved, but their ability to mark the fusome alone varies with time and context (only some of the examples are shown in Figure 3I'). Nearly all of the proteins identified in Lighthouse et al. 2008 are expressed.  These proteins may be involved in rejuvenation as studied here.  We modified the first section of the Discussion to explicitly compare mouse, Xenopus and Drosophila fusomes, which was not possible before this work.  

      Reviewer #3 (Recommendations for the authors):

      The authors should either revise the conclusions or add additional evidence to support their claims. In addition, minor corrections are listed below.

      We have added additional evidence as noted in responses above, and revised some claims that were stated inaccurately.  In addition, we have attempted to clarify the evidence we do present, so that its full significance is more easily grasped by readers.    

      (1) Lines 20-21 are unclear - the cyst doesn't get sent into meiosis, each oocyte does.

      Research is showing that it's more complicated than that.  All cyst cells enter "pre-meiotic S phase", and most cell cycles are conventionally considered to start after the previous M phase, i.e. in G1 or S, not in the next prophase, an ancient view limited just to meiosis. Absent this old tradition from meiosis cytology, pre-meiotic S would just be called meiotic S as some workers on meiosis do.  In addition, in different species, nurse cells diverge from meiosis on different schedules, including many much later in the meiotic cycle.  Two cyst cells in Drosophila fully enter meiosis by all criteria, the oocyte and one nurse cell that only exits in late zygotene.  In Xenopus and mouse, scRNAseq shows that many cyst cells enter meiosis up to leptotene and zygotene, including nurse cells that specifically downregulate meiotic genes during this time, possibly to assist their nurse cell functions, while others remain in meiosis even longer (Davidian and Spradling, 2025; Niu and Spradling, 2022). Eventually, only the oocytes within each fragmented mouse cyst complete meiosis. 

      (2) Many places in the manuscript abbreviations are never defined or not defined the first time they are used (but the second or third time): Line 23-ER, Line 29-UPR, Line 33-PGC (not defined until line 45), Line 79-EGAD.

      We have defined full acronyms now upon their first occurrence.

      (3) Line 5 should be the pachytene substage of meiosis I.

      We have now updated the statement to “In pachytene stage of meiosis I…”

      (4) Line 59-61 - this statement needs a reference(s).

      These statements are a continuation from the references cited in the previous statements. However, for further clarity we have again cited the relevant reference here (Niu and Spradling, 2022).

      (5) Line 80 - should it be oocyte proteome quality control?

      We have now updated the statement to “Oocyte proteome quality control begins early”.

      (6) Line 87 - in this case, EMA does not stand for epithelial membrane antigen (AI will call it that, but it is not correct). I believe it originally was the abbrev for (Em)bryonic (a)ntigen, though some papers call it (e)mbryonic (m)ouse (a)ntigen. And the reference here is Hahnel and Eddy, 1986, but in the reference list is a different paper, 1987 (both refer to EMA-1).

      We have now updated the acronym EMA-1 in corrected form and have corrected the citation.

      (7) Line 176 - RNA seq.

      We have now updated the statement to “We performed single cell RNA sequencing (scRNA seq) of mouse gonad”.

      (8) Line 181 - Figure 4E and 4I should be 3E and 3I.

      We have now updated the figure reference in the text to correct one.

      (9) Line 183 - missing period.

      Added.

    1. Author response:

      The following is the authors’ response to the previous reviews

      eLife Assessment

      This valuable study combines a computational language model, i.e., HM-LSTM, and temporal response function (TRF) modeling to quantify the neural encoding of hierarchical linguistic information in speech, and addresses how hearing impairment affects neural encoding of speech. The analysis has been significantly improved during the revision but remain somewhat incomplete - The TRF analysis should be more clearly described and controlled. The study is of potential interest to audiologists and researchers who are interested in the neural encoding of speech.

      We thank the editors for the updated assessment. In the revised manuscript, we have added a more detailed description of the TRF analysis on p. of the revised manuscript. We have also updated Figure 1 to better visualize the analysis pipeline. Additionally, we have included a supplementary video to illustrate the architecture of the HM-LSTM model, the ridge regression methods using the model-derived features, and the mTRF analysis using the acoustic envelope and the binary rate models.

      Public Reviews:

      Reviewer #1 (Public review):

      About R squared in the plots:

      The authors have used a z-scored R squared in the main ridge regression plots. While this may be interpretable, it seems non-standard and overly complicated. The authors could use a simple Pearson r to be most direct and informative (and in line with similar work, including Goldstein et al. 2022 which they mentioned). This way the sign of the relationships is preserved.

      We did not use Pearson’s r as in Goldstein et al. (2022) because our analysis did not involve a train-test split, which was a key aspect of their approach. Specifically, Goldstein et al. (2022) divided their data into training and testing sets, trained a ridge regression model on the training set, and then used the trained model to predict neural responses on the test set. They calculated Pearson’s r to assess the correlation between the predicted and observed neural responses, making the correlation coefficient (r) their primary measure of model performance. In contrast, our analysis focused on computing the model fitting performance (R²) of the ridge regression model for each sensor and time point for each subject. At the group level, we conducted one-sample t-tests with spatiotemporal cluster-based correction on the R² values to identify sensors and time windows where R² values were significantly greater than baseline. We established the baseline by normalizing the R² values using Fisher z-transformation across sensors within each subject. We have added this explanation on p.13 of the revised manuscript.
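
      As a minimal sketch of the per-sensor, per-time-point model fit described above (not the authors' code), the following computes an in-sample R² for a closed-form ridge regression with no train-test split. The array sizes other than the 143 sentences, the penalty `alpha`, and the random data are illustrative assumptions; the normalization and cluster-based statistics are not reproduced here.

```python
import numpy as np

def ridge_r2(X, y, alpha=1.0):
    """In-sample R^2 of a ridge regression fitted in closed form (no train-test split)."""
    Xc = X - X.mean(axis=0)          # center predictors
    yc = y - y.mean()                # center response
    n_features = Xc.shape[1]
    # Ridge solution: (X'X + alpha * I)^-1 X'y
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    residual = yc - Xc @ beta
    return 1.0 - (residual @ residual) / (yc @ yc)

# Toy data: the feature, sensor and time-point counts are assumptions for illustration.
rng = np.random.default_rng(0)
n_sentences, n_features, n_sensors, n_times = 143, 20, 4, 9
X = rng.standard_normal((n_sentences, n_features))            # model-derived features
eeg = rng.standard_normal((n_sentences, n_sensors, n_times))  # EEG amplitude per epoch

# One R^2 value per sensor and time point, as in a sensor-by-time map for one subject.
r2 = np.array([[ridge_r2(X, eeg[:, s, t]) for t in range(n_times)]
               for s in range(n_sensors)])
```

      Because the fit is evaluated on the same data it was estimated from, R² here measures goodness of fit rather than out-of-sample prediction, which is the distinction from the Pearson's r approach drawn in the response.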

      About the new TRF analysis:

      The new TRF analysis is a necessary addition and much appreciated. However, it is missing the results for the acoustic regressors, which should be there analogous to the HM-LSTM ridge analysis. The authors should also specify which software they have utilized to conduct the new TRF analysis. It also seems that the linguistic predictors/regressors have been newly constructed in a way more consistent with previous literature (instead of using the HM-LSTM features); these specifics should also be included in the manuscript (did it come from Montreal Forced Aligner, etc.?). Now that the original HM-LSTM can be compared to a more standard TRF analysis, it is apparent that the results are similar.

      We used the Python package Eelbrain (https://eelbrain.readthedocs.io/en/r0.39/auto_examples/temporal-response-functions/trf_intro.html) to conduct the multivariate temporal response function (mTRF) analyses. As we previously explained in our response to R3, we did not apply mTRF to the acoustic features due to the high dimensionality of the input. Specifically, our acoustic representation consists of a 130-dimensional vector sampled every 10 ms throughout the speech stimuli (comprising a 129-dimensional spectrogram and a 1-dimensional amplitude envelope). This would make the 130-dimensional TRF estimates difficult to interpret. A similar constraint applied to the hidden-layer activations from our HM-LSTM model for the five linguistic features. After dimensionality reduction via PCA, each still resulted in 150-dimensional vectors. To address this, we instead used binary predictors marking the offset of each linguistic unit (phoneme, syllable, word, phrase, sentence). Since our speech stimuli were computer-synthesized, the phoneme and syllable boundaries were automatically generated. The word boundaries were manually annotated by a native Mandarin speaker as in Li et al. (2022). The phrase boundaries were automatically annotated by the Stanford parser and manually checked by a native Mandarin speaker. These rate models are represented as five distinct binary time series, each aligned with the timing of the corresponding linguistic unit, making them well-suited for mTRF analysis. Although the TRF results from the 1-dimensional rate predictors and the ridge regression results from the high-dimensional HM-LSTM-derived features are similar, they encode different things: the rate regressors only encode the timing of linguistic unit boundaries, while the model-derived features encode the representational content of the linguistic input. Therefore, we do not consider the mTRF analyses to be analogous to the ridge regression analyses. Rather, these results complement each other, and both are informative about the neural tracking of linguistic structures at different levels for the attended and unattended speech.
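
      To illustrate what a binary rate predictor of this kind looks like, the sketch below builds one time series per linguistic level on a 10 ms grid, with a 1 at each boundary and 0 elsewhere. The function name `boundary_regressor` and the boundary times are hypothetical and chosen only for illustration; this is not the authors' pipeline.

```python
import numpy as np

def boundary_regressor(boundary_times_s, duration_s, step_s=0.010):
    """Binary time series on a fixed grid: 1 at each boundary sample, 0 elsewhere."""
    n_samples = int(round(duration_s / step_s))
    regressor = np.zeros(n_samples, dtype=np.int8)
    idx = np.round(np.asarray(boundary_times_s) / step_s).astype(int)
    idx = idx[(idx >= 0) & (idx < n_samples)]   # drop boundaries outside the stimulus
    regressor[idx] = 1
    return regressor

# Hypothetical boundary times (seconds) for a 2-second stimulus.
boundaries = {
    "phoneme": [0.08, 0.21, 0.35, 0.52, 0.70, 0.95, 1.20, 1.48, 1.75],
    "word": [0.21, 0.52, 0.95, 1.75],
    "sentence": [0.95, 1.75],
}
regressors = {level: boundary_regressor(times, duration_s=2.0)
              for level, times in boundaries.items()}
```

      Each series carries only the timing of unit boundaries, which is why such predictors are one-dimensional and straightforward to use in an mTRF, in contrast to the 150-dimensional model-derived features.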

      Since the TRF result for the continuous acoustic features also concerns R², we have added an mTRF analysis where we fitted the one-dimensional speech envelope to the EEG. We extracted the envelope at 10 ms intervals for both attended and unattended speech and computed mTRFs independently for each subject and sensor using a basis of 50 ms Hamming windows spanning –100 ms to 300 ms relative to envelope onset. The results showed that in hearing-impaired participants, attended speech elicited a significant cluster in the bilateral temporal regions from 270 to 300 ms post-onset (t = 2.40, p = 0.01, Cohen’s d = 0.63). Unattended speech elicited an early cluster in right temporal and occipital regions from –100 ms to –80 ms (t = 3.07, p = 0.001, d = 0.83). Normal-hearing participants showed significant envelope tracking in the left temporal region at 280–300 ms after envelope onset (t = 2.37, p = 0.037, d = 0.48), with no significant cluster for unattended speech. These results further suggest that hearing-impaired listeners may have difficulty suppressing unattended streams. We have added the new TRF results for the envelope to Figure S3, to the “mTRF results for attended and unattended speech” section on p.7, and to the “mTRF analysis” section in Materials and Methods of the revised manuscript.

      The authors' wording about this suggests that these new regressors have a nonzero sample at each linguistic event's offset, not onset. This should also be clarified. As the authors know, the onset would be more standard, and using the offset has implications for understanding the timing of the TRFs, as a phoneme has a different duration than a word, which has a different duration from a sentence, etc.

      In our rate‐model mTRF analyses, we initially labelled linguistic boundaries as “offsets” because our ridge‐regression with HM-LSTM features was aligned to sentence offsets rather than onsets. However, since each offset coincides with the next unit’s onset—and our regressors simply mark these transition points as 1—the “offset” and “onset” models yield identical mTRFs. To avoid confusion, we have relabeled “offset” as “boundary” in Figure S2.

      As discussed in our prior responses, this design was based on the structure of our input to the HM-LSTM model, where each input consists of a pair of sentences encoded in phonemes, such as “t a_1 n əŋ_2 f ei_1 <sep> zh ə_4 sh iii_4 f ei_1 j ii_1” (“It can fly <sep> This is an airplane”). The two sentences are separated by a special <sep> token, and the model’s objective is to determine whether the second sentence follows the first, similar to a next-sentence prediction task. Since the model processes both sentences in full before making a prediction, the neural activations of interest should correspond to the point at which the entire sentence has been processed by humans. To enable a fair comparison between the model’s internal representations and brain responses, we aligned our neural analyses with the sentence offsets, capturing the time window after the sentence has been fully perceived by the participant. Thus, we extracted epochs from -100 to +300 ms relative to each sentence offset, consistent with our model-informed design.

      We understand that phonemes, syllables, words, phrases, and sentences differ in their durations. However, the five hidden activity vectors extracted from the model are designed to capture the representations of these five linguistic levels across the entire sentence. Specifically, for a sentence pair such as “It can fly <sep> This is an airplane,” the first 2048-dimensional vector represents all the phonemes in the two sentences (“t a_1 n əŋ_2 f ei_1 <sep> zh ə_4 sh iii_4 f ei_1 j ii_1”), the second vector captures all the syllables (“ta_1 nəŋ_2 fei_1 <sep> zhə_4 shiii_4 fei_1jii_1”), the third vector represents all the words, the fourth vector captures the phrases, and the fifth vector represents the sentence-level meaning. In our dataset, input pairs consist of adjacent sentences from the stimuli (e.g., Sentence 1 and Sentence 2, Sentence 2 and Sentence 3, and so on), and for each pair, the model generates five 2048-dimensional vectors, each corresponding to a specific linguistic level. To identify the neural correlates of these model-derived features—each intended to represent the full linguistic level across a complete sentence—we focused on the EEG signal surrounding the completion of the second sentence rather than on incremental processing. Accordingly, we extracted epochs from -100 ms to +300 ms relative to the offset of the second sentence and performed ridge regression analyses using the five model features (reduced to 150 dimensions via PCA) at every 50 ms across the epoch. We have added this clarification on p.12 of the revised manuscript.
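
      The epoching scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the sampling rate `sfreq`, the sentence offset times, the random EEG array, and the helper name `epochs_around_offsets` are all hypothetical. Sampling the −100 ms to +300 ms window every 50 ms gives nine analysis time points.

```python
import numpy as np

def epochs_around_offsets(eeg, offsets_s, sfreq, tmin=-0.1, tmax=0.3):
    """Cut windows of shape (n_epochs, n_sensors, n_samples) around each sentence offset."""
    start = int(round(tmin * sfreq))
    stop = int(round(tmax * sfreq)) + 1          # include the tmax sample
    epochs = []
    for t in offsets_s:
        center = int(round(t * sfreq))
        if center + start < 0 or center + stop > eeg.shape[1]:
            continue                              # skip windows that run past the recording
        epochs.append(eeg[:, center + start:center + stop])
    return np.stack(epochs)

sfreq = 100.0                                     # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
eeg = rng.standard_normal((4, 3000))              # 4 sensors, 30 s of toy data
offsets_s = [2.5, 6.1, 9.8, 29.95]                # hypothetical offsets; the last is too late
epochs = epochs_around_offsets(eeg, offsets_s, sfreq)

# Analysis time points every 50 ms across the epoch, and their sample indices.
analysis_times_ms = np.arange(-100, 301, 50)
analysis_idx = np.round((analysis_times_ms / 1000.0 + 0.1) * sfreq).astype(int)
eeg_at_times = epochs[:, :, analysis_idx]         # (n_epochs, n_sensors, 9)
```

      A separate regression of the EEG amplitude on the model features would then be run at each of the nine columns of `eeg_at_times`, one sensor at a time.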

      About offsets:

      TRFs can still be interpretable using the offset timings though; however, the main original analysis seems to be utilizing the offset times in a different, more confusing way. The authors still seem to be saying that only the peri-offset time of the EEG was analyzed at all, meaning the vast majority of the EEG trial durations do not factor into the main HM-LSTM response results whatsoever. The way the authors describe this does not seem to be present in any other literature, including the papers that they cite. Therefore, much more clarification on this issue is needed. If the authors mean that the regressors are simply time-locked to the EEG by aligning their offsets (rather than their onsets, because they have varying onsets or some such experimental design complexity), then this would be fine. But it does not seem to be what the authors want to say. This may be a miscommunication about the methods, or the authors may have actually only analyzed a small portion of the data. Either way, this should be clarified to be able to be interpretable.

      We hope that our response in RE4, along with the supplementary video, has helped clarify this issue. We acknowledge that prior studies have not used EEG data surrounding sentence offsets to examine neural responses at the phoneme or syllable levels. However, this is largely due to a lack of models that represent all linguistic levels across an entire sentence. There is abundant work comparing model predictors with neural data time-locked to offsets because they mark the point at which participants have already processed the relevant information (Brennan, 2016; Brennan et al., 2016; Gwilliams et al., 2024, 2025). Similarly, in our model–brain alignment study, our goal is to identify neural correlates for each model-derived feature. If we correlated model activity with EEG data aligned to sentence onsets, we would be examining linguistic representations at all levels (from phoneme to sentence) of the whole sentence at a time when participants have not yet heard the sentence. Although this limits our analysis to a subset of the data (143 sentences × 400 ms windows × 4 conditions), it targets the exact moment when full-sentence representations emerge against background speech, allowing us to map each model-derived feature onto its neural signature. We have added this clarification on p.12 of the revised manuscript.

      Reviewer #2 (Public review):

      This study presents a valuable finding on the neural encoding of speech in listeners with normal hearing and hearing impairment, uncovering marked differences in how attention to different levels of speech information is allocated, especially when having to selectively attend to one speaker while ignoring an irrelevant speaker. The results overall support the claims of the authors, although a more explicit behavioural task to demonstrate successful attention allocation would have strengthened the study. Importantly, the use of more "temporally continuous" analysis frameworks could have provided a better methodology to assess the entire time course of neural activity during speech listening. Despite these limitations, this interesting work will be useful to the hearing impairment and speech processing research community. The study compares speech-in-quiet vs. multi-talker scenarios, allowing to assess within-participant the impact that the addition of a competing talker has on the neural tracking of speech. Moreover, the inclusion of a population with hearing loss is useful to disentangle the effects of attention orienting and hearing ability. The diagnosis of high-frequency hearing loss was done as part of the experimental procedure by professional audiologists, leading to a high control of the main contrast of interest for the experiment. Sample size was big, allowing to draw meaningful comparisons between the two populations.

      We thank you very much for your appreciation of our research. We have now added a more detailed description of the mTRF analyses on p.13-14 of the revised manuscript.

      An HM-LSTM model was employed to jointly extract speech features spanning from the stimulus acoustics to word-level and phrase-level information, represented by embeddings extracted at successive layers of the model. The model was specifically expanded to include lower level acoustic and phonetic information, reaching a good representation of all intermediate levels of speech. Despite conveniently extracting all features jointly, the HM-LSTM model processes linguistic input sentence-by-sentence, and therefore only allows the corresponding EEG data to be assessed at sentence offset. If I understood correctly, while the sentence information extracted with the HM-LSTM reflects the entire sentence - in terms of its acoustic, phonetic and more abstract linguistic features - it only gives a condensed final representation of the sentence. As such, feature extraction with the HM-LSTM is not compatible with a continuous temporal mapping on the EEG signal, and this is the main reason behind the authors' decision to fit a regression at nine separate time points surrounding sentence offsets.

      Yes, you are correct. As explained in RE4, the model generates five hidden-layer activity vectors, each intended to represent all the phonemes, syllables, words, phrases within the entire sentence (“a condensed final representation”). This is the primary reason we extract EEG data surrounding the sentence offsets—this time point reflects when the full sentence has been processed by the human brain. We assume that even at this stage, residual neural responses corresponding to each linguistic level are still present and can be meaningfully analyzed.

      While valid and previously used in the literature, this methodology, in the particular context of this experiment, might be obscuring important attentional effects impacted by hearing-loss. By fitting a regression only around sentence-final speech representations, the method might be overlooking the more "online" speech processing dynamics, and only assessing the permanence of information at different speech levels at sentence offset. In other words, the acoustic attentional bias between Attended and Unattended speech might exist even in hearing-impaired participants but, due to a lower encoding or permanence of acoustic information in this population, it might only emerge when using methodologies with a higher temporal resolution, such as Temporal Response Functions (TRFs). If a univariate TRF fit simply on the continuous speech envelope did not show any attentional bias (different trial lengths should not be a problem for fitting TRFs), I would be entirely convinced of the result. For now, I am unsure on how to interpret this finding.

      We agree, and in the prior revision we added mTRF results using the rate models for the 5 linguistic levels. The rate model aligns with the boundaries of each linguistic unit at each level. As explained in RE3, the rate regressors encode the timing of linguistic unit boundaries, while the model-derived features encode the representational content of the linguistic input. The mTRF results showed similar patterns to those observed using features from our HM-LSTM model with ridge regression (see Figure S2). The two analyses complement each other, and both inform how linguistic structures at different levels are neurally tracked for attended and unattended speech.

      We have also added TRF results obtained by fitting the envelope of attended and unattended speech, sampled every 10 ms, to the whole 10-minute EEG recording. Our results showed that in hearing-impaired participants, attended speech elicited a significant cluster in the bilateral temporal regions from 270 to 300 ms post-onset (t = 2.40, p = 0.01, Cohen’s d = 0.63). Unattended speech elicited an early cluster in right temporal and occipital regions from –100 ms to –80 ms (t = 3.07, p = 0.001, d = 0.83). Normal-hearing participants showed significant envelope tracking in the left temporal region at 280–300 ms after envelope onset (t = 2.37, p = 0.037, d = 0.48), with no significant cluster for unattended speech. These results further suggest that hearing-impaired listeners may have difficulty suppressing unattended streams. We have added the new TRF results for envelope to Figure S3 and the “mTRF results for attended and unattended speech” on p.7 and the “mTRF analysis” in Material and Methods of the revised manuscript.
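
As an illustration of what an envelope TRF estimates, the following is a minimal sketch on simulated data. It uses a plain ridge-regularized lagged regression, which is an assumption made for brevity and not the estimator the authors ran through Eelbrain. The sampling rate, the ridge penalty `alpha`, the simulated 280 ms delay, and all variable names are hypothetical; only the 10 ms lag step and the -100 to 300 ms lag range echo the text.

```python
import numpy as np

def fit_trf(stimulus, response, lags, alpha=1.0):
    """Ridge TRF: response[t] ~ sum_k w[k] * stimulus[t - lags[k]]."""
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for k, lag in enumerate(lags):
        lag = int(lag)
        if lag >= 0:
            # Positive lag: the response follows the stimulus.
            X[lag:, k] = stimulus[:n - lag]
        else:
            # Negative lag: the response precedes the stimulus.
            X[:lag, k] = stimulus[-lag:]
    # Closed-form ridge solution (X'X + alpha*I)^-1 X'y
    return np.linalg.solve(X.T @ X + alpha * np.eye(len(lags)), X.T @ response)

sfreq = 100                      # Hz, i.e. one sample every 10 ms (assumed)
lags = np.arange(-10, 31)        # -100 ms to +300 ms in 10 ms steps
rng = np.random.default_rng(0)
n_samples = 60 * sfreq           # one minute of toy data

# Simulated envelope and a single simulated EEG channel that echoes it
# 280 ms later, buried in noise. None of this is real data.
envelope = np.abs(rng.standard_normal(n_samples))
envelope -= envelope.mean()
true_delay = 28
eeg = rng.standard_normal(n_samples)
eeg[true_delay:] += 0.5 * envelope[:-true_delay]

weights = fit_trf(envelope, eeg, lags)
peak_lag_ms = int(lags[np.argmax(weights)]) * 1000 // sfreq
print(peak_lag_ms)
```

In this toy setting the largest TRF weight falls at the simulated delay. A real analysis would add cross-validated regularization and cluster-based statistics across sensors and participants, neither of which is shown here.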

      Despite my doubts on the appropriateness of condensed speech representations and single-point regression for acoustic features in particular, the current methodology allows the authors to explore their research questions, and the results support their conclusions. This work presents an interesting finding on the limits of attentional bias in a cocktail-party scenario, suggesting that fundamentally different neural attentional filters are employed by listeners with high-frequency hearing loss, even in terms of the tracking of speech acoustics. Moreover, the rich dataset collected by the authors is a great contribution to open science and will offer opportunities for re-analysis.

      We sincerely thank you again for your encouraging comments regarding the impact of our study.

      Reviewer #3 (Public review):

      Summary:

      The authors aimed to investigate how the brain processes different linguistic units (from phonemes to sentences) in challenging listening conditions, such as multi-talker environments, and how this processing differs between individuals with normal hearing and those with hearing impairments. Using a hierarchical language model and EEG data, they sought to understand the neural underpinnings of speech comprehension at various temporal scales and identify specific challenges that hearing-impaired listeners face in noisy settings.

      Strengths:

      Overall, the combination of computational modeling, detailed EEG analysis, and comprehensive experimental design thoroughly investigates the neural mechanisms underlying speech comprehension in complex auditory environments. The use of a hierarchical language model (HM-LSTM) offers a data-driven approach to dissect and analyze linguistic information at multiple temporal scales (phoneme, syllable, word, phrase, and sentence). This model allows for a comprehensive neural encoding examination of how different levels of linguistic processing are represented in the brain. The study includes both single-talker and multi-talker conditions, as well as participants with normal hearing and those with hearing impairments. This design provides a robust framework for comparing neural processing across different listening scenarios and groups.

      Weaknesses:

      The analyses heavily rely on one specific computational model, which limits the robustness of the findings. The use of a single DNN-based hierarchical model to represent linguistic information, while innovative, may not capture the full range of neural coding present in different populations. A low-accuracy regression model-fit does not necessarily indicate the absence of neural coding for a specific type of information. The DNN model represents information in a manner constrained by its architecture and training objectives, which might fit one population better than another without proving the non-existence of such information in the other group. It is also not entirely clear if the DNN model used in this study effectively serves the authors' goal of capturing different linguistic information at various layers. More quantitative metrics on acoustic/linguistic-related downstream tasks, such as speaker identification and phoneme/syllable/word recognition based on these intermediate layers, can better characterize the capacity of the DNN model.

      We agree that, before aligning model representations with neural data, it is essential to confirm that the model encodes linguistic information at multiple hierarchical levels. This is the purpose of our validation analysis: We evaluated the model’s representations across five layers using a test set of 20 four-syllable sentences in which every syllable shares the same vowel, e.g., “mā ma mà mǎ” (mother scolds horse), “shū shu shǔ shù” (uncle counts numbers; see Table S1). We hypothesized that the activity in the phoneme and syllable layers would be more similar than in other layers for same-vowel sentences. The results confirmed our hypothesis: Hidden-layer activity for same-vowel sentences exhibited much more similar distributions at the phoneme and syllable levels compared to those at the word, phrase and sentence levels. Figure 3C displays the scatter plot of the model activity at the five linguistic levels for each of the 20 four-syllable sentences after dimensionality reduction using multidimensional scaling (MDS). We used color-coding to represent the activity of the five hidden layers. Each dot on the plot corresponds to one test sentence. Only phonemes are labeled because each syllable in our test sentences contains the same vowels (see Table S1). The plot reveals that model representations at the phoneme and syllable levels are more dispersed for each sentence, while representations at the higher linguistic levels (word, phrase, and sentence) are more centralized. Additionally, similar phonemes tend to cluster together across the phoneme and syllable layers, indicating that the model captures a greater amount of information at these levels when the phonemes within the sentences are similar.
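
The dimensionality-reduction step can be sketched as follows. The response does not say which MDS variant was used, so classical (Torgerson) MDS is assumed here. The activations are random placeholders; only the count of 20 test sentences and the 150-dimensional size (mentioned elsewhere in the response) are borrowed from the text.

```python
import numpy as np

def classical_mds(X, n_components=2):
    """Classical (Torgerson) MDS from Euclidean distances between rows of X."""
    n = len(X)
    # Squared pairwise Euclidean distances.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Double centering turns squared distances into a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error.
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# Hypothetical hidden-layer activations: 20 test sentences x 150 dimensions.
rng = np.random.default_rng(0)
layer_activity = rng.standard_normal((20, 150))

coords = classical_mds(layer_activity)   # one 2-D point per sentence
print(coords.shape)
```

Running the same projection once per layer and comparing how tightly the points cluster would give a rough numerical counterpart to the visual comparison in Figure 3C.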

      Apart from the DNN model, we also included the rate models, which simply mark 1 at each unit boundary across the 5 levels. We performed mTRF analyses with these rate models and found similar patterns to our ridge‐regression results with the DNN (see Figure S2). This provides further evidence that the model reliably captures information across all five hierarchical levels.

      Since EEG measures underlying neural activity in near real-time, it is expected that lower-level acoustic information, which is relatively transient, such as phonemes and syllables, would be distributed throughout the time course of the entire sentence. It is not evident if this limited time window effectively captures the neural responses to the entire sentence, especially for lower-level linguistic features. A more comprehensive analysis covering the entire time course of the sentence, or at least a longer temporal window, would provide a clearer understanding of how different linguistic units are processed over time.

      We agree that lower-level linguistic features may be distributed throughout the whole sentence. However, using the entire sentence duration was not feasible, as the sentences in the stimuli vary in length, making statistical analysis challenging. Additionally, since the stimuli consist of continuous speech, extending the time window would risk including linguistic units from subsequent sentences. This would introduce ambiguity as to whether the EEG responses correspond to the current or the following sentence. Moreover, our model activity represents a “condensed final representation” at the five linguistic levels for the whole sentence, rather than one built up incrementally during the sentence. We think the -100 to 300 ms time window relative to each sentence offset targets the exact moment when the full sentence is comprehended and a “condensed final representation” of the whole sentence across the five linguistic levels has been formed in the brain. We have added this clarification on p.13 of the revised manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Here are some specifics and clarifications of my public review:

      Initially I was interpreting the R squared as a continuous measure of predicted EEG relative to actual EEG, based on an encoding model, but this does not appear to be correct. Thank you for pointing out that the y axis is z-scored R squared in your main ridge regression plots. However, I am not sure why/how you chose to represent this that way. It seems to me that a simple Pearson r would be most informative here (and in line with similar work, including Goldstein et al. 2022 that you mentioned). That way you preserve the sign of the relationships between the regressors and the EEG. With R squared, we have a different interpretation, which is maybe also ok, but I also don't see the point of z-scoring R squared. Another possibility is that when you say "z-transformed" you are referring to the Fisher transformation; is that the case? In the plots you say "normalized", so that sounds like a z-score, but this needs to be clarified; as I say, a simple Pearson r would probably be best.

      We did not use Pearson’s r, as in Goldstein et al. (2022), because our analysis did not involve a train-test split, which was central to their approach. In their study, the data were divided into training and testing sets, and a ridge regression model was trained on the training set. They then used the trained model to predict neural responses on the held-out test set, and calculated Pearson’s r to assess the correlation between the predicted and observed neural responses. As a result, their final metric of model performance was the correlation coefficient (r). In contrast, our analysis is more aligned with standard temporal response function (TRF) approaches. We did not perform a train-test split; instead, we computed the model fitting performance (R²) of the ridge regression model at each sensor and time point for each subject. At the group level, we conducted one-sample t-tests with spatiotemporal cluster-based correction on the R² values to determine which sensors and time windows showed significantly greater R² values than baseline. To establish a baseline, we z-scored the R² values across sensors and time points, effectively centering the distribution around zero. This normalization allowed us to interpret deviations from the mean R² as meaningful increases in model performance and provided a suitable baseline for the statistical tests. We have added this clarification on p.13 of the revised manuscript.
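
The normalization described above can be sketched as follows, assuming the z-scoring is done separately for each subject over that subject's full sensor × time-point grid (the response does not spell this out). All R² values are made up for illustration, and the cluster-based correction is not shown.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

def zscore_grid(r2):
    """Z-score a sensors x time-points grid of R^2 values across all cells."""
    flat = [v for row in r2 for v in row]
    mu, sd = mean(flat), pstdev(flat)
    return [[(v - mu) / sd for v in row] for row in r2]

def one_sample_t(values):
    """t statistic for testing whether the mean of `values` differs from zero."""
    return mean(values) / (stdev(values) / sqrt(len(values)))

# Hypothetical R^2 grids (2 sensors x 3 time points) for three subjects.
subjects = [
    [[0.10, 0.12, 0.20], [0.08, 0.10, 0.18]],
    [[0.05, 0.06, 0.12], [0.04, 0.05, 0.10]],
    [[0.20, 0.22, 0.31], [0.19, 0.21, 0.30]],
]

z = [zscore_grid(grid) for grid in subjects]

# Group-level test at one sensor/time cell: after z-scoring, zero corresponds
# to each subject's own mean R^2, so t > 0 means above-average fit there.
t_cell = one_sample_t([zs[0][2] for zs in z])
print(round(t_cell, 2))
```

Because each subject's grid is centered on its own mean, the test asks whether a given sensor and latency is fit better than that subject's average, not whether R² exceeds an absolute chance level.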

      Thank you for doing the TRF analysis, but where are the acoustic TRFs, analogous to the acoustic results for your HM-LSTM ridge analyses? And what tools did you use to do the TRF analysis? If it is something like the mTRF MATLAB toolbox, then it is also using ridge regression, as you have already done in your original analysis, correct? If so, then it is pretty much the same as your original analysis, just with more dense timepoints, correct? This is what I meant by referring to TRFs originally, because what you have basically done originally was to make a 9-point TRF (and then the plots and analyses are contrasts of pairs of those), with lags between -100 and 300 ms relative to the temporal alignment between the regressors and the EEG, I think (more on this below).

      Also with the new TRF analysis, you say that the regressors/predictors had "a value of 1 at each unit boundary offset". So this means you re-made these predictors to be discrete as I and reviewer 3 were mentioning before (rather than using the HM-LSTM model layer(s)), and also, that you put each phoneme/word/etc. marker at its offset, rather than its onset? I'm also confused as to why you would do this rather than the onset, but I suppose it doesn't change the interpretation very much, just that the TRFs are slid over by a small amount.

      We used the Python package Eelbrain (https://eelbrain.readthedocs.io/en/r0.39/auto_examples/temporal-response-functions/trf_intro.html) to conduct the multivariate temporal response function (mTRF) analyses. As we previously explained in our response to Reviewer 3, we did not apply mTRF to the acoustic features due to the high dimensionality of the input. Specifically, our acoustic representation consists of a 130-dimensional vector sampled every 10 ms throughout the speech stimuli (comprising a 129-dimensional spectrogram and a 1-dimensional amplitude envelope). This renders the 130 TRF weights for the acoustic features uninterpretable. However, we have now added TRF results for the 1-dimensional envelope of the attended and unattended speech, sampled every 10 ms.

      A similar constraint applied to the hidden-layer activations from our HM-LSTM model for the five linguistic features. After dimensionality reduction via PCA, each still resulted in 150-dimensional vectors, further preventing their use in mTRF analyses. To address this, we instead used binary predictors marking the offset of each linguistic unit (phoneme, syllable, word, phrase, sentence). These rate models are represented as five distinct binary time series, each aligned with the timing of the corresponding linguistic unit, making them well-suited for mTRF analysis. It is important to note that these rate predictors differ from the HM-LSTM-derived features: They encode only the timing of linguistic unit boundaries, not the content or representational structure of the linguistic input. Therefore, we do not consider the mTRF analyses to be equivalent to the ridge regression analyses based on HM-LSTM features.
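
A minimal sketch of such rate predictors is given below. The boundary times are invented for a one-second toy stimulus, and the 10 ms step is assumed to match the sampling step used for the acoustic features; the function and variable names are hypothetical.

```python
def rate_predictor(boundaries_s, duration_s, step_s=0.01):
    """Binary time series: 1 at each unit boundary, 0 everywhere else."""
    n_samples = int(round(duration_s / step_s))
    series = [0] * n_samples
    for t in boundaries_s:
        idx = int(round(t / step_s))
        if 0 <= idx < n_samples:
            series[idx] = 1
    return series

# Hypothetical boundary times (in seconds) for a 1-second toy sentence.
# Higher levels reuse a subset of the lower-level boundaries.
boundaries = {
    "phoneme":  [0.08, 0.21, 0.30, 0.42, 0.55, 0.70, 0.86],
    "syllable": [0.21, 0.42, 0.70, 0.86],
    "word":     [0.42, 0.86],
    "phrase":   [0.86],
    "sentence": [0.86],
}

predictors = {
    level: rate_predictor(times, duration_s=1.0)
    for level, times in boundaries.items()
}
print({level: sum(series) for level, series in predictors.items()})
```

Each series carries timing only: two sentences with identical boundary times but different words would yield identical predictors, which is the sense in which the rate models differ from the HM-LSTM features.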

      For onset vs. offset, as explained in RE4, we labelled them “offsets” because our ridge‐regression with HM-LSTM features was aligned to sentence offsets rather than onsets (see RE4 and RE15 below for the rationale of using sentence offset). However, since each unit offset coincides with the next unit’s onset, and the rate model simply marks these transition points as 1, the “offset” and “onset” models yield identical mTRFs. To avoid confusion, we have relabeled “offset” as “boundary” in Figure S2.

      I'm still confused about offsets generally. Does this maybe mean that the EEG, and each predictor, are all aligned by aligning their endpoints, which are usually/always the ends of sentences? So e.g. all the phoneme activity in the phoneme regressor actually corresponds to those phonemes of the stimuli in the EEG time, but those regressors and EEG do not have a common starting time (one trial to the next maybe?), so they have to be aligned with their ends instead?

      We chose to use sentence offsets rather than onsets based on the structure of our input to the HM-LSTM model, where each input consists of a pair of sentences encoded in phonemes, such as “t a_1 n əŋ_2 f ei_1 <sep> zh ə_4 sh iii_4 f ei_1 j ii_1” (“It can fly <sep> This is an airplane”). The two sentences are separated by a special <sep> token, and the model’s objective is to determine whether the second sentence follows the first, similar to a next-sentence prediction task. Since the model processes both sentences in full before making a prediction, the neural activations of interest should correspond to the point at which the entire sentence has been processed. To enable a fair comparison between the model’s internal representations and brain responses, we aligned our neural analyses with the sentence offsets, capturing the time window after the sentence has been fully perceived by the participant. Thus, we extracted epochs from -100 to +300 ms relative to each sentence offset, consistent with our model-informed design. If we aligned model activity with EEG data time-locked to sentence onsets, we would be examining linguistic representations at all levels (from phoneme to sentence) of the whole sentence at a time when participants have not yet heard the sentence. By contrast, aligning to sentence offsets ensures that participants have constructed a full-sentence representation.
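
The input format can be illustrated with a short sketch. It assumes that pairs are built from consecutive sentences in the stimulus; the first two phoneme strings are the example quoted above, the third is a labeled placeholder, and the helper name is hypothetical.

```python
SEP = "<sep>"

def make_sentence_pairs(sentences):
    """Join each sentence with its successor, separated by the <sep> token."""
    return [f"{first} {SEP} {second}"
            for first, second in zip(sentences, sentences[1:])]

sentences = [
    "t a_1 n əŋ_2 f ei_1",             # "It can fly" (from the response)
    "zh ə_4 sh iii_4 f ei_1 j ii_1",   # "This is an airplane" (from the response)
    "PLACEHOLDER_THIRD_SENTENCE",      # hypothetical stand-in, not real stimulus
]

pairs = make_sentence_pairs(sentences)
for pair in pairs:
    print(pair)
```

Under this pairing, every sentence except the first appears once as the second member of a pair, which is the sentence whose offset the EEG epochs are locked to.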

      We understand that it is a bit confusing why the regressor of each level is not aligned to their own offsets in the data. The hidden-layer activations of the HM-LSTM model corresponding to the five linguistic levels (phoneme, syllable, word, phrase, sentence) are consistently 150-dimensional vectors after PCA reduction. As a result, for each input sentence pair, the model produces five distinct hidden-layer activations, each capturing the representational content associated with one linguistic level for the whole sentence. We believe our -100 to 300 ms time window relative to sentence offset reflects a meaningful period during which the brain integrates and comprehends information across multiple linguistic levels.

      Being "time-locked to the offset of each sentence at nine latencies" is not something I can really find in any of the references that you mentioned, regarding the offset aspect of this method. Can you point me more specifically to what you are trying to reference with that, or further explain? You said that "predicting EEG signals around the offset of each sentence" is "a method commonly employed in the literature", but the example you gave of Goldstein 2022 is using onsets of words, which is indeed much more in line with what I would expect (not offsets of sentences).

      You are correct that Goldstein (2022) aligned model predictions to onsets rather than offsets; however, many studies in the literature also align model predictions with unit offsets, typically because they mark the point at which participants have already processed the relevant information (Brennan, 2016; Brennan et al., 2016; Gwilliams et al., 2024, 2025). Similarly, in our study, we aim to identify neural correlates for each model-derived feature. If we correlated model activity with EEG data aligned to sentence onsets, we would be examining linguistic representations at all levels (from phoneme to sentence) of the whole sentence at a time when participants have not yet heard the sentence. By contrast, aligning to sentence offsets ensures that participants have constructed a full-sentence representation. Although this limits our analysis to a subset of the data (143 sentences × 400 ms windows × 4 conditions), it targets the exact moment when full-sentence representations emerge against background speech, allowing us to map each model-derived feature onto its neural signature. We have added this clarification on p.12 of the revised manuscript.

      This new sentence does not make sense to me: "The regressors are aligned to sentence offsets because all our regressors are taken from the hidden layer of our HM-LSTM model, which generates vector representations corresponding to the five linguistic levels of the entire sentence".

      Thank you for the suggestion. We hope our responses in RE4, 15 and 16, along with our supplementary video, have now clarified the issue. We have deleted the sentence and provided a more detailed explanation on p.12 of the revised manuscript: The regressors are aligned to sentence offsets because our goal is to identify neural correlates for each model-derived feature of a whole sentence. If we align model activity with EEG data time-locked to sentence onsets, we would be finding neural responses to linguistic levels (from phoneme to sentence) of the whole sentence at the time when participants have not processed the sentence yet. By contrast, aligning to sentence offsets ensures that participants have constructed a full-sentence representation. Although this limits our analysis to a subset of the data (143 sentences × 2 sections × 400 ms windows), it targets the exact moment when full-sentence representations emerge against background speech, allowing us to map each model-derived feature onto its neural signature. We understand that phonemes, syllables, words, phrases, and sentences differ in their durations. However, the five hidden activity vectors extracted from the model are designed to capture the representations of these five linguistic levels across the entire sentence. Specifically, for a sentence pair such as “It can fly <sep> This is an airplane,” the first 2048-dimensional vector represents all the phonemes in the two sentences (“t a_1 n əŋ_2 f ei_1 <sep> zh ə_4 sh iii_4 f ei_1 j ii_1”), the second vector captures all the syllables (“ta_1 nəŋ_2 fei_1 <sep> zhə_4 shiii_4 fei_1 jii_1”), the third vector represents all the words, the fourth vector captures the phrases, and the fifth vector represents the sentence-level meaning. In our dataset, input pairs consist of adjacent sentences from the stimuli (e.g., Sentence 1 and Sentence 2, Sentence 2 and Sentence 3, and so on), and for each pair, the model generates five 2048-dimensional vectors, each corresponding to a specific linguistic level. To identify the neural correlates of these model-derived features, each intended to represent the full linguistic level across a complete sentence, we focused on the EEG signal surrounding the completion of the second sentence rather than on incremental processing. Accordingly, we extracted epochs from -100 ms to +300 ms relative to the offset of the second sentence and performed ridge regression analyses using the five model features (reduced to 150 dimensions via PCA) at every 50 ms across the epoch.
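
The epoching and per-latency regression described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the EEG, the features and the sentence offsets are random placeholders, and the sampling rate, number of sensors and ridge penalty `alpha` are hypothetical. Only the 143 sentences, the 150 feature dimensions, and the -100 to +300 ms window sampled every 50 ms come from the text.

```python
import numpy as np

def ridge_r2(X, y, alpha=1.0):
    """In-sample R^2 of a ridge fit of y (n,) on X (n, p), intercept unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Closed-form ridge solution (X'X + alpha*I)^-1 X'y
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    resid = yc - Xc @ w
    return 1.0 - (resid @ resid) / (yc @ yc)

sfreq = 100                                   # Hz (assumed)
latencies_ms = list(range(-100, 301, 50))     # nine latencies around the offset
n_sentences, n_features, n_sensors = 143, 150, 4

rng = np.random.default_rng(0)
features = rng.standard_normal((n_sentences, n_features))  # one level, post-PCA
eeg = rng.standard_normal((n_sensors, 60 * sfreq))          # sensors x samples
# Hypothetical sentence-offset sample indices, kept away from the edges.
offsets = np.sort(rng.choice(np.arange(100, eeg.shape[1] - 100),
                             size=n_sentences, replace=False))

r2 = np.empty((n_sensors, len(latencies_ms)))
for j, latency in enumerate(latencies_ms):
    # EEG sample at this latency relative to every sentence offset.
    idx = offsets + int(round(latency * sfreq / 1000))
    for s in range(n_sensors):
        r2[s, j] = ridge_r2(features, eeg[s, idx], alpha=1.0)

print(r2.shape)
```

With 150 predictors and 143 sentences the in-sample R² is high even for noise, which is one reason a relative baseline, rather than the raw R², is needed to interpret the scores.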

      More on the issue of sentence offsets: In response to reviewer 3's question about -100 - 300 ms around sentence offset, you said "Using the entire sentence duration was not feasible, as the sentences in the stimuli vary in length, making statistical analysis challenging. Additionally, since the stimuli consist of continuous speech, extending the time window would risk including linguistic units from subsequent sentence." This does not make sense to me, so can you elaborate? It sounds like you are actually saying that you only analyzed 400 ms of each trial, but that cannot be what you mean.

      Yes, we analyzed only the 400 ms window surrounding each sentence offset. Although this represents just a subset of our data (143 sentences × 400 ms × 4 conditions), it precisely captures when full-sentence representations emerge against background speech. Because our model produces a single, condensed representation for each linguistic level over the entire sentence—rather than incrementally—we think it is more appropriate to align to the period surrounding sentence offsets. Additionally, extending the window (e.g. to 2 seconds) would risk overlapping adjacent sentences, since sentence lengths vary. Our focus is on the exact period when integrated, level-specific information for each sentence has formed in the brain, and our results already demonstrate different response patterns to different linguistic levels for the two listener groups within this interval. We have added this clarification on p.13 of the revised manuscript.

      In your mTRF analysis, you are now saying that the discrete predictors have "a value of 1" at each of the "boundary offsets", and those TRFs look very similar to your original plots. It sounds to me like you should not be referring to time zero in your original ridge analysis as "sentence offset". If what you mean is that sentence offset time is merely how you aligned the regressors and EEG in time, then your time zero still has a standard, typical TRF interpretation. It is just the point in time, or lag, at which the regressor(s) and EEG are aligned. So activity before zero is "predictive" and activity after zero is "reactive", to think of it crudely. So also in the text, when you say things like "50-150 ms after the sentence offsets", I think this is not really what you mean. I think you are referring to the lags of 50 - 150 ms, relative to the alignment of the regressor and the EEG.

      Thank you very much for the explanation. We agree that, in our ridge‐regression time course, pre-zero lags index “predictive” processing and post-zero lags index “reactive” processing. Unlike in a TRF analysis, we applied ridge regression to our high-dimensional model features at nine discrete lags around the sentence offset. At each lag, we tested whether the regression score exceeded a baseline defined as the mean regression score across all lags. For example, finding a significantly higher regression score between 50 and 150 ms suggests that our regressor reliably predicted EEG activity in that time window. So here time zero refers to the precise moment of the sentence offset, not to the alignment of the regressor and the EEG.

      I look forward to discussing how much of my interpretation here makes sense or doesn't, both with the authors and reviewers.

      Thank you very much for this very constructive feedback. We hope that we have addressed all your questions.

    1. If you are in the dominant cultural group on your campus, write a paragraph describing values you share with your cultural group. Then list things that students with a different background may have difficulty understanding about your group. If your racial, ethnic, or cultural background is different from the dominant cultural group on your campus, write a paragraph describing how students in the dominant culture seem to differ from your own culture. Look back at what you just wrote. Did you focus on characteristics that seem either positive or negative? Might there be any stereotypes creeping into your thinking? Write a second paragraph focusing on yourself as a unique individual, not a part of a group. How would others benefit from getting to know you better?
                      According to a source I found online, 42% of students at CCAC are white. Hence, I am part of the "dominant cultural group". I am not a very social person; I don't discuss values, I discuss ideas. I share these ideas with whoever might be interested in them, whether or not they're part of my "dominant cultural group". Students with a different background may have been raised in a different household, taught in different schools, and potentially lived in different areas than me.
      
      Looking back at what I wrote, I believe everything I've written is subjective. I don't think anything I've said is either positive or negative; it just reflects a different background or a different appearance. As for stereotypes, I didn't consider them, since you can't exactly stereotype something that is subjective.
      
           Focusing on myself, Caiden Ward, other people could benefit from getting to know me as a way to explore my ideas, and interests. If we share the same interests, we can both discuss them, hence learning from each other, which is then benefiting from each other.
      
    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors use methylphenidate (MPH) administration after learning a Pavlovian to instrumental transfer (PIT) task to parse decision making from instrumental influences. While the main effects were null, individual differences in working memory ability moderated the tendency of MPH to boost cognitive control in order to override PIT-biased instrumental learning. Importantly, this working memory moderator had symmetrical effects in appetitive and aversive conditions, and these patterns replicated within each valence condition across different values of gain/loss (Fig S1c), suggesting a reliable effect that is generalized across instances of Pavlovian influence.

      Strengths:

      The idea of using pharmacological challenge after learning but prior to transfer is a novel technique that highlights the influence of catecholamines on the expression of learning under Pavlovian bias, and importantly it dissociated this decision feature from the learning of stimulus-outcome or action-outcome pairings.

      We thank the reviewer for highlighting the timing of the pharmacological intervention as a strength for this study and for the suggested improvements for clarification.

      Weaknesses:

      While the report is largely straightforward and clearly written, some aspects may be edited to improve the clarity for other readers.

      (1) Theoretical clarity. The authors seem to hedge their bets when it comes to placing these findings within a broader theoretical framework.

      Our findings call for a revision of theories on how catecholamines are involved in the instantiation of Pavlovian biases in decision making. The reviewer rightly notices that we offer three routes to modify current theory to be able to incorporate our findings. Briefly, these routes discuss catecholaminergic modulation of Pavlovian biases (i) through modulation of the putative striatal ‘origin’ of Pavlovian biases, (ii) through top-down control, primarily relying on prefrontal processes, and (iii) a combination of the two, where catecholamines regulate the balance between these striatal and frontal processes.

      Given the systemic nature of the pharmacological manipulation, we cannot dissociate between these three accounts. We believe that discussing these possible explanations enriches our Discussion and strengthens our recommendation in the ultimate paragraph to use pharmacological neuroimaging studies to arbitrate between these options. In the revision, we have made this line of reasoning more clear, in part by adding guiding titles to the Discussion section and adding a summary paragraph in the Discussion (Discussion, page 9-12).

      (2) Analytic clarity: what's c^2?

      C^2 appears to be a PDF conversion error: all chi-square symbols (χ<sup>2</sup>) were converted to C2. This is now corrected in our revision.

      Reviewer #2 (Public review):

      Summary:

      In this study, Geurts et al. investigated the effects of the catecholamine reuptake inhibitor methylphenidate (MPH) on value-based decision making using a combination of aversive and appetitive Pavlovian to Instrumental Transfer (PIT) in a human cohort. Using an elegant behavioural design, they showed valence- and action-specific effects of Pavlovian cues on instrumental responses. Initial analyses show no effect of MPH on these processes. However, the authors performed a more in-depth analysis and demonstrated that MPH actually modulates PIT in an action-specific manner depending on individual working memory capacities. The authors interpret this as an effect on cognitive control of Pavlovian biasing of actions and decision making more than an invigoration of motivational biases.

      Strengths:

      A major strength of this study is its experimental design. The elegant combination of appetitive and aversive Pavlovian learning with approach/avoidance instrumental actions allows precise investigation of the different modulation of value-based decision making depending on the context and environmental stimuli. Importantly, MPH is only administered after Pavlovian and instrumental learning, restricting the effect to PIT performance only. Finally, the use of a placebo-controlled crossover design allows within-subject comparisons between PIT effects under placebo and MPH and the investigation of the relationships between working memory abilities, PIT and MPH effects.

      We thank the reviewer for highlighting the experimental design as a strength for this study and the suggested improvements for clarification.

      Weaknesses:

      As authors stated in their discussion, this study is purely correlational and their conclusions could be strengthened by the addition of interesting (but time- and resource-consuming) neuroimaging work.

      We employ a pharmacological intervention within a randomized placebo controlled cross-over design, which allows for causal inferences with respect to the placebo-controlled intervention. Thus, the reported interactions of interest include correlations, but these are causally dependent on our intervention.

      Perhaps the reviewer refers to the implications of our findings for hypotheses regarding neural implementation of Pavlovian bias-generation. Indeed, based on our data we are not able to arbitrate between frontal and striatal accounts, due to the systemic nature of the pharmacological intervention. Thus, we agree with the reviewer that neuroimaging (in combination with for example brain stimulation) would be a valuable next step to identify the neural correlates to these pharmacological intervention effects, to dissociate between frontal and striatal basis of the effects. In the revision, as per our reply to reviewer 1, we have made this line of reasoning more clear, in part by adding guiding titles to the Discussion section and adding a summary paragraph in the Discussion (Discussion, page 9-12).

      The originality of this work compared to their previous published work using the same cohort could also be clarified at different stages of the article, as I initially wondered what was really novel. This point is much clearer in the discussion section.

      As recommended, we brought forward parts of the Discussion that clarify the originality of the current experiment to the introduction (page 4/5) and result section (page 8).

      A point which, in my opinion, really requires clarification is when the working memory performance presented in Figure 2B has been determined. Was it under placebo (as I would guess) or under MPH? If it is the former, it would be also interesting to look at how MPH modulates working memory based on initial abilities.

      We have now clarified that working memory span was assessed for all participants on Day 2 prior to the start of instrumental training (as illustrated in figure 1A). Importantly, this was done prior to ingestion of the drug or placebo (which subjects received after Pavlovian training, which followed the instrumental training). This design also precludes an assessment of the effects of MPH on working memory capacity.

      A final point is that it could be interesting to also discuss these results, not only regarding dopamine signalling, but also including potential effect of MPH on noradrenaline in frontal regions, considering the known role of this system in modulating behavioural flexibility.

      We indeed focus our Discussion more on dopamine than on noradrenaline. Our revision now also discusses noradrenaline in light of our frontal control hypothesis and the recommendation, in future studies, to use a multi-drug design, incorporating, for example, a session with the drug atomoxetine, which modulates cortical catecholamines, but not striatal dopamine (Discussion, page 12).

      Reviewer #3 (Public review):

      The manuscript by Geurts and colleagues studies the effects of methylphenidate on Pavlovian to instrumental transfer in humans and demonstrates that the effects of the drug depend on the baseline working memory capacity of the participants. The experiment used a well-established cognitive task that measures the effects of Pavlovian cues predicting monetary wins and losses on instrumental responding in two different contexts, namely approach and withdraw. By administering the drug after participants went through the instrumental and Pavlovian learning phases of the experiment, the authors limited the effects of the drug to the transfer phase in extinction. This allowed the authors to make inferences about the invigorating effects of the cues independently from any learning bias. Moreover, the authors employed a within-subject design to study the effect of the drug on 100 participants, which also allows detection of continuous between-subject relationships with covariates such as working memory capacity.

      The study replicates previous findings using this task, namely that appetitive cues promote active responding, and aversive cues promote passive responding in an approach instrumental context, whereas the effect of the cues reverses in a withdraw instrumental context. The results of the methylphenidate manipulation show that the drug decreases the effects of the Pavlovian cues on instrumental responding in participants with low working memory capacity but increases the Pavlovian effects in participants with high working memory capacity. Importantly, in the latter group, methylphenidate increases the invigorating effect of appetitive Pavlovian cues on active approach and aversive Pavlovian cues on active withdrawal as well as the inhibitory effects of aversive Pavlovian cues on active approach and appetitive Pavlovian cues on active withdrawal. These results cannot be explained if catecholamines are just involved in Pavlovian biases by modulating behavioral invigoration driven by the anticipation of reward and punishment in the striatum, as this account cannot explain the reversal of the effects of a valence cue on vigor depending on the instrumental context.

      In general, I find the methods of this study very robust and the results very convincing and important. However, I have some concerns:

      We thank the Reviewer for highlighting the robustness of the methods and the importance of the results. We are glad to briefly address the concerns here and have incorporated these in our revision.

      I am not convinced that the inclusion of impulsivity scores in the logistic mixed model to analyze the effects of methylphenidate on PIT is warranted. The authors do not show whether inclusion of this covariate is justified in terms of BIC. Moreover, they include this covariate but do not report the effects. Finally, it is possible that impulsivity is correlated with working memory capacity. In that case, multicollinearity may impact the estimation of the coefficient estimates and may inflate the p-values for the correlated covariates. Are the reported results robust when this factor is not included?

      With regard to the inclusion of impulsivity, we would first like to mention that this inclusion in our analyses was planned a priori and therefore consistently implemented in the other reports resulting from the overarching study (Froböse et al., 2018; Cook et al., 2019; Rostami Kandroodi et al., 2021), especially the study with regard to which the current report is an eLife Research Advance (Swart et al., 2017). Moreover, we preregistered both working memory span and impulsivity as potential factors (under secondary measures) that could mediate the effects of catecholamines (see https://onderzoekmetmensen.nl/nl/trial/26989). The inclusion of working memory span was based on evidence from PET imaging studies demonstrating a link with dopamine synthesis capacity (Cools et al., 2008; Landau et al., 2009), whereas the inclusion of trait impulsivity was based on evidence from other PET imaging studies showing a link with dopamine (auto)receptor availability (Buckholtz et al., 2010; Kim et al., 2014; Lee et al., 2009; Reeves et al., 2012). Although there was no significant improvement for the model with impulsivity compared with the model without impulsivity, we feel that we should follow our a priori established analyses.

      We can confirm that impulsivity and working memory were not correlated in this sample (r98=-0.16, p=0.88), which rules out multicollinearity.
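
      As a rough illustration of why a correlation of this size is unproblematic, consider only these two covariates and the standard variance inflation factor (VIF) for a pair of predictors. This is a sketch under that simplifying assumption; the full model contains further terms that are not considered here:

```latex
\mathrm{VIF} = \frac{1}{1 - r^{2}} = \frac{1}{1 - (-0.16)^{2}} = \frac{1}{1 - 0.0256} \approx 1.03
```

      A VIF this close to 1 would indicate that the variance of either coefficient estimate is barely inflated by the presence of the other covariate.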

      Most importantly, results are robust to excluding impulsivity scores as evidenced by a significant four-way interaction from the omnibus GLMM without impulsivity (Action Context x Valence x Drug x WM span: χ<sup>2</sup> = 9.5, p=0.002). We have now added this text to the Supplemental Results: Control analyses, page 28.

      The authors state that working memory capacity is an established proxy for dopamine synthesis capacity and cite some studies supporting this view. However, the authors omit a recent reference by van den Bosch et al that provides evidence for the absence of links between striatal dopamine synthesis capacity and working memory capacity. The lack of a robust link between working memory capacity and dopamine synthesis capacity in the striatum strengthens the alternative explanations of the results suggested in the discussion.

      We agree with the Reviewer that the lack of a robust link between working memory capacity and dopamine synthesis capacity in the striatum, as measured with [<sup>18</sup>F]-FDOPA PET imaging, is lending support for the proposed hypothesis incorporating a broader perspective on Pavlovian bias generation than the dopaminergic direct/indirect pathway account (although it is possible that the association will hold in a larger sample when synthesis capacity is measured with [<sup>18</sup>F]-FMT PET imaging, which is sensitive to a different component of the metabolic pathway). We will indeed incorporate in our planned revision the findings from our group reported in van den Bosch et al (2022).

      See Supplemental methods 2: Working memory and impulsivity assessment, page 26.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Theoretical clarity. Some aspects of the paper are ideally clear: Figure 1 clearly explains the paradigm. The general take-home message is clearly described in the last line of the abstract, the last line of the introduction, the first line of the discussion, and throughout other places in the discussion. Yet the authors seem to hedge their bets when it comes to placing these findings within a broader theoretical framework.

      The discussion includes many possible theoretical interpretations of the findings, which is laudable, but many readers may get lost in this multitude (particularly anyone who isn't an RL/DA aficionado). The group's prior work (i.e. striatal hypothesis) is first described, followed by a rather complex breakdown of valence-action tendencies, then the seemingly preferred explanation for the current study (i.e. cognitive control hypothesis) is advanced as "an alternative account ...". This is followed by a third, more complex idea (i.e. cortico-striatal balance hypothesis), then the paper ends. A reader may be forgiven for skimming through this discussion and not having a clear idea of how to frame these effects. I think some subheaders would help, as well as clearer labeling of the theoretical interpretations in line with a more authoritative description of the authors' preferred interpretation of the empirical effects.

      Our findings call for a revision of theories on how catecholamines are involved in the instantiation of Pavlovian biases in decision making. The reviewer rightly notices that we offer three routes to modify current theory to be able to incorporate our findings. Briefly, these routes discuss catecholaminergic modulation of Pavlovian biases (i) through modulation of the putative striatal ‘origin’ of Pavlovian biases, (ii) through top-down control, primarily relying on prefrontal processes, and (iii) a combination of the two, where catecholamines regulate the balance between these striatal and frontal processes.

      Given the systemic nature of the pharmacological manipulation, we cannot dissociate between these three accounts. We believe that discussing these possible explanations enriches our Discussion and strengthens our recommendation in the ultimate paragraph to use pharmacological neuroimaging studies to arbitrate between these options. In the revision, we have made this line of reasoning more clear, in part by adding guiding titles to the Discussion section and adding a summary paragraph in the Discussion (Discussion, page 9-12).

      (2) All statistical effects are presented as c^2 with no df. The methods only describe LMER and make no mention of what the c^2 measure represents.

      C^2 appears to be a PDF conversion error: all chi-square symbols (χ<sup>2</sup>) were converted to C2. This is now corrected in our revision.

      Reviewer #2 (Recommendations for the authors):

      Few minor points:

      Figure 2A is not cited in the text I think

      Checked and changed.

      Figure 2C: "C" is not present in the figure. Also, I could not see the data corresponding to the MPH-Approach context in the Neutral Pavlovian condition, but I think it is probably masked by another curve.

      Checked and changed. Indeed, the one curve is masked by the other curve.

      As I stated in the public review, a clarification or more detailed analysis of working memory performance depending on if it was measured under MPH or placebo could be a plus.

      Changed this (see public review reply).

      I did not see any statement about the availability of data but I may have missed it.

      Yes, the statement can be found:

      Methods, page 13: Data and code for the study are freely available at https://data.ru.nl/collections/di/dccn/DSC_3017031.02_734.

      Reviewer #3 (Recommendations for the authors):

      The authors should check that inclusion of impulsivity in the logistic mixed model is justified and if it is justified make sure that multicollinearity is not problematic.

      See answer to public review for convenience reiterated below:

      With regard to the inclusion of impulsivity, we would first like to mention that this inclusion in our analyses was planned a priori and therefore consistently implemented in the other reports resulting from the overarching study (Froböse et al., 2018; Cook et al., 2019; Rostami Kandroodi et al., 2021), especially the study with regard to which the current report is an eLife Research Advance (Swart et al., 2017). Moreover, we preregistered both working memory span and impulsivity as potential factors (under secondary measures) that could mediate the effects of catecholamines (see https://onderzoekmetmensen.nl/nl/trial/26989). The inclusion of working memory span was based on evidence from PET imaging studies demonstrating a link with dopamine synthesis capacity (Cools et al., 2008; Landau et al., 2009), whereas the inclusion of trait impulsivity was based on evidence from other PET imaging studies showing a link with dopamine (auto)receptor availability (Buckholtz et al., 2010; Kim et al., 2014; Lee et al., 2009; Reeves et al., 2012). Although there was no significant improvement for the model with impulsivity compared with the model without impulsivity, we feel that we should follow our a priori established analyses.

      We can confirm that impulsivity and working memory were not correlated in this sample (r98=-0.16, p=0.88), which rules out multicollinearity.

      Most importantly, results are robust to excluding impulsivity scores as evidenced by a significant four-way interaction from the omnibus GLMM without impulsivity (Action Context x Valence x Drug x WM span: χ<sup>2</sup> = 9.5, p=0.002). We have now added this text to the Supplemental Results: Control analyses, page 28.

      I would recommend that the authors make clear that the effects of methylphenidate are dependent on working memory capacity in the first sentence of the penultimate paragraph of the introduction on page 4.

      Changed this accordingly, see Introduction, page 5.

      I would make sure that the text in the figures is readable without needing to enlarge the figures. I would also highlight the significant effects in the figures.

      We changed the font size accordingly and added significance statements to the caption, because depicting the significance of a four-way interaction including one continuous variable is not straightforward.

      The distributions of p(Go) by conditions such as in figure 1D or 2A are very intuitive. Figure 2B is very informative as it shows the continuous effects of working memory capacity on the PIT effect. I would add (in figure 2 or in the supplement) a plot of the p(Go) with a tertile split based on working memory. Considering that the correspondent analysis is being reported, having the plot would strengthen and simplify the understanding of the results.

      The continuous effects of working memory are based on WM values on the listening span ranging from 2.5-7, in steps of 0.5, resulting in 10 different values. A tertile split would result in binning these into two bins of three values, and one bin of four values. Given that all of the datapoints for this tertile split are already presented in the current figures, we strongly prefer not to include this additional figure.
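
      The bin counts mentioned above can be checked with a short sketch. It assumes the split is made over the distinct listening-span values rather than over participants; the helper name `split_into_groups` is hypothetical and is not taken from the study's code:

```python
def split_into_groups(values, n_groups):
    """Split an ordered list into n_groups contiguous, near-equal groups."""
    base, extra = divmod(len(values), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups receive one additional value.
        size = base + (1 if i < extra else 0)
        groups.append(values[start:start + size])
        start += size
    return groups


# Listening-span scores as described in the text: 2.5 to 7 in steps of 0.5.
wm_values = [2.5 + 0.5 * i for i in range(10)]

# Tertile split over the distinct score values (an assumption of this sketch).
tertiles = split_into_groups(wm_values, 3)
bin_sizes = [len(group) for group in tertiles]

print(wm_values)
print(bin_sizes)
```

      With ten distinct values, any three-way split of this kind yields one bin of four values and two bins of three, so a tertile plot would only regroup data points that the continuous figures already show.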

      I would add some sentences in the results section (and maybe in the discussion if needed) addressing the results that the effect of Valence by drug by WM span is only significant in the withdrawal context but not in the approach context.

      We now added an emphasis on the specifically significant drug effects in withdrawal in the Results section, page 8.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This is a valuable polymer model that provides insight into the origin of macromolecular mixed and demixed states within transcription clusters. The well-performed and clearly presented simulations will be of interest to those studying gene expression in the context of chromatin. While the study is generally solid, it could benefit from a more direct comparison with existing experimental data sets as well as further discussion of the limits of the underlying model assumptions.

      We thank the editors for their overall positive assessment. In response to the Referees’ comments, we have addressed all technical points, including a more detailed explanation of the methodology used to extract gene transcription from our simulations and its analogy with real gene transcription. Regarding the potential comparison with experimental data and our mixing–demixing transition, we have added new sections discussing the current state of the art in relevant experiments. We also clarify the present limitations that prevent direct comparisons, which we hope can be overcome with future experiments using the emerging techniques.

      Reviewer #1 (Public Review):

      This manuscript discusses from a theory point of view the mechanisms underlying the formation of specialized or mixed factories. To investigate this, a chromatin polymer model was developed to mimic the chromatin binding-unbinding dynamics of various complexes of transcription factors (TFs).

      The model revealed that both specialized (i.e., demixed) and mixed clusters can emerge spontaneously, with the type of cluster formed primarily determined by cluster size. Non-specific interactions between chromatin and proteins were identified as the main factor promoting mixing, with these interactions becoming increasingly significant as clusters grow larger.

      These findings, observed in both simple polymer models and more realistic representations of human chromosomes, reconcile previously conflicting experimental results. Additionally, the introduction of different types of TFs was shown to strongly influence the emergence of transcriptional networks, offering a framework to study transcriptional changes resulting from gene editing or naturally occurring mutations.

      Overall I think this is an interesting paper discussing a valuable model of how chromosome 3D organisation is linked to transcription. I would only advise the authors to polish and shorten their text to better highlight their key findings and make it more accessible to the reader.

      We thank the Referee for carefully reading our manuscript and recognizing its scientific value. As suggested, we tried to better highlight our key findings and make the text more accessible while addressing also the comments from the other Referees.

      Reviewer #2 (Public Review):

      Summary:

      With this report, I suggest what are in my opinion crucial additions to the otherwise very interesting and credible research manuscript ”Cluster size determines morphology of transcription factories in human cells”.

      Strengths:

      The manuscript in itself is technically sound, the chosen simulation methods are completely appropriate, the figures are well-prepared, and the text is mostly well-written save for a few typos. The conclusions are valid and would represent a valuable conceptual contribution to the field of clustering, 3D genome organization and gene regulation related to transcription factories, which continues to be an area of most active investigation.

      Weaknesses:

      However, I find that the connection to concrete biological data is weak. This holds especially given that the data needed to critically assess the applicability of the derived cross-over with factory size are, in fact, available for analysis, and the suggested experiments in the Discussion section have actually been done and their results can be exploited. In my judgement, unless these additional analyses are added to a level that crucial predictions on TF demixing and transcriptional bursting upon TU clustering can be tested, the paper is better suited to a theoretical biophysics venue than to a biology journal such as eLife.

      We thank the Reviewer for their positive assessment of the soundness of our work and its contribution to the field. We have added a paragraph to the Conclusions highlighting the current state of experimental techniques and outlining near-term experiments that could be extended to test our predictions. We also emphasise that our analysis builds on state-of-the-art polymer models of chromatin and on quantitative experimental datasets, which we used both to construct the model and to validate its outcomes (gene activity). We hope this strengthened link to experiment will catalyse further studies in the field.

      Major points:

      (1) My first point concerns terminology. The Merriam-Webster dictionary describes morphology as the study of structure and form. In my understanding, none of the analyses carried out in this study actually address the form or spatial structuring of transcription factories. I see no aspects of shape, only size. Unless the authors want to assess actual shapes of clusters, I would recommend instead talking only about their size/extent. The title is, by the same argument, in my opinion misleading as to the content of this study.

      We agree with the Referee that the title could be misleading. In our study we characterized cluster size, which is a morphological descriptor, and cluster composition, which is not morphology per se, although the term is used in the community in a broader sense. Nevertheless, to strengthen the message we have changed the title to: “Cluster size determines internal structure of transcription factories in human cells”.

      (2) Another major conceptual point is the choice of how a single TF:pol particle in the model relates to actual macromolecules that undergo clustering in the cell. What about the fact that even single TF factories still contain numerous canonical transcription factors, many of which are also known to undergo phase separation? Mediator, CDK9, Pol II just to name a few. This alone already represents phase separation under the involvement of different species, which must undergo mixing. This is conceptually blurred with the concept of gene-specific transcription factors that are recruited into clusters/condensates due to sequence-specific or chromatin-epigenetic-specific affinities. Also, the fact that even in a canonical gene with a ”small” transcription factory there are numerous clustering factors takes even the smallest factories into a regime of several tens of clustering macromolecules. It is unclear to me how this reality of clustering and factory formation in the biological cell relates to the cross-over that occurs at approximately n=10 particles in the simulations presented in this paper.

      This is a good point. However, in our case we can either look at clustering transcription factors or transcription units. In an experimental situation, transcription units could be “coloured”, or assigned different types, by looking at different cell types, so that they can be classified as housekeeping, or cell-type independent, or cell-type specific. This is similar to how DHS can be clustered. In this way the mixing or demixing state can be identified by looking at the type of transcription unit, removing any ambiguity due to the fact that the same protein may participate in different TF complexes.

      (3) The paper falls critically short in referencing and exploiting for analysis existing literature and published data both on 3D genome organization as well as the process of cluster formation in relation to genomic elements. In terms of relevant literature, most of the relevant body of work from the following areas has not been included:

      (i) mechanisms of how the clustering of Pol II, canonical TFs, and specific TFs is aided by sequence elements and specific chromatin states

      (ii) mechanisms of TF selectivity for specific condensates and target genomic elements

      (iii) most crucially, existing highly relevant datasets that connect 3D multi-point contacts with transcription factor identity and transcriptional activity, which would allow the authors to directly test their hypotheses by analysis of existing data

      Here, especially the data under point (iii) are essential. The SPRITE method (cited but not further exploited by the authors), even in its initial form of publication, would have offered a data set to critically test the mixing vs. demixing hypothesis put forward by the authors. Specifically, the SPRITE method offers ordered data on k-mers of associated genomic elements. These can be mapped against the main TFs that associate with these genomic elements, thereby giving an account of the mixed / demixed state of these k-mer associations. Even a simple analysis sorting these associations by the number of associated genomic elements might reveal a demixing transition with increasing association size k. However, a newer version of the SPRITE method already exists, which combines the k-mer association of genomic elements with the whole transcriptome assessment of RNAs associated with a particular DNA k-mer association. This can even directly test the hypotheses the authors put forward regarding cluster size, transcriptional activation, correlation between different transcription units’ activation etc.
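
      As an illustration of the kind of per-size tally suggested here, the following is a minimal sketch on made-up toy data, not on SPRITE output. It assumes each multi-way association has already been reduced to a list of labels, one per genomic element (here "H" for housekeeping and "T" for tissue-specific); the labels, the function names and the use of the majority fraction as a purity measure are all assumptions of this sketch:

```python
from collections import defaultdict


def majority_fraction(labels):
    """Fraction of elements in one association that carry its most common label."""
    counts = defaultdict(int)
    for label in labels:
        counts[label] += 1
    return max(counts.values()) / len(labels)


def purity_by_size(clusters):
    """Average majority fraction for each association size k."""
    by_size = defaultdict(list)
    for labels in clusters:
        by_size[len(labels)].append(majority_fraction(labels))
    return {k: sum(v) / len(v) for k, v in sorted(by_size.items())}


# Hypothetical toy associations: "H" = housekeeping, "T" = tissue-specific.
toy_clusters = [
    ["H", "H", "H"],
    ["T", "T", "T"],
    ["H", "H", "T", "T"],
    ["H", "T", "H", "T", "H", "T"],
]

purity = purity_by_size(toy_clusters)
for k, value in purity.items():
    print(k, value)
```

      In this two-label toy setting, a value near 1 corresponds to a demixed (specialized) association and a value near 0.5 to a fully mixed one, so plotting the value against k would show whether composition changes with association size.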

      To continue, the Genome Architecture Mapping (GAM) method from Ana Pombo’s group has also yielded data sets that connect the long-range contacts between gene-regulatory elements to the TF motifs involved in these contacts, and even provides ready-made analyses that assess how mixed or demixed the TF composition at different interaction hubs is. I do not see why this work and data set are not even acknowledged. I also strongly suggest analyzing these data or, if they are already sufficiently analyzed, discussing them in the light of 3D interaction hub size (number of interacting elements) and TF motif composition of the involved genomic elements.

      Further, a preprint from the Alistair Boettiger and Kevin Wang labs from May 2024 also provides direct, single-cell imaging data of all super-enhancers, combined with transcription detection, assessing even directly the role of number of super-enhancers in spatial proximity as a determinant of transcriptional state. This data set and findings should be discussed, not in vague terms but in detailed terms of what parts of the authors’ predictions match or do not match these data.

      For these data sets, an analysis in terms of the authors’ key predictions must be carried out (unless the underlying papers already provide such final analysis results). In answering this comment, what matters to me is not that the authors follow my suggestions to the letter. Rather, I would want to see that the wealth of available biological data and knowledge that connects to their predictions is used to their full potential in terms of rejecting, confirming, refining, or putting into real biological context the model predictions made in this study.

      References for point (iii):

      - RNA promotes the formation of spatial compartments in the nucleus https://www.cell.com/cell/fulltext/S0092-8674(21)01230-7?dgcid=raven_jbs_etoc_email

      - Complex multi-enhancer contacts captured by genome architecture mapping https://www.nature.com/articles/nature21411

      - Cell-type specialization is encoded by specific chromatin topologies https://www.nature.com/articles/s41586-021-04081-2

      - Super-enhancer interactomes from single cells link clustering and transcription https://www.biorxiv.org/content/10.1101/2024.05.08.593251v1.full

      For point (i) and point (ii), the authors should go through the relevant literature on Pol II and TF clustering, how this connects to genomic features that support the cluster formation, and also the recent literature on TF specificity. On the last point, TF specificity, especially the groups of Ben Sabari and Mustafa Mir have presented astonishing results that seem highly relevant to the Discussion of this manuscript.

      We appreciate the Reviewer’s insightful suggestion that a comparison between our simulation results and experimental data would strengthen the robustness of our model. In response, we have thoroughly reviewed the literature on multi-way chromatin contacts, with particular attention to SPRITE and GAM techniques. However, we found that the currently available experimental datasets lack sufficient statistical power to provide a definitive test of our simulation predictions, as detailed below.

      As noted by the Reviewer, SPRITE experiments offer valuable information on the composition of high-order chromatin clusters (k-mers) that involve multiple genomic loci. A closer examination of the SPRITE data (e.g., Supplementary Material from Ref. [1]) reveals that the majority of reported statistics correspond to 3-mers (three-way contacts), while data on larger clusters (e.g., 8-mers, 9-mers, or greater) are sparse. This limitation hinders our ability to test the demixing-mixing transition predicted in our simulations, which occurs for cluster sizes exceeding 10.

      Moreover, the composition of the k-mers identified by SPRITE predominantly involves genomic regions encoding functional RNAs—such as ITS1 and ITS2 (involved in rRNA synthesis) and U3 (encoding small nucleolar RNA)—which largely correspond to housekeeping genes. Conversely, there is little to no data available for protein-coding genes. This restricts direct comparison to our simulations, where the demixing-mixing transition depends critically on the interplay between housekeeping and tissue-specific genes.

      Similarly, while GAM experiments are capable of detecting multi-way chromatin contacts, the currently available datasets primarily report three-way interactions [2,3].

      In summary, due to the limited statistical data on higher-order chromatin clusters [4], a quantitative comparison between our simulation results and experimental observations is not currently feasible. Nevertheless, we have now briefly discussed the experimental techniques for detecting multi-way interactions in the revised manuscript to reflect the current state of the field, mentioning most of the references that the Reviewer suggested.

      (4) Another conceptual point that is a critical omission is the clarification that there are, in fact, known large vs. small transcription factories, or transcriptional clusters, which are specific to stem cells and ”stressed cells”. This distinction was initially established by Ibrahim Cisse’s lab (Science 2018) in mouse Embryonic Stem Cells, and also is seen in two other cases in differentiated cells in response to serum stimulus and in early embryonic development:

      - Mediator and RNA polymerase II clusters associate in transcription-dependent condensates https://www.science.org/doi/10.1126/science.aar4199

      - Nuclear actin regulates inducible transcription by enhancing RNA polymerase II clustering https://www.science.org/doi/10.1126/sciadv.aay6515

      - RNA polymerase II clusters form in line with surface condensation on regulatory chromatin https://www.embopress.org/doi/full/10.15252/msb.202110272

      - If ”morphology” should indeed be discussed, the last paper is a good starting point, especially in combination with this additional paper: Chromatin expansion microscopy reveals nanoscale organization of transcription and chromatin https://www.science.org/doi/10.1126/science.ade5308

      We thank the Reviewer for pointing out the discussion about small and large clusters observed in stressed cells. Our study aims to provide a broader mechanistic explanation of the formation of mixed and demixed TF clusters depending on their size. However, to avoid confusion between our terminology and the classification that is already used for transcription factories in stem and stressed cells, we have now added some comments and references in the revised text.

      (5) The statement “scripts are available upon request” is insufficient by current FAIR standards and seems to be non-compliant with eLife requirements. At a minimum, all, and I mean all, scripts that are needed to produce the simulation outcomes and figures in the paper must be deposited as a publicly accessible Supplement with the article. Better would be if they were structured and sufficiently documented and then deposited in external repositories that are appropriate for the sharing of such program code and models.

      We fully agree with the Reviewer. We have now included in the main text a link to an external repository containing all the codes required to reproduce and analyze the simulations.

      Recommendations for the authors:

      Minor and technical points

      (6) Red, green, and yellow (mix of green and red) is a particularly bad choice of color code, seeing that red-green blindness is the most common color blindness. I recommend to change the color code.

      We appreciate the Reviewer’s thoughtful comment regarding color accessibility. We fully agree that red–green combinations can pose challenges for color-blind readers. In our figures, however, we chose the red–green–yellow color scheme deliberately because it provides strong contrast and intuitive representation for different TF/TU types. To ensure accessibility, we optimized brightness and saturation within red-green schemes and we carefully verified that the chosen hues are distinguishable under the most common forms of color vision deficiency, i.e. trichromatic color blindness, using color-blindness simulation tools (e.g., Coblis).

      How is the dispersing effect of transcriptional activation and ongoing transcription accounted for or expected to affect the model outcome? This affects both transcriptional clusters (they tend to disintegrate upon transcriptional activation) as well as the large scale organization, where dispersal by transcription is also known.

      We thank the Reviewer for this very insightful question. The current versions of both our toy model and the more complex HiP-HoP model do not incorporate the effects of RNA Polymerase elongation. Our primary goal was to develop a minimalistic framework that focuses on investigating TF cluster formation and composition. Nevertheless, we find that this straightforward approach provides a good agreement between simulations and Hi-C and GRO-seq experiments, lending confidence to the reliability of our results concerning TF cluster composition.

      We fully agree, however, that the effects of transcription elongation are an interesting topic for further exploration. For example, modeling RNA Polymerases as active motors that continually drive the system out of equilibrium could influence the chromatin polymer conformation and the structure of TF clusters. Additionally, investigating how interactions between RNA molecules and nuclear proteins, such as SAF-A, might lead to significant changes in 3D chromatin organization and, consequently, transcription [5], is also an intriguing prospect. Although we do not believe that the main findings of our study, particularly regarding cluster composition and mixed-demixed transition, would be impacted by transcription elongation effects, we recognize the importance of this aspect. As such, we have now included some comments in the Conclusions section of the revised manuscript.

      “and make the reasonable assumption that a TU bead is transcribed if it lies within 2.25 diameters (2.25σ) of a complex of the same colour; then, the transcriptional activity of each TU is given by the fraction of time that the TU and a TF:pol lie close together.” How is that justified? I do not see how this is reasonable or not, if you make that statement you must back it up.

      As pointed out by the Referee, we consider a TU to be active if at least one TF is within a distance 2.25σ from that TU. This threshold is slightly larger than the TU-TF interaction cutoff distance, r<sub>c</sub> = 1.8σ. The rationale for this choice is to ensure that, in the presence of a TU cluster surrounded by TFs, TUs that are not directly in contact with a TF are still considered active. Nonetheless, we find that using slightly different thresholds, such as 1.8σ or 1.1σ, leads to comparable results, as shown in Fig. S11, demonstrating the robustness of our analysis.
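The activity definition above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `tu_activity`, the trajectory layout, and the example coordinates are all assumptions made here, and `tf_frames` is assumed to hold only TFs of the same colour as the TU.

```python
import math

def tu_activity(tu_traj, tf_frames, threshold=2.25):
    """Fraction of frames in which at least one TF lies within `threshold` of the TU.

    tu_traj:   list of (x, y, z) positions of one TU, one entry per frame (units of sigma).
    tf_frames: list of frames; each frame is a list of (x, y, z) positions of
               same-colour TFs (hypothetical layout, chosen for illustration).
    """
    if not tu_traj:
        raise ValueError("trajectory must contain at least one frame")
    active_frames = 0
    for tu_pos, tf_frame in zip(tu_traj, tf_frames):
        # The TU counts as transcribed in this frame if any TF is close enough.
        if any(math.dist(tu_pos, tf_pos) <= threshold for tf_pos in tf_frame):
            active_frames += 1
    return active_frames / len(tu_traj)

# Hypothetical four-frame example with a stationary TU at the origin.
example_tu = [(0.0, 0.0, 0.0)] * 4
example_tfs = [
    [(1.0, 0.0, 0.0)],                   # distance 1.0
    [(3.0, 0.0, 0.0)],                   # distance 3.0
    [(0.0, 2.0, 0.0), (5.0, 5.0, 5.0)],  # nearest TF at distance 2.0
    [(0.0, 0.0, 2.5)],                   # distance 2.5
]
activity_default = tu_activity(example_tu, example_tfs)                # threshold 2.25
activity_cutoff = tu_activity(example_tu, example_tfs, threshold=1.8)  # threshold 1.8
```

Re-running the same trajectory with different values of `threshold` is one simple way to probe the kind of robustness the response attributes to Fig. S11.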

      Clearly, close proximity in 1D genomic space favours formation of similarly-coloured clusters. This is not surprising, it is what you built the model to do. Should not be presented as a new insight, but rather as a check that the model does what is expected.

      We believed that this sentence already conveyed that the formation of single-color clusters driven by 1D genomic proximity is not a surprising outcome. However, we have now slightly rephrased it to better emphasize that this is not a novel insight.

      That said, we would like to highlight that while 1D genomic proximity facilitates the formation of clusters of the same color, the unmixed-to-mixed transition in cluster composition is not easily predictable solely from the TU color pattern. Furthermore, in simulations of real chromosomes, where TU patterns are dictated by epigenetic marks, the complexity of these patterns makes it challenging—if not impossible—to predict cluster composition based solely on the input data of our model.

      “…how closely transcriptional activities of different TUs correlate…” Please briefly state over what variable the correlation is carried out, is it cross correlation of transcription activity time courses over time? Would be nice to state here directly in the main text to make it easier for the reader.

      We have now included a brief description in the revised manuscript explaining how the transcriptional correlations were evaluated and how the correlation matrix was constructed.

      “The second concerns how expression quantitative trait loci (eQTLs) work. Current models see them doing so post-transcriptionally in highly-convoluted ways [11, 55], but we have argued that any TU can act as an eQTL directly at the transcriptional level [11].” This text does not actually explain what eQTLs do. I think it should, in concise words.

      We agree with the Referee’s suggestion. We have revised the sentence accordingly and now provide a clear explanation of eQTLs upon their first mention. The revised paragraph now reads as follows:

      “The second concerns how expression quantitative trait loci (eQTLs), genomic regions that are statistically associated with variation in gene expression levels, function. While current models often attribute their effects to post-transcriptional regulation through complex mechanisms [6,7], we have previously argued that any transcriptional unit (TU) can act as an eQTL by directly influencing gene expression at the transcriptional level [7]. Here, we observe individual TUs up-regulating or down-regulating the activity of other TUs – hallmark behaviors of eQTLs that can give rise to genetic effects such as “transgressive segregation” [8]. This phenomenon refers to cases in which alleles exhibit significantly higher or lower expression of a target gene, and can be, for instance, caused by the creation of a non-parental allele with a specific combination of QTLs with opposing effects on the target gene.”

      “In the string with 4 mutations, a yellow cluster is never seen; instead, different red clusters appear and disappear (Fig. 2Eii)…” How should it be seen? You mutated away most of the yellow beads. I think the kymograph is more informative about the general model dynamics, not the effects of mutations. Might be more appropriate to place a kymograph in Figure 1.

      We agree with the Referee that the kymograph is the most appropriate graphical representation for capturing the effects of mutations. Panel 2E already refers to the standard case shown in Figure 1. We have now clarified this both in the caption and in the main text. In addition, we have rephrased the sentence—which was indeed misleading—as follows:

      “From the activity profiles in Fig. 2C, we can observe that as the number of mutations increases, the yellow cluster is replaced by a red cluster, with the remaining yellow TUs in the region being expelled (Fig. 2B(ii)). This behavior is reflected in the dynamics, as seen by comparing panels E(i) and E(ii): in the string with four mutations, transcription of the yellow TUs is inhibited in the affected region, while prominent red stripes—corresponding to active, transcribing clusters—emerge (Fig. 2E(ii)).” We hope that the comparison is now immediately clear to the reader.

      “…but this block fragments in the string with 4 mutations…” I don’t know or cannot see what is meant by ”fragmentation” in the correlation matrix.

      With the sentence “this block fragments in the string with 4 mutations” we mean that the majority of the solid red pixels within the black box become light-red or white once the mutations are applied. We have now added a clarification of this point in the revised manuscript.

      “Fig. 3D shows the difference in correlation between the case with reduced yellow TFs and the case displayed in Fig. 1E.” Can you just place two halves of the different matrices to be compared into the same panel? Similar to Fig. S5. Will be much easier to compare.

      We thank the Referee for this suggestion. We tried to implement this modification, and report the modified figure below (Author response image 1). As we can see, in the new figure it is difficult to spot the details we refer to in the main text, therefore we prefer to keep the original version of the figure.

      Author response image 1.

      Heatmap comparing activity correlations of TUs in the random string under normal conditions (top half) and with reduced yellow-TF concentration (bottom half).

      What is the omnigenic model? It is not introduced.

      We thank the Reviewer for highlighting this important point. The omnigenic model, first introduced by Boyle et al. in Ref. [6], was proposed to explain how complex traits, including disease risk, are influenced by a vast number of genes. According to this model, the genetic basis of a trait is not limited to a small set of core genes whose expression is directly related to the trait, but also includes peripheral genes. The latter, although not directly involved in controlling the trait, can influence the expression of core genes through gene regulatory networks, thereby contributing to the overall genetic influence on the trait. We have now added a few lines in the revised manuscript to explain this point.

      “Additionally, blue off-diagonal blocks indicate repeating negative correlations that reflect the period of the 6-pattern.” How does that look in a kymograph? Does this mean the 6 clusters of same color steal the TFs from the other clusters when they form?

      The intuition of the Referee is indeed correct. The finite number of TFs leads to competition among TUs of the same colour, resulting in anticorrelation: when a group of six nearby TUs of a given colour is active, other, more distant TUs of the same colour are not transcribing due to the lack of available TFs. As the Referee suggested, this phenomenon is visible in the kymograph showing TU activity. In Author response image 2, it can be observed that typically there is a single TU cluster for each of the three colours (yellow, green, and red). These clusters can be long-lived (e.g., the yellow cluster at the center of the kymograph) or may dissolve during the simulation (e.g., the red cluster at the top of the kymograph, which dissolves at t ∼ 600 × 10<sup>5</sup> τ<sub>B</sub>). In the latter case, TFs of the corresponding colour are released into the system and can bind to a different location, forming a new cluster (as seen with the red cluster forming at the bottom of the kymograph for t > 600 × 10<sup>5</sup> τ<sub>B</sub>). This is further discussed at point 2.30 of this Reply, where additional graphical material is provided.

      Author response image 2.

      Kymograph showing the TU activity during a typical run in the 6-pattern case. Each row reports the transcriptional state of a TU during one simulation. Black pixels correspond to inactive TUs, red (yellow, green) pixels correspond to active red (yellow, green) TUs.

      “Conversely, negative correlations connect distant TUs, as found in the single-color model…” But at the most distal range, the negative correlation is lost again! Why leave this out? Your correlation curves show the same equilibration towards no correlation at very long ranges.

      As highlighted in Figure 5Ai, long-range negative correlations (grey segments) predominantly connect distant TUs of the same colour. This is quantified in Figure 5Bi: restricting to same-colour TUs shows that at large genomic separations the correlation is almost entirely negative, with small fluctuations at distances just below 3000 kbp where sampling is sparse; we therefore avoid further interpretation of this regime.

      “These results illustrate how the sequence of TUs on a string can strikingly affect formation of mixed clusters; they also provide an explanation of why activities of human TUs within genomic regions of hundreds of kbp are positively correlated [60].” This is a very nice insight.

      We thank the Reviewer for the very supportive comment.

      “To quantify the extent to which TFs of different colours share clusters, we introduce a demixing coefficient, θ<sub>dem</sub> (defined in Fig. 1).” This is not defined in Fig. 1 or anywhere else here in the main text.

      We thank the Referee for pointing this out. For a given cluster, the demixing coefficient is defined as

      where n is the number of colors, i indexes each color present in the model, and x<sub>i,max</sub> is the largest fraction of TFs of the same i-th color in a single TF cluster.

      The demixing coefficient is defined in the Methods section; therefore, we have replaced “defined in Fig. 1” with “see Methods for definition”.
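The displayed equation does not appear in this text. As a hedged illustration only, the sketch below uses one normalization that is consistent with the limits quoted elsewhere in this response (θ<sub>dem</sub> = 1 for a single-colour cluster, θ<sub>dem</sub> = 0 for an equal mixture of all colours), namely θ<sub>dem</sub> = (n x<sub>max</sub> − 1)/(n − 1). This specific form, the function name, and the example counts are assumptions made here; Eq. 1 in the Methods of the manuscript remains the authoritative definition.

```python
def demixing_coefficient(counts):
    """Assumed per-cluster demixing coefficient, (n * x_max - 1) / (n - 1).

    counts: number of TFs of each colour in one cluster, one entry per colour
            in the model (zeros included for absent colours).
    Returns 1.0 for a single-colour cluster and 0.0 for an equal mixture.
    """
    n = len(counts)
    total = sum(counts)
    if n < 2:
        raise ValueError("need at least two colours in the model")
    if total == 0:
        raise ValueError("cluster contains no TFs")
    # Algebraically identical to (n * x_max - 1) / (n - 1) with x_max = max / total,
    # but written with integers in the numerator to avoid rounding at the limits.
    return (n * max(counts) - total) / ((n - 1) * total)

# Hypothetical clusters in a three-colour model.
pure_cluster = demixing_coefficient([5, 0, 0])    # only one colour present
mixed_cluster = demixing_coefficient([4, 4, 4])   # all colours in equal number
```

Averaging this quantity over clusters of a given size would then give a curve of mean θ<sub>dem</sub> against cluster size, under the same assumption about the formula.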

      “Mixing is facilitated by the presence of weakly-binding beads, as replacing them with non-interacting ones increases demixing and reduces long-range negative correlations (Figure S3). Therefore, the sequence of strong and weak binding sites along strings determines the degree of mixing, and the types of small-world network that emerge. If eQTLs also act transcriptionally in the way we suggest [11], we predict that down-regulating eQTLs will lie further away from their targets than up-regulating ones.” Going into these side topics and minor points here is super distracting and waters down the message. Maybe first deal with the main conclusions on mixed vs demixed clusters in dependence on the strong and specific binding site patterns, before dealing with other additional points like the role of weak binding sites.

      Thank you for the suggestion. We have now changed the paragraph to highlight the main results. The new paragraph reads as follows: “These results on activity correlation and TF cluster composition suggest that, if eQTLs act transcriptionally as expected [7], down-regulating eQTLs are likely to be located further from their target genes than up-regulating ones. In addition, it is important to note that mixing is promoted by the presence of weakly binding beads; replacing these with non-interacting ones leads to increased demixing and a reduction in long-range negative correlations (Figure S3). More generally, our findings indicate that the presence of multiple TF colors offers an effective mechanism to enrich and fine-tune transcriptional regulation.”

      “…provides a powerful pathway to enrich and modulate transcriptional regulation.” Before going into the possible meaning and implications of the results, please discuss the results themselves first.

      See previous point.

      Figure 5B. Does activation typically coincide with spatial compaction of the binding sites into a small space or within the confines of a condensate? My guess would be that colocalization of the other color in a small space is what leads to the mixing effect?

      As the Reviewer correctly noted, the activity of a given TU is indeed influenced by the presence of nearby TUs of the same color, since their proximity facilitates the recruitment of additional TFs and enhances the overall transcriptional activity. In this context, the mixing effect is certainly affected by the 1D arrangement of TUs along the chromatin fiber. As emphasized in the revised manuscript, when domains of same-color TUs are present (as in the 6-pattern string), the degree of demixing is greater compared to the case where TUs of different colors alternate and large domains are absent (as in the 1-pattern string). This difference in the demixing parameter as a function of the 1D TU arrangement is clearly visible in Fig. S2B.

      “…euchromatic regions blue, and heterochromatic ones grey.” Please also explain what these color monomers mean in terms of non specific interactions with the TFs.

      Generally, in our simulation approach we assume euchromatin regions to be more open and accessible to transcription factors, whereas heterochromatin corresponds to more compacted chromatin segments [9]. To reflect this, we introduce weak, non-specific interactions between euchromatin and TFs, while heterochromatin interacts with TFs only through steric effects. To clarify this point, we have now slightly revised the caption of Fig. 6.

      “More quantitatively, Spearman’s rank correlation coefficient is 3.66 × 10<sup>−1</sup>, which compares with 3.24 × 10<sup>−1</sup> obtained previously using a single-colour model [11].” This comparison does not tell me whether the improvement in model performance justifies an additional model component. There are other, likelihood-based approaches to assess whether a model fits better to a relevant extent by adding a free model parameter. Can these be used for a more conclusive comparison? Besides, a correlation of 0.36 does not seem so good?

      We understand the Reviewer’s concern that the observed increase in the activity correlation may not appear to provide strong evidence for the improvement of the newly introduced model. However, within the context of polymer models developed to study realistic gene transcription and chromatin organization, this type of correlation analysis is a widely accepted approach for model validation. Experimental data commonly used for such validation include Hi-C maps, FISH experiments, and GRO-seq data [10,11]. The first two are typically employed to assess how accurately the model reproduces the 3D folding of chromatin; a comparison between experimental and simulated Hi-C maps is provided in the Supplementary Information (Fig. S5), showing a Pearson correlation of 0.7. GRO-seq or RNA-seq data, on the other hand, are used to evaluate the model’s ability to predict gene transcription levels. To date, the highest correlation for transcriptional activity data has been achieved by the HiP-HoP model at a resolution of 1 kbp [10], reporting a Spearman correlation of 0.6. Therefore, the correlation obtained with our 2-color model represents a good level of agreement when compared with the more complex HiP-HoP model. In this context, the observed increase in correlation—from 0.324 to 0.366—can be regarded as a modest yet meaningful improvement.
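For readers less familiar with the metric being compared (0.366 versus 0.324), the following is a minimal, dependency-free sketch of Spearman's rank correlation: rank both data sets (averaging ranks over ties) and take the Pearson correlation of the ranks. The function names and example values are hypothetical and are not taken from the authors' analysis pipeline.

```python
import math

def average_ranks(values):
    """1-based ranks of `values`, with tied entries sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical simulated activities versus experimental read counts for five TUs.
simulated = [0.10, 0.40, 0.35, 0.80, 0.05]
measured = [12.0, 30.0, 55.0, 90.0, 3.0]
rho = spearman(simulated, measured)
```

Because only ranks enter the calculation, the measure is insensitive to any monotonic rescaling of either data set, which is one reason it suits comparisons between simulated activities and sequencing read counts.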

      “…consequently, use of an additional color provides a statistically significant improvement (p-value < 10<sup>−6</sup>, 2-sided t-test).” I do not follow this argument. Given enough simulation repeats, any improvement, no matter how small, will lead to statistically significant improvements.

      We agree that this sentence could be misleading. We have now rephrased it in a clearer manner specifying that each of the two correlation values is statistically significant alone, while before we were wrongly referring to the significance of the improvement.
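The Reviewer's point about sample size can be made concrete with the standard t-statistic for testing a correlation coefficient against zero, t = r √((n − 2)/(1 − r²)), which is commonly applied to rank correlations as an approximation. The sketch below is illustrative only: the sample sizes are hypothetical, and nothing here reproduces the authors' actual test.

```python
import math

def correlation_t_statistic(r, n):
    """t-statistic for H0: correlation = 0, given coefficient r from n paired points."""
    if n < 3:
        raise ValueError("need at least three paired observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# For a fixed coefficient, t grows roughly as sqrt(n); the sample sizes are hypothetical.
t_small_sample = correlation_t_statistic(0.366, 100)
t_large_sample = correlation_t_statistic(0.366, 10000)
```

Since t scales with √n at fixed r, a very small p-value indicates that a correlation differs from zero, but says little about whether the difference between 0.324 and 0.366 is large enough to matter.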

      “Additionally, simulated contact maps show a fair agreement with Hi-C data (Figure S5), with a Pearson correlation r ∼ 0.7 (p-value < 10<sup>−6</sup>, 2-sided t-test).” Nice!

      We thank the Reviewer for the positive comment.

      “Because we do not include heterochromatin-binding proteins, we should not however expect a very accurate reproduction of Hi-C maps: we stress that here instead we are interested in active chromatin, transcription and structure only as far as it is linked to transcription.” Then why do you not limit your correlation assessment to only these regions to show that these are very well captured by your model?

      We thank the Reviewer for this insightful comment. Indeed, we could have restricted our investigation to active chromatin regions, as done in our previous works [11,12]. However, our intention in this section of the manuscript was to clarify that the current model is relatively simple and therefore not expected to achieve a very high level of agreement between experimental and simulated Hi-C maps. Another important limitation of the 2-color model described in this section is the absence of active loop extrusion mediated by SMC proteins, which is known to play a central role in establishing TAD boundaries. Consequently, even if our analysis were limited to active chromatin regions, the agreement with experimental Hi-C maps would still remain lower than that obtained with more comprehensive models, such as HiP-HoP, which we use in the last section of the paper. We have now added a comment in the revised manuscript explicitly noting the lack of active loop extrusion in our 2-color model.

      “We also measure the average value of the demixing coefficient, θ<sub>dem</sub> (Materials and Methods). If θ<sub>dem</sub> = 1, this means that a cluster contains only TFs of one colour and so is fully demixed; if θ<sub>dem</sub> = 0, the cluster contains a mixture of TFs of all colors in equal number, and so is maximally mixed.” Repetitive.

      We have now rephrased the sentence in a more concise way.

      “…notably, this is similar to the average number of productively transcribing pols seen experimentally in a transcription factory [6].” That seems a bit fast and loose. The number of Polymerases can differ depending on state, type of factory, gene etc. and vary between anything from to a few hundreds of Polymerase complexes depending on definition of factory, and what is counted as active. Also, one would think that polymerases only make up a small part of the overall protein pool that constitutes a condensate, so it is unclear whether this is a pertinent estimate.

      Here we refer to the average size of what is normally referred to as a PolII factory, not a generic nuclear condensate. These are the clusters that arise in our simulations. These structures emerge through microphase separation and have been well characterised; see, for instance, [13] for a recent review. For these structures, while there is a distribution, the average is well defined and corresponds to a size of about 100 nm, which is very much in line with the size of the clusters we observe, both in terms of 3D diameter and number of participating proteins. Because of this size, the number of active complexes that can contribute cannot be significantly more than ∼ 10. These estimates are, we note, very much in line with super-resolution measurements of SAF-A clusters [14], which are associated with active transcription; hence it is reasonable to assume that they colocalise with RNA and polymerase clusters.

      “Conversely, activities of similar TUs lying far from each other on the genetic map are often weakly negatively correlated, as the formation of one cluster sequesters some TFs to reduce the number available to bind elsewhere.” This point is interesting, and I strongly suspect that this indeed happening. But I don’t think it was shown in the analysis of the simulation results in sufficient clarity. We need direct assessment of this sequestration, currently it’s only indirectly inferred.

      Indeed, this is the mechanism underlying the emergence of negative long-range correlations among TU activity values. As the Reviewer correctly pointed out, the competition for a finite number of TFs was only indirectly inferred in the original manuscript. To address this, we have now included a new figure explicitly illustrating this effect. In Fig. S12, we show the kymograph of active TUs (left panel), as in Fig. 2E(i) of the main text, alongside a new kymograph depicting the number of green TFs within a sphere of radius 10σ centered on each green TU (right panel). For simplicity, we focus here only on green TUs and TFs. It can be observed that, during the initial part of the simulation, green TFs are localized near genomic position ∼ 2000 (right panel), where green TUs are transcriptionally active (left panel). Toward the end of the simulation, TUs near genomic position ∼ 500 become active, coinciding with the relocation of TFs to this region and the depletion of the previous one.
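The TF-count kymograph described for Fig. S12 can be sketched as a simple neighbour count per TU and per frame. This is a minimal illustration under assumed data layouts; the function name, the argument structure, and the toy coordinates are hypothetical and do not come from the authors' repository.

```python
import math

def tf_count_kymograph(tu_frames, tf_frames, radius=10.0):
    """Count TFs within `radius` of each TU in every frame.

    tu_frames: list of frames; each frame is a list of (x, y, z) TU positions.
    tf_frames: list of frames; each frame is a list of (x, y, z) TF positions.
    Returns a matrix with one row per TU and one column per frame.
    """
    n_frames = len(tu_frames)
    n_tus = len(tu_frames[0])
    kymograph = [[0] * n_frames for _ in range(n_tus)]
    for frame, (tus, tfs) in enumerate(zip(tu_frames, tf_frames)):
        for tu_index, tu_pos in enumerate(tus):
            # Number of TFs inside the sphere centred on this TU.
            kymograph[tu_index][frame] = sum(
                1 for tf_pos in tfs if math.dist(tu_pos, tf_pos) <= radius
            )
    return kymograph

# Toy example: two distant TUs; both TFs move from the first TU to the second.
toy_tus = [[(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]] * 2
toy_tfs = [
    [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],      # frame 0: TFs near the first TU
    [(99.0, 0.0, 0.0), (101.0, 0.0, 0.0)],   # frame 1: TFs near the second TU
]
toy_kymograph = tf_count_kymograph(toy_tus, toy_tfs)
```

In the toy example the counts shift from the first row to the second between frames, which is the signature of sequestration described above: a gain of TFs at one site accompanied by depletion at another.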

      In the definition for the demixing coefficient (equation 1), what does the index i stand for?

      Here i is an index denoting each of the colors present in the model. We have now specified the meaning of i after Eq. 1.

      Reviewer 3 (Public Review):

      In this work, the authors present a chromatin polymer model with some specific pattern of transcription units (TUs) and diffusing TFs; they simulate the model and study TF clustering, mixing, gene expression activity, and their correlations. First, the authors designed a toy polymer with colored beads of a random type, placed periodically (every 30 beads, or 90 kb). These colored beads are considered transcription units (TUs). Same-colored TUs attract each other, mediated by similarly colored diffusing beads considered as TFs. This led to clustering (condensation of beads) and correlated (or anti-correlated) ”gene expression” patterns. Beyond the toy model, when the authors introduce TUs in a specific pattern, it leads to the emergence of specialized and mixed clusters of different TFs. Human chromatin models with a realistic distribution of TUs also lead to the mixing of TFs when cluster size is large.

      Strengths.

      This is a valuable polymer model for chromatin with a specific pattern of TUs and diffusing TF-like beads. Simulation of the model tests many interesting ideas. The simulation study is convincing and the results provide solid evidence showing the emergence of mixed and demixed TF clusters within the assumptions of the model.

      Weaknesses.

      Weakness of the work: The model has many assumptions. Some of the assumptions are a bit too simplistic. Concerns about the work are detailed below:

      We thank the Referee for this overall positive evaluation.

      We thank the Referee for this important observation. The way we The authors assume that when the diffusing beads (TFs) are near a TU, the gene expression starts. However, mammalian gene expression requires activation by enhancer-promoter looping and other related events. It is not a simple diffusion-limited event. Since many of the conclusions are derived from expression activity, will the results be affected by the lack of looping details?

      We do not need to assume promoter-enhancer contact: it emerges naturally through bridging-induced phase separation, which is a key strength of our model. Although looping is not imposed as a prerequisite for transcriptional initiation, in practice the vast majority of events in which a TF is near a TU occur within a cluster where regulatory elements are looped. Transcription in our model is therefore associated with bridging-induced phase separation, and looping is not missing: it accompanies transcription as an emergent property of the model rather than an assumption. Accordingly, both contact maps and transcriptional activity are well predicted by our model, both in the version described here and in the more sophisticated single-colour HiP-HoP model [10] (an important ingredient of which is the bridging-induced phase separation).

      Authors neglect protein-protein interactions. Without protein-protein interactions, condensate formation in natural systems is unlikely to happen.

      We thank the Reviewer for pointing out the absence of protein-protein interactions in our simulations. While we acknowledge this limitation, we would like to emphasize that experimental studies have not observed nuclear proteins forming condensates at physiological concentrations in the absence of DNA or chromatin. For example, studies such as Ryu et al. [15] and Shakya et al. [16] show that protein-protein interactions alone are insufficient to drive condensate formation in vivo. Instead, the presence of a substrate, such as DNA or chromatin, is essential to favor and stabilize the formation of protein clusters.

      In our simulations, we propose that protein liquid-liquid phase separation (LLPS) is driven by the presence of both strong and weak attractions between multivalent protein complexes and the chromatin filament. As stated in our manuscript, the mechanism leading to protein cluster formation is the bridging induced attraction. This mechanism involves a positive feedback loop, where protein binding to chromatin induces a local increase in chromatin density, which then attracts more proteins, further promoting cluster formation.

      While we acknowledge that protein-protein interactions could be incorporated into our simulations, we believe they would need to be weak to remain consistent with experimental data. Additionally, incorporating such interactions would not alter the conclusions of our study.

      What is described in this paper is a generic phenomenon; many kinds of multivalent chromatin-binding proteins can form condensates/clusters as described here. For example, if we replace different color TUs with different histone modifications and different TFs with Hp1, PRC1/2, etc, the results would remain the same, wouldn’t they? What is specific about transcription factor or transcription here in this model? What is the logic of considering 3kb chromatin as having a size of 30 nm? See Kadam et al. (Nature Communications 2023). Also, DNA paint experimental measurement of 5kb chromatin is greater than 100 nm (see work by Boettiger et al.).

      We thank the Reviewer for this important observation, which we now address. To begin, we consider the toy model introduced in the first part of the manuscript, where TUs are randomly positioned rather than derived from epigenetic data. As the Reviewer points out, in this simplified context, our results reflect a generic phenomenon: the composition of clusters depends primarily on their size, independent of the specific types of proteins involved. However, the main goal of our work is to gain insights into apparently contradictory experimental findings, which show that some transcription factories consist of a single type of transcription factor, while others contain multiple types. This led us to focus on TF clusters and their role in transcriptional regulation and co-regulation of distant genes. Therefore, in the second part of the manuscript, we use DNase I hypersensitive site (DHS) data to position TUs based on predicted TF binding sites, providing a more biological framework. In both the toy model and the more realistic HiP-HoP model, we observe a size-dependent transition in cluster composition. However, we refrain from generalizing these results to clusters composed of other protein complexes, such as HP1 and PRC, as their binding is governed by distinct epigenetic marks (e.g. H3K9me3 and H3K27me3), which exhibit different genomic distributions compared to DHS marks.

      Finally, the mapping of 3 kbp to 30 nm is an estimate which does not significantly impact our conclusions. The relationship between genomic distance (in kbp) and spatial distance (in nm) is highly dependent on the degree of chromatin compaction, which can vary across cell types and genomic context. As such, providing an exact conversion is challenging [17]. For example, in a previous work based on the HiP-HoP model [12] we compared simulated and experimental FISH measurements and found that 1 kbp typically corresponds to 15–20 nm, implying that 3 kbp could span up to 60 nm. Nevertheless, we emphasize that varying this conversion factor does not affect the core results or conclusions of our study. We have now included a clarification in the revised SI to highlight this point.
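
      As a rough illustration of the conversion discussed above, the following sketch computes the spatial extent implied for one 3 kbp bead. It assumes a simple linear scaling between genomic and spatial distance, which is a simplification introduced here, and the names `span_nm`, `NM_PER_KBP_RANGE` and `BEAD_SIZE_KBP` are hypothetical.

```python
# Assumed conversion range quoted in the text: 1 kbp spans roughly 15-20 nm.
NM_PER_KBP_RANGE = (15.0, 20.0)
BEAD_SIZE_KBP = 3.0  # one simulation bead represents 3 kbp


def span_nm(length_kbp, nm_per_kbp):
    """Spatial extent (nm) of a chromatin segment, assuming linear scaling."""
    return length_kbp * nm_per_kbp


# Lower and upper estimates for the extent of a single bead.
low_nm, high_nm = (span_nm(BEAD_SIZE_KBP, f) for f in NM_PER_KBP_RANGE)
print(f"3 kbp spans about {low_nm:.0f}-{high_nm:.0f} nm")
```

      Under these assumptions the estimate is 45 to 60 nm, the same order of magnitude as the 30 nm bead size used in the simulations.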

      Recommendations for the authors:

      Other points.

      Figure 1(D) caption says 2.25σ = 1.6 nanometer. Is this a typo? Sigma is 30nm.

      Yes, it was. As 1σ ∼ 30 nm, we have 2.25σ = 2.25 · 30 nm = 67.5 nm ∼ 6.8 × 10<sup>−8</sup> m. We have now corrected the caption.

      Page 6, column 2nd, 3rd para, it is written that θ<sub>dem</sub> (”defined in Fig.1”). There is no θ<sub>dem</sub> defined in Fig.1, is there? I can see it defined in Methods but not in Fig. 1.

      Correct, we replaced (defined in Fig.1) with (see Methods for definition).

      Page 6, column 2, 4th para: what do “correlations overlap” and “correlations diverge” mean?

      With reference to the plots in Fig. 5B, “overlap” and “diverge” simply refer to whether the same-colour (red curves) and different-colour (blue curves) correlation trends coincide or separate. We have now clarified this point.

      What is the precise definition of correlation in Fig 5B (Y-axis)?

      In Fig.5B, correlation means Pearson correlation. We have now specified this point in the revised text and in the caption of Fig.5.
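
      For reference, the standard definition of the Pearson coefficient is given below. The symbols x_t and y_t are notation introduced here for illustration: they stand for whichever two paired series are being compared (for instance, the activities of two TUs sampled at the same time points), with means x̄ and ȳ.

```latex
r_{xy} = \frac{\sum_{t} (x_t - \bar{x})(y_t - \bar{y})}
              {\sqrt{\sum_{t} (x_t - \bar{x})^2}\,\sqrt{\sum_{t} (y_t - \bar{y})^2}}
```

      The coefficient ranges from −1 (perfect anti-correlation) to +1 (perfect correlation), with 0 indicating no linear relationship.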

      References

      (1) S. A. Quinodoz, J. W. Jachowicz, P. Bhat, N. Ollikainen, A. K. Banerjee, I. N. Goronzy, M. R. Blanco, P. Chovanec, A. Chow, Y. Markaki et al., “RNA promotes the formation of spatial compartments in the nucleus,” Cell, vol. 184, no. 23, pp. 5775–5790, 2021.

      (2) R. A. Beagrie, A. Scialdone, M. Schueler, D. C. Kraemer, M. Chotalia, S. Q. Xie, M. Barbieri, I. de Santiago, L.-M. Lavitas, M. R. Branco et al., “Complex multi-enhancer contacts captured by genome architecture mapping,” Nature, vol. 543, no. 7646, pp. 519–524, 2017.

      (3) R. A. Beagrie, C. J. Thieme, C. Annunziatella, C. Baugher, Y. Zhang, M. Schueler, A. Kukalev, R. Kempfer, A. M. Chiariello, S. Bianco et al., “Multiplex-GAM: genome-wide identification of chromatin contacts yields insights overlooked by Hi-C,” Nature Methods, vol. 20, no. 7, pp. 1037–1047, 2023.

      (4) L. Liu, B. Zhang, and C. Hyeon, “Extracting multi-way chromatin contacts from Hi-C data,” PLOS Computational Biology, vol. 17, no. 12, p. e1009669, 2021.

      (5) R.-S. Nozawa, L. Boteva, D. C. Soares, C. Naughton, A. R. Dun, A. Buckle, B. Ramsahoye, P. C. Bruton, R. S. Saleeb, M. Arnedo et al., “SAF-A regulates interphase chromosome structure through oligomerization with chromatin-associated RNAs,” Cell, vol. 169, no. 7, pp. 1214–1227, 2017.

      (6) E. A. Boyle, Y. I. Li, and J. K. Pritchard, “An expanded view of complex traits: from polygenic to omnigenic,” Cell, vol. 169, no. 7, pp. 1177–1186, 2017.

      (7) C. Brackley, N. Gilbert, D. Michieletto, A. Papantonis, M. Pereira, P. Cook, and D. Marenduzzo, “Complex small-world regulatory networks emerge from the 3D organisation of the human genome,” Nat. Commun., vol. 12, no. 1, pp. 1–14, 2021.

      (8) R. B. Brem and L. Kruglyak, “The landscape of genetic complexity across 5,700 gene expression traits in yeast,” Proceedings of the National Academy of Sciences, vol. 102, no. 5, pp. 1572–1577, 2005.

      (9) M. Chiang, C. A. Brackley, D. Marenduzzo, and N. Gilbert, “Predicting genome organisation and function with mechanistic modelling,” Trends in Genetics, vol. 38, no. 4, pp. 364–378, 2022.

      (10) M. Chiang, C. A. Brackley, C. Naughton, R.-S. Nozawa, C. Battaglia, D. Marenduzzo, and N. Gilbert, “Genome-wide chromosome architecture prediction reveals biophysical principles underlying gene structure,” Cell Genomics, vol. 4, no. 12, 2024.

      (11) A. Buckle, C. A. Brackley, S. Boyle, D. Marenduzzo, and N. Gilbert, “Polymer simulations of heteromorphic chromatin predict the 3D folding of complex genomic loci,” Mol. Cell, vol. 72, no. 4, pp. 786–797, 2018.

      (12) G. Forte, A. Buckle, S. Boyle, D. Marenduzzo, N. Gilbert, and C. A. Brackley, “Transcription modulates chromatin dynamics and locus configuration sampling,” Nature Structural & Molecular Biology, vol. 30, no. 9, pp. 1275–1285, 2023.

      (13) P. R. Cook and D. Marenduzzo, “Transcription-driven genome organization: a model for chromosome structure and the regulation of gene expression tested through simulations,” Nucleic acids research, vol. 46, no. 19, pp. 9895–9906, 2018.

      (14) M. Marenda, D. Michieletto, R. Czapiewski, J. Stocks, S. M. Winterbourne, J. Miles, O. C. Flemming, E. Lazarova, M. Chiang, S. Aitken et al., “Nuclear RNA forms an interconnected network of transcription-dependent and tunable microgels,” bioRxiv, pp. 2024–06, 2024.

      (15) J.-K. Ryu, C. Bouchoux, H. W. Liu, E. Kim, M. Minamino, R. de Groot, A. J. Katan, A. Bonato, D. Marenduzzo, D. Michieletto et al., “Bridging-induced phase separation induced by cohesin SMC protein complexes,” Science Advances, vol. 7, no. 7, p. eabe5905, 2021.

      (16) A. Shakya, S. Park, N. Rana, and J. T. King, “Liquid-liquid phase separation of histone proteins in cells: role in chromatin organization,” Biophysical journal, vol. 118, no. 3, pp. 753–764, 2020.

      (17) A.-M. Florescu, P. Therizols, and A. Rosa, “Large scale chromosome folding is stable against local changes in chromatin structure,” PLoS computational biology, vol. 12, no. 6, p. e1004987, 2016.

    1. Reviewer #2 (Public review):

      Summary:

      The authors' work focuses on studying cell morphological changes during differentiation of hPSCs into neural progenitors in a 2D monolayer setting. The authors use genetic mutations in VANGL2 and patient-derived iPSCs to show that (1) human phenotypes can be captured in the 2D differentiation assay, and (2) VANGL2 in humans is required for neural contraction, which is consistent with previous studies in animal models. The results are solid and convincing, the data are quantitative, and the manuscript is well written. The 2D model they present successfully addresses the questions posed in the manuscript. However, the broad impact of the model may be limited, as it does not contain NNE cells and does not exhibit tissue folding or tube closure, as seen in neural tube formation. Patient-derived lines are derived from amniotic fluid cells, and the experiments are performed before birth, which I find to be a remarkable achievement, showing the future of precision medicine.

      Major comments:

      (1) Figure 1. The authors use F-actin to segment cell areas. Perhaps this could be done more accurately with ZO-1, as F-actin cables can cross the surface of a single cell. In any case, the authors need to show a measure of segmentation precision: segmented image vs. raw image plus a nuclear marker (DAPI, H2B-GFP), so we can check that the number of segmented cells matches the number of nuclei.

      (2) Lines 156-166. The authors claim that changes in gene expression precede morphological changes. I am not convinced this is supported by their data. Fig. 1g (epithelial thickness) and Fig. 1k (PAX6 expression) seem to have similar dynamics. The authors can perform a cross-correlation between the two plots to see which Δt gives maximum correlation. If Δt < 0, then it would suggest that gene expression precedes morphology, as they claim. Fig. 1j shows that NANOG drops before the morphological changes, but loss of NANOG is not specific to neural differentiation and therefore should not be related to the observed morphological changes.

      (3) Figure 2d. The laser ablation experiment in the presence of ROCK inhibitor is clear, as I can easily see the cell outlines before and after the experiment. In the absence of ROCK inhibitor, the cell edges are blurry, and I am not convinced the outline that the authors drew is really the cell boundary. Perhaps the authors can try to ablate a larger cell patch so that the change in area is more defined.

      (4) Figure 2d. Do the cells become thicker after recoil?

      (5) Figure 3. The authors mention their previous study in which they show that Vangl2 is not cell-autonomously required for neural closure. It will be interesting to study whether this is also the case in the present human model by using mosaic cultures.

      (6) Lines 403-415. The authors report poor neural induction and neuronal differentiation in GOSB2. As far as I understand, this phenotype does not represent the in vivo situation. Thus, it is not clear to what extent the in vitro 2D model describes the human patient.

      (7) The experimental feat to derive cell lines from amniotic fluid and to perform experiments before birth is, in my view, heroic. However, I do not feel I learned much from the in vitro assays. There are many genetic changes that may cause the in vivo phenotype in the patient. The authors focus on MED24, but there is not enough convincing evidence that this is the key gene. I would like to suggest overexpression of MED24 as a rescue experiment, but I am not sure this is a single-gene phenotype. In addition, the fact that one patient line does not differentiate properly leads me to think that the patient lines do not strengthen the manuscript, and that perhaps additional clean mutations might contribute more.

      Significance:

      This study establishes a quantitative, reproducible 2D human iPSC-to-neural-progenitor platform for analyzing cell-shape dynamics during differentiation. Using VANGL2 mutations and patient-derived iPSCs, the work shows that (1) human phenotypes can be captured in a 2D differentiation assay and (2) VANGL2 is required for neural contraction (apical constriction), consistent with animal studies. The results are solid, the data are quantitative, and the manuscript is well written. Although the planar system lacks non-neural ectoderm and does not exhibit tissue folding or tube closure, it provides a tractable baseline for mechanistic dissection and genotype-phenotype mapping. The derivation of patient lines from amniotic fluid and execution of experiments before birth is a remarkable demonstration that points toward precision-medicine applications, while motivating rescue strategies and additional clean genetic models. However, overall, I did not learn anything substantively new from this manuscript; the conclusions largely corroborate prior observations rather than extend them. In addition, the model was unsuccessful in one of the two patient-derived lines, which limits generalizability and weakens claims of patient-specific predictive value.

    2. Author response:

      General Statements

      In this manuscript we characterize an exquisitely reproducible model of iPSC differentiation into neuroepithelial cells, use it to mechanistically study cell shape changes and planar cell polarity signaling activation during this transition, then apply it to identify patient-specific cell deficiencies in both forward and reverse genetic screens as a powerful tool for patient-stratification in personalized medicine. To our knowledge, we provide the first evidence of a human pathogenic mutation directly impairing apical constriction: an evolutionarily conserved behavior of epithelial cells which is the subject of intense research.

      We are very pleased with the balanced and rigorous reviews generated through Review Commons, which we have already used to improve our manuscript. Reviewer 1 highlights that our study “is significant not only for verifying the cell behaviors necessary for neural tube closure in a human iPSC model, but also for establishing a robust assay for the functional testing of NTD-associated sequence variants.” Reviewer 2 agrees that “results are solid and convincing, the data are quantitative, and the manuscript is well written”, and that our “derivation of patient lines from amniotic fluid and execution of experiments before birth is a remarkable demonstration that points toward precision-medicine applications, while motivating rescue strategies and additional clean genetic models.” Reviewer 3 is “enthusiastic about this work and believe it represents a significant step forward in the effort to establish precision medicine approaches for diagnoses of the patient-specific causative cellular defects underlying human neural tube closure defects.” 

      Below, we have replied to each of the reviewers’ comments.

      Description of the planned revisions

      R2.2. Lines 156-166. The authors claim that changes in gene expression precede morphological changes. I am not convinced this is supported by their data. Fig. 1g (epithelial thickness) and Fig. 1k (PAX6 expression) seem to have similar dynamics. The authors can perform a cross-correlation between the two plots to see which Δt gives maximum correlation. If Δt < 0, then it would suggest that gene expression precedes morphology, as they claim. Fig. 1j shows that NANOG drops before the morphological changes, but loss of NANOG is not specific to neural differentiation and therefore should not be related to the observed morphological changes.

      We are happy to do this analysis fully in revision. Our initial analysis performing cross-correlation between apical area and CDH2 protein in one line shows the highest cross-correlation at Δt = -1, suggesting neuroepithelial CDH2 increases before apical area decreases. In contrast, the same analysis comparing apical area versus PAX6 shows Δt = 0, suggesting concurrence. This analysis will be expanded to include the other markers we quantified and the manuscript text amended accordingly. We are keen to undertake additional experiments to test whether these cells swap their key cadherins (CDH1 and CDH2) before they begin to undergo morphological changes (see the response to Reviewer 3’s minor comment 1 immediately below).
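
      The lag analysis described above can be sketched as follows. This is a minimal illustration, not the analysis pipeline used for the manuscript: the function name `lagged_correlation`, the sign convention for the lag, and the synthetic daily values are all assumptions introduced here.

```python
import numpy as np


def lagged_correlation(x, y, max_lag):
    """Pearson correlation of x[t + lag] with y[t] for each integer lag.

    With this (assumed) sign convention, a peak at a negative lag means
    that changes in x occur before the matching changes in y.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must be sampled at the same time points")
    correlations = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag < 0:
            xs, ys = x[:lag], y[-lag:]   # pair early x with later y
        elif lag > 0:
            xs, ys = x[lag:], y[:-lag]   # pair later x with early y
        else:
            xs, ys = x, y
        correlations[lag] = float(np.corrcoef(xs, ys)[0, 1])
    return correlations


# Synthetic daily measurements (illustrative values, not data from the study).
marker = [0, 0, 1, 3, 6, 8, 9, 9, 9]       # e.g. a marker's intensity
morphology = [0, 0, 0, 1, 3, 6, 8, 9, 9]   # same curve, delayed by one day

correlations = lagged_correlation(marker, morphology, max_lag=2)
best_lag = max(correlations, key=correlations.get)
print(best_lag)  # -1: the marker leads the morphological readout by one day
```

      For a pair of readouts that move in opposite directions, such as a rising marker and a shrinking apical area, one series could be negated first, or the lag with the largest absolute correlation could be reported instead.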

      R3.1(Minor) There seems to be a critical window at day 5 of the differentiation protocol, both in terms of cell morphology and the marker panel presented in Figure 1i. Do the authors have any data spanning the hours from day 5 to 6? If not, I don't think they need to generate any, but I do think this is a very interesting window worthy of further discussion for a couple of reasons. First, several studies of mouse neural tube closure have shown that various aspects of cell remodeling are temporally separable. For example, between Grego-Bessa et al 2016 and Brooks et al 2020 we can infer that apicobasal elongation rapidly increases starting at E8.5, whereas apical surface area reduction and constriction are apparent somewhat earlier at E8.0. I think it would be interesting to see if this separability is conserved in humans. Second, is there a sense of how the temporal correlation between the pluripotent and early neural fate marker data presented here corroborates or contradicts the emerging set of temporally resolved RNA seq data sets of mouse development at equivalent early neural stages?

      Cell shape analysis between days 5 and 6 has now been added (see the response to point 2.1 below). As the reviewer predicted, this is a transition point when apical area begins to decrease and apicobasal elongation begins to increase.

      We also thank the reviewer for this prompt to more closely compare our data to the previous mouse publications, which we have added to the discussion. The Grego-Bessa 2016 paper appears to show an increase in thickness between E7.75 and E8.5, but these are not statistically compared. Previous studies showed rapid apicobasal elongation during the period of neural fold elevation, when neuroepithelial cells apically constrict. This has now been added to the discussion: 

      Discussion: “In mice, neuroepithelial apicobasal thickness is spatially-patterned, with shorter cells at the midline under the influence of SHH signalling[14,77,78]. Apicobasal thickness of the cranial neural folds increases from ~25 µm at E7.75 to ~50 µm at E8.5[79]: closely paralleling the elongation between days 2 and 8 of differentiation in our protocol. The rate of thickening is non-uniform, with the greatest increase occurring during elevation of the neural folds[80], paralleled in our model by the rapid increase in thickness between days 4-6 as apical areas decrease. Elevation requires neuroepithelial apical constriction and these cells’ apical area also decreases between E7.75 and E8.5 in mice[79], but we and others have recently shown that this reduction is both region and sex-specific[14,81]. Specifically, apical constriction occurs in the lateral (future dorsal) neuroepithelium: this corresponds with the identity of the cells generated by the dual SMAD inhibition model we use[56]. More recently, Brooks et al[82] showed that the rapid reduction in apical area from E8-E8.5 is associated with cadherin switching from CDH1 (E-cadherin) to CDH2 (N-cadherin). This is also directly paralleled in our human system, which shows low-level co-expression of CDH1 and CDH2 at day 4 of differentiation, immediately before apical area shrinks and apicobasal thickness increases.”

      Prompted by the in vivo data in Brooks et al (2025)[82], we are keen to further explore the timing of CDH1/CDH2 switching versus apical constriction with new experimental data in revisions.

      R3.2(Minor) 2) Can the authors elaborate a bit more on what is known regarding apicobasal thickening and pseudo-stratification and how their work fits into the current understanding in the discussion? This is a very interesting and less well studied mechanism critical to closure, which their model is well suited to directly address. I am thinking mainly of the Grego-Bessa et al., 2016 work on PTEN, though interestingly the work of Ohmura et al., 2012 on the NUAK kinases also shows reduced tissue thickening (and apical constriction) and I am sure I have missed others. Given that the authors identify MED24 as a likely candidate for the lack of apicobasal thickening in one of their patient derived lines, is there any evidence that it interacts with any of the known players?

      We have now added further discussion on the mechanisms by which the neuroepithelium undergoes apicobasal elongation. Nuclear compaction is likely to be necessary to allow pseudostratification and apicobasal elongation. The reviewer’s comment has led us to realise that diminished chromatin compaction is a potential outcome of MED24 down-regulation in our GOSB2 patient-derived line. Figure 4D suggests the nuclei of our MED24-deficient patient-derived line are less compacted than control equivalents and we propose to quantify nuclear volume in more detail to explore this possibility.

      Additionally, we have already expanded our discussion as suggested by the reviewer:

      Discussion: “Mechanistic separability of apical constriction and apicobasal elongation is consistent with biomechanical modelling of Xenopus neural tube closure showing that both are independently required for tissue bending[61]. Nonetheless, neuroepithelial apical constriction and apicobasal elongation are co-regulated in mouse models: for example, deletion of Nuak1/2[83], Cfl1[84], and Pten[79] all produce shorter neuroepithelium with larger apical areas. Neuroepithelial cells of the GOSB2 line described here, which has partial loss of MED24, similarly produce a thinner neuroepithelium with larger apical areas. Although apical areas were not analysed in mouse models of Med24 deletion, these embryos also have shorter and non-pseudostratified neuroepithelium.

      Our GOSB2 line – which retains readily detectable MED24 protein – is clearly less severe than the mouse global knockout, and the clinical features of the patient from which this line was derived are milder than the phenotype of Med24 knockout embryos[68]. Mouse embryos lacking one of Med24’s interaction partners in the mediator complex, Med1, also have thinner neuroepithelium and diminished neuronal differentiation but successfully close their neural tube[85]. As general regulators of polymerase activity, MED proteins have the potential to alter the timing or level of expression of many other genes, including those already known to influence pseudostratification or apicobasal elongation. MED depletion also causes redistribution of cohesin complexes[86] which may impact chromatin compaction, reducing nuclear volume during differentiation.”

      R3.3(Minor) 3) Is there any indication that Vangl2 is weakly or locally planar polarized in this system? Figure 2F seems to suggest not, but Supplementary Figure 5 does show at least more supracellular cable like structures that may have some polarity. I ask because polarization seems to be one of the properties that differs along the anteroposterior axis of the neural plate, and I wonder if this offers some insight into the position along the axis that this system most closely models?

      VANGL2 does not appear to be planar polarised in this system. This is similar to the mouse spinal neuroepithelium, in which apical VANGL2 is homogenous but F-actin is planar polarised (Galea et al Disease Models and Mechanisms 2018). We do observe local supracellular cable-like enrichments of F-actin in the apical surface of iPSC-derived neuroepithelial cells:

      Author response image 1.

      Preliminary identification of apical supracellular cables suggestive of local polarity. Top: F-actin staining shown in inverted grey LUT highlighting enrichment along directionally-polarised cell borders (blue arrows). Bottom: Staining orientation (blue ~ X axis, red ~ Y axis) based on OrientationJ analysis illustrating localised organisation of F-actin enrichment.

      We propose to compare the length of F-actin cables and coherency of their orientation at the start and end of neuroepithelial differentiation, and in wild-type versus VANGL2-mutant epithelia.

      Description of the revisions that have already been incorporated in the transferred manuscript

      Reviewer #1:

      Major points

      (1) It is mentioned throughout the manuscript that 3 plates were evaluated per line. I believe these are independently differentiated plates. This detail is critical concerning rigor and reproducibility. This should be clearly stated in the Methods section and in the first description of the experimental system in the Results section for Figure 1.

      These experimental details have now been clarified. Unless otherwise stated, all findings were confirmed in three independently differentiated plates from the same line or at least one differentiation from each of three lines. 

      Methods: Unless otherwise stated, for each iPSC line three independently differentiated plates were generated and analysed, with each plate representing a separate differentiation experiment performed on different days.

      (2) For the patient-specific lines - how many lines were derived per patient?

      This has now been clarified in the methods. Microfluidic reprogramming of a small number of amniocytes produces one line per patient representing a pool of clones. Subcloning from individual cells would not be possible within the timeframe of a pregnancy. 

      Methods: For patient-specific iPSC lines, one independent iPSC line was obtained per patient following microfluidic mmRNA reprogramming.

      (3) Was the Vangl2 variant introduced by prime editing? Base editing? The details of the methods are sparse.

      We have now expanded these details:

      Methods: “VANGL2 knock-in lines were generated using CRISPR-Cas9 homology-directed repair editing by Synthego (SO-9291367-1). The guide sequence was AUGAGCGAAGGGUGCGCAAG and the donor sequence was CAATGAGTACTACTATGAGGAGGCTGAGCATGAGCGAAGGGTGTGCAAGAGGAGGGCCAGGTGGGTCCCTGGGGGAGAAGAGGAGAG.

      Sequence modification was confirmed by Sanger sequencing before delivery of the modified clones, and Sanger sequencing was repeated after expansion of the lines (Supplementary Figure 5) as well as SNP arrays (Illumina iScan, not shown) confirming genomic stability.”

      Author response image 2.

      Snapshot of Illumina iScan SNP array showing absence of chromosomal duplications or deletions in the CRISPR-modified VANGL2-knockin lines or their congenic control.

      (4) Suggested text changes.

      Some additional suggestions for improvement.

      The abstract could be more clearly written to effectively convey the study's importance. Here are some suggestions

      Line 26: Insert "apicobasal" before "elongation" - the way it is written, I initially interpreted it as anterior-posterior elongation.

      Line 29: Please specify that the lines refer to 3 different established parent iPSC lines with distinct origins and established using different reprogramming methods, plus 2 control patient-derived lines. - The reproducibility of the cell behaviors is impressive, but this is not captured in the abstract.

      Line 32: add that this mutation was introduced by CRISPR-Cas9 base/prime editing.

      The last sentence of the abstract states that the study links only apical constriction to human NTDs, but the study also reveals defects in neural differentiation and apical-basal elongation. The introduction could also use some editing.

      Line 71: insert "that pulls actin filaments together" after "power strokes"

      Line 73: "apically localized," do you mean "mediolaterally" or "radially"?

      Line 75: Can you specify that PCP components promote "mediolaterally orientated" apical constriction

      Line 127: Specify that NE functions, including apical basal elongation and neurodifferentiation, are disrupted in patient-derived models

      All have now been corrected.

      Reviewer #2:

      Major comments:

      (1) Figure 1. The authors use F-actin to segment cell areas. Perhaps this could be done more accurately with ZO-1, as F-actin cables can cross the surface of a single cell. In any case, the authors need to show a measure of segmentation precision: segmented image vs. raw image plus a nuclear marker (DAPI, H2B-GFP), so we can check that the number of segmented cells matches the number of nuclei.

      We used ZO-1 to quantify apical areas of the VANGL2-knockin lines in Figure 3. Segmentation of neuroepithelial apical areas based on F-actin staining is commonplace in the field (e.g. in the Brooks et al 2022 paper cited by another reviewer), and is generally robust because the cell junctions are much brighter than any apical fibres not associated with the apical cortex. However, we accept that at earlier stages of differentiation there may be more apical fibres when cells are cuboidal. We have therefore repeated our analysis of apical area using ZO-1 staining as suggested, analysing a more temporally-detailed time course in one iPSC line. This new analysis confirms our finding of lack of apical area change between days 2-4 of differentiation, then progressive reduction of apical area between days 4-8, further validating our system. Including nuclear images is not helpful because of the high nuclear index of pseudostratified epithelia (e.g. see Supplementary Figure 7) which means that nuclei overlap along the apicobasal axis. Individual nuclei cannot be related to their apical surface in projected images.

      (3) Figure 2d. The laser ablation experiment in the presence of ROCK inhibitor is clear, as I can easily see the cell outlines before and after the experiment. In the absence of ROCK inhibitor, the cell edges are blurry, and I am not convinced the outline that the authors drew is really the cell boundary. Perhaps the authors can try to ablate a larger cell patch so that the change in area is more defined.

      The outlines on these images are not intended to show cell boundaries, but rather link landmarks visible at both timepoints to calculate cluster (not cell) change in area. This is as previously shown in Galea et al Nat Commun 2021 and Butler et al J Cell Sci 2019. We have now amended the visualisation of retraction to make representation of differences between conditions more intuitive. 

      (4) Figure 2d. Do the cells become thicker after recoil?

      This is unlikely because the ablated surface remains in the focal plane. Unfortunately, we are unable to image perpendicularly to the direction of ablation to test whether their apical surface moves in Z even by a very small amount. This has now been clarified in the results:

      Results: “The ablated surface remained within the focal plane after ablation, indicating minimal movement along the apical-basal axis.”

      (6) Lines 403-415. The authors report poor neural induction and neuronal differentiation in GOSB2. As far as I understand, this phenotype does not represent the in vivo situation. Thus, it is not clear to what extent the in vitro 2D model describes the human patient.

      The GOSB2 iPSC line we describe does represent the in vivo situation in Med24 knockout mouse embryos, but is clearly less severe because we are still able to detect MED24 protein expressed in this line. We do not have detailed clinical data of the patient from which this line was obtained to determine whether their neurological development is normal. However, it is well established that some individuals who have spina bifida also have abnormalities in supratentorial brain development. It is therefore likely that abnormalities in neuron differentiation/maturation are concomitant with spina bifida. Our findings in the GOSB2 line complement earlier studies which also identified deficiencies in the ability of patient-derived lines to form neurons, but were unable to functionally assess the neuroepithelial cell behaviours we studied. This has now been clarified in the discussion:

      Discussion: “Neuroepithelial cells of the GOSB2 line described here, which has partial loss of MED24, similarly produce a thinner neuroepithelium with larger apical areas. Although apical areas were not analysed in mouse models of Med24 deletion, these embryos also have shorter and non-pseudostratified neuroepithelium.

      Our GOSB2 line – which retains readily detectable MED24 protein – is clearly less severe than the mouse global knockout, and the clinical features of the patient from which this line was derived are milder than the phenotype of Med24 knockout embryos[68].

      Mouse embryos lacking one of Med24’s interaction partners in the mediator complex, Med1, also have thinner neuroepithelium and diminished neuronal differentiation but successfully close their neural tube[85].”

      (7) The experimental feat to derive cell lines from amniotic fluid and to perform experiments before birth is, in my view, heroic. However, I do not feel I learned much from the in vitro assays. There are many genetic changes that may cause the in vivo phenotype in the patient. The authors focus on MED24, but there is not enough convincing evidence that this is the key gene. I would like to suggest overexpression of MED24 as a rescue experiment, but I am not sure this is a single-gene phenotype. In addition, the fact that one patient line does not differentiate properly leads me to think that the patient lines do not strengthen the manuscript, and that perhaps additional clean mutations might contribute more.

      We appreciate the reviewer’s praise of our personalised medicine approach and fully agree that neural tube defects are rarely monogenic. The patient lines we studied were not intended to provide mechanistic insight, but rather to demonstrate the future applicability of our approach to patient care. Our vision is that every patient referred for fetal surgery of spina bifida will have amniocytes (collected as part of routine cystocentesis required before surgery) reprogrammed and differentiated into neuroepithelial cells, then neural progenitors, to help stratify their postnatal care. One could also picture these cells becoming an autologous source for future cell-based therapies if they pass our reproducible analysis pipeline as functional quality control. This has now been clarified in the discussion:

      Discussion: “The multi-genic nature of neural tube defect susceptibility, compounded by uncontrolled environmental risk factors (including maternal age and parity[102]), means that patient-derived iPSC models are unlikely to provide mechanistic insight. They do provide personalised disease models which we anticipate will enable functional validation of genetic diagnoses for patients and their parents’ recurrence risk in future pregnancies, and may eventually stratify patients’ postnatal care. We also envision this model will enable quality control of patient-derived cells intended for future autologous cell replacement therapies, as is being developed in post-natal spinal cord injury[103]. Thus, the highly reproducible modelling platform we evaluate – which is robust to differences in iPSC reprogramming method, sex and ethnicity – represents a valuable tool for future mechanistic insights and personalised disease modelling applications.”

      Significance:

      In addition, the model was unsuccessful in one of the two patient-derived lines, which limits generalizability and weakens claims of patient-specific predictive value.

      We disagree with the reviewer that “the model was unsuccessful in one of the two patient-derived lines”. The GOSB1 line demonstrated deficiency of neuron differentiation independently of neuroepithelial biomechanical function, whereas the GOSB2 line showed earlier failure of neuroepithelial function. We also do not, at this stage, make patient-specific predictive claims: this will require longer-term matching of cell model findings with patient phenotypes over the next 5-10 years.

      Reviewer #3:

      Major comments

      (1) One of my few concerns with this work is that the relative constriction of the apical surface with respect to the basal surface is not directly quantified for any of the experiments. This worry is slightly compounded by the 3D reconstructions in Figure 1h, and the observation that overall cell volume is reduced and cell height increased simultaneously to area loss. Additionally, the net impact of apical constriction in tissues in vivo is to create local or global curvature change, but all the images in the paper suggest that the differentiated neural tissues are an uncurved monolayer even missing local buckles. I understand that these cells are grown on flat adherent surfaces limiting global curvature change, but is there evidence of localized buckling in the monolayer? While I believe, along with the authors, that their phenotypes are likely failures in apical constriction, I think they should work to strengthen this conclusion. I think the easiest way (and hopefully using data they already have) would be to directly compare apical area to basal area on a cell-wise basis for some number of cells. Given the heterogeneity of cells, perhaps 30-50 cells per condition/line/mutant would be good? I am open to other approaches; this just seems like it may not require additional experiments.

      As the reviewer observes, our cultures cannot bend because they are adhered on a rigid surface. The apical and basal lengths of the cultures will therefore necessarily be roughly equal. Some inwards bending of the epithelium is expected at the edges of the dish, but these cannot be imaged. The live imaging we show in Figure 2 illustrates that, just as happens in vivo, apical constriction is asynchronous. This means not all cells will have ‘bottle’ shapes in the same culture. We now illustrate the evolution of these shapes in more detail in Supplementary Figure 1.

      Additionally, the reviewer’s comment motivated us to investigate local buckles in the apical surface of our cultures when their apical surfaces are dilated by ROCK inhibition. We hypothesised that the very straight apical surface in normal cultures is achieved by a balance of apical cell size and tension with pressure differences at the cell-liquid interface. Consistent with our expectation, the apical surface of ROCK-inhibited cultures becomes wrinkled (Supplementary figure 4). The VANGL2-KI lines do not develop this tortuous apical surface (as shown in Figure 3), which is to be expected given their modification is present throughout differentiation unlike the acute dilation caused by ROCK inhibition.

      This new data complements our visualisation of apical constriction in live imaging, apical accumulation of phospho-myosin, and quantification of ROCK-dependent apical tension as independent lines of evidence that our cultures undergo apical constriction. 

      (2) Another slight experimental concern I have regards the difference in laser ablation experiments detailed in Figure 3h-i from those of Figure 2d-e. It seems like WT recoil values in 3h-i are more variable and of a lower average than the earlier experiments and given that it appears significance is reached mainly by impact of the lower values, can the authors explain if this variability is expected to be due to heterogeneity in the tissue, i.e. some areas have higher local tension? If so, would that correspond with more local apical constriction?

      There is no significant difference in recoil between the control lines in Figures 2 and 3, albeit the data in Figure 3 is more variable (necessitating more replicates: none were excluded). We also showed laser ablation recoil data in Supplementary Figure 10, in which we did identify a graphing error (now corrected, also no significant difference in recoil from the other control groups as shown in Author response image 3).

      Author response image 3.

      Recoil following laser ablation is not significantly different between different experiments. X axis labels indicate the figure panel each set of ablation data is shown in. Each point represents an independent differentiation dish.

      (4)(Minor) I think some of the commentary on the strengths and limitations of the model found in the Results section should be collated and moved to the discussion in a single paragraph. For example, this could also briefly touch on/compare to some of the other models utilizing hiPSCs (These are mentioned briefly in the intro, but this comparison could be elaborated on a bit after seeing all the great data in this work).

      These changes have now been made:

      Discussion: “Some of these limitations, potentially including inclusion of environmental risk factors, can be addressed by using alternative iPSC-derived models[93,94]. For example, if patients have suspected causative mutations in genes specific to the surface (non-neural) ectoderm, such as GRHL2/3, 3D models described by Karzbrun et al[49] or Huang et al[95] may be informative. Characterisation of surface ectoderm behaviours in those models is currently lacking. These models are particularly useful for high-throughput screens of induced mutations[95], but their reproducibility between cell lines, necessary to compare patient samples to non-congenic controls, remains to be validated. Spinal cell identities can be generated in human spinal cord organoids, although these have highly variable morphologies[96,97]. As such, each iPSC model presents limitations and opportunities, to which this study contributes a reductionist and highly reproducible system in which to quantitatively compare multiple neuroepithelial functions.”

      (5) While the authors are generally good about labeling figures by the day post smad inhibition, in some figures it is not clear either from the images or the legend text. I believe this includes supplemental figures 2,5,6,8, and 10 (apologies if I simply missed it in one or more of them)

      These have now been added.

      (6) The legend for Figure 2 refers to a panel that is not present and the remaining panel descriptions are off by a letter. I'm guessing this is a versioning error as the text itself seems largely correct, but it may be good to check for any other similar errors that snuck in

      This has now been corrected.

      (7) The cell outlines in Figure 3d are a bit hard to see both in print and on the screen, perhaps increase the displayed intensity?

      This has now been corrected.

      Description of analyses that authors prefer not to carry out

      R2.5. Figure 3. The authors mention their previous study in which they show that Vangl2 is not cell-autonomously required for neural closure. It will be interesting to study whether this is also the case in the present human model by using mosaic cultures.

      The reviewer is correct that this is one of the exciting potential future applications of our model, which will first require us to generate stable fluorescently-tagged lines (to identify those cells which lack VANGL2). We will also need to extensively analyze controls to validate that mixing fluo-tagged and untagged lines does not alter the homogeneity of differentiation, or apical constriction, independently of VANGL2 deletion. As such, the reviewer is suggesting an altogether new project which carries considerable risk and will require us to secure dedicated funding to undertake.

      R3.8(Minor) The authors show a fascinating piece of data in Supplementary Figure 1, demonstrating that nuclear volume is halved by day 8. Do they have any indication if the DNA content remains constant (e.g., integrated DAPI density)? I suppose it must, and this is a minor point in the grand scheme, but this represents a significant nuclear remodeling and may impact the overall DNA accessibility.

      We agree with the reviewer that the reduction in nuclear volume is important data both because it informs understanding of the reduction in total cell volume, and because it suggests active chromatin compaction during differentiation. Unfortunately, the thicker epithelium and superimposition of nuclei in the differentiated condition means the laser light path is substantially different, making direct comparisons of intensity uninterpretable. Additionally, the apical-most nuclei will mostly be in G2/M phase due to interkinetic nuclear migration. As such, the comparison of DAPI integrated density between epithelial morphologies would not be informative (Author response image 4).

      Author response image 4.

      Lateral views of DAPI-stained nuclei on Days 2 and 8 of differentiation. Note the rapid loss of staining intensity below the apical pseudo-row of nuclei on Day 8. This intensity change is likely due to the apical nuclei being in G2/M phase and therefore having more DNA, and rapid loss of 405nm wavelength signal at depth.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This work addresses a key question in cell signalling: how does the membrane composition affect the behaviour of a membrane signalling protein? Understanding this is important, not just to understand basic biological function but because membrane composition is highly altered in diseases such as cancer and neurodegenerative disease. Although parts of this question have been addressed on fragments of the target membrane protein, EGFR, used here, Srinivasan et al. harness a unique tool, membrane nanodisks, which allow them to probe full-length EGFR in vitro in great detail with cutting-edge fluorescent tools. They find interesting impacts on EGFR conformation in differently charged and fluid membranes, explaining previously identified signalling phenotypes.

      Strengths:

      The nanodisk system enables full-length EGFR to be studied in vitro and in a membrane with varying lipid and cholesterol concentrations. The authors combine this with single-molecule FRET utilising multiple pairs of fluorophores at different places on the protein to probe different conformational changes in response to EGF binding under different anionic lipid and cholesterol concentrations. They further support their findings using molecular dynamics simulations, which help uncover the full atomistic detail of the conformations they observe.

      Weaknesses:

      Much of the interpretation of the results comes down to a bimodal model of an 'open' and 'closed' state between the intracellular tail of the protein and the membrane. Some of the data looks like a bimodal model is appropriate, but its use is not sufficiently justified (statistically or otherwise) in this work in its current form. The experiments with varying cholesterol in particular appear to suggest an alternate model with longer fluorescent lifetimes. More justification of these interpretations of the central experiment of this work would strengthen the paper.

      We thank the reviewer for highlighting the strengths of the study, including the use of nanodiscs, single-molecule FRET, and MD simulations to probe full-length EGFR in controlled membrane environments.

      We agree that statistical justification is important for interpreting the distributions. To address this, we performed global fits of the data with both two- and three-Gaussian models and evaluated them using the Bayesian Information Criterion (BIC), which balances the model fit with a penalty for additional parameters. The three-Gaussian model gave a substantially lower BIC, indicating statistical preference for the more complex model. However, we also assessed the separability of the Gaussian components using Ashman’s D, which quantifies whether peaks are distinct. This analysis showed that two of the Gaussians (µ = 2.64 and 3.43 ns) are not separable, implying they represent one broad distribution rather than two states.

      Author response table 1.

      Both the two- and three-Gaussian models include a low-value component (µ = ~1.3 ns), but the apparent improvement of the three-Gaussian model arises only from splitting the central population into two overlapping Gaussians. Thus, while the BIC favors the three-Gaussian model statistically, Ashman’s D demonstrates that the central peak should not be interpreted as bimodal. Therefore, when all the distributions are fit globally, the data are best explained as two Gaussians, one centered at ~1.3 ns and the other at ~2.7 ns, with cholesterol-dependent shifts reflecting changes in the distribution of this population rather than the emergence of a separate state. Finally, we acknowledge that additional conformations may exist, but based on this analysis a bimodal model describes the populations captured in our data and so we limit ourselves to this simplest framework.

      We have clarified this in the revised manuscript by adding a section in the Methods (page 26) titled Model Selection and Statistical Analysis, which describes the results of the global two- versus three-Gaussian fits evaluated using BIC and Ashman’s D. Additional details of these analyses are also provided in response to Reviewer #1, Question 8 (Recommendations for the authors).
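
      The two criteria discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names `bic` and `ashman_d` are ours, the component means are the ones quoted in the response, and the component widths (σ) are hypothetical placeholders because they are not reported here. By the usual convention, D > 2 is taken as the threshold for cleanly separated Gaussian components.

```python
import math

def bic(log_likelihood, n_params, n_points):
    """Bayesian Information Criterion; the model with the lower value is preferred."""
    return n_params * math.log(n_points) - 2.0 * log_likelihood

def ashman_d(mu1, sigma1, mu2, sigma2):
    """Ashman's D for two Gaussian components; D > 2 indicates clean separation."""
    return math.sqrt(2.0) * abs(mu1 - mu2) / math.sqrt(sigma1**2 + sigma2**2)

# A mixture of k Gaussians has 3k - 1 free parameters
# (k means, k widths, k - 1 independent weights).
params_two = 3 * 2 - 1
params_three = 3 * 3 - 1

# Means (ns) are quoted in the response; the widths are HYPOTHETICAL placeholders.
d_central = ashman_d(2.64, 0.60, 3.43, 0.70)   # the two overlapping central components
d_low_high = ashman_d(1.30, 0.30, 2.70, 0.60)  # low-lifetime vs. central population

print(f"central pair:    D = {d_central:.2f}")
print(f"low vs. central: D = {d_low_high:.2f}")
```

      With these placeholder widths the central pair falls below the threshold while the low-lifetime component does not, which is the pattern the response describes; the actual D values depend on the fitted widths.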

      Reviewer #2 (Public review):

      Summary:

      Nanodiscs and synthesized EGFR are co-assembled directly in cell-free reactions. Nanodiscs containing membranes with different lipid compositions are obtained by providing liposomes with corresponding lipid mixtures in the reaction. The authors focus on the effects of lipid charge and fluidity on EGFR activity.

      Strengths:

      The authors implement a variety of complementary techniques to analyze data and to verify results. They further provide a new pipeline to study lipid effects on membrane protein function.

      We thank the reviewer for noting the strengths of our approach, particularly the use of complementary techniques and the development of a new pipeline to study lipid effects on membrane protein function.

      Weaknesses:

      Due to the relative novelty of the approach, a number of concerns remain.

      (1) I am a little skeptical about the good correlation of the nanodisc compositions with the liposome compositions. I would rather have expected a kind of clustering of individual lipid types in the liposome membrane, in particular of cholesterol. This should then result in an uneven distribution upon nanodisc assembly, i.e., in a notable variation of lipid composition in the individual nanodiscs. Could this be ruled out by the implemented assays, or can just the overall lipid composition of the complete nanodisc fraction be analyzed?

      We monitored insertion of anionic lipids into nanodiscs by performing zeta potential measurements, which report on surface charge, and cholesterol insertion by Laurdan fluorescence, which reports on membrane order. Both assays provide information at the ensemble level, not single-nanodisc resolution. We clarified this in the Methods section (see below).

      Cholesterol clustering is well documented in ternary systems with saturated lipids and sphingolipids [Veatch, Biophys J., 2003; Risselada, PNAS, 2008]. However, in unsaturated POPC-cholesterol mixtures such as those used here, cholesterol primarily alters bilayer order and large-scale segregation is not typically observed. The addition of POPS to the POPC-cholesterol mixture perturbs cholesterol-induced ordering, lowering the likelihood of cholesterol-rich domains [Kumar, J. Mol. Graphics Modell., 2021].

      Lipid heterogeneity between nanodiscs would be expected to give rise to heterogeneity in hydrodynamic properties, including potentially broadening the dynamic light scattering (DLS) distributions. However, the full width at half maximum (FWHM) values from the DLS measurements (see Author response table 2) do not indicate a broadening with cholesterol. Statistical testing (Mann-Whitney U test for non-normal data) showed no significant difference between samples with and without cholesterol (p = 0.486; n = 4 per group). While the sample size is small, making firm conclusions challenging, these results suggest that large-scale heterogeneity is unlikely.
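
      To make the small-sample caveat concrete, the following sketch computes an exact two-sided Mann-Whitney U test by enumerating every possible group assignment. It is an illustration under stated assumptions, not the authors' procedure: the FWHM values are invented, the function names are ours, and the response does not say whether an exact or an approximate p-value was used. With four untied values per group there are only 70 possible assignments, so exact two-sided p-values move in steps of 2/70 and cannot fall below 2/70 ≈ 0.029.

```python
from itertools import combinations

def u_statistic(x, y):
    """Mann-Whitney U for x: number of (x, y) pairs with x > y (ties count 0.5)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def mann_whitney_exact(x, y):
    """Exact two-sided p-value by enumerating all splits of the pooled data."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    centre = n1 * n2 / 2.0                      # expected U under the null
    observed = u_statistic(x, y)
    observed_dev = abs(observed - centre)
    extreme = total = 0
    for idx in combinations(range(n1 + n2), n1):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(n1 + n2) if i not in chosen]
        total += 1
        if abs(u_statistic(gx, gy) - centre) >= observed_dev:
            extreme += 1
    return observed, extreme / total

# HYPOTHETICAL DLS peak widths (nm), four preparations per group.
fwhm_without_chol = [4.1, 4.6, 5.0, 5.2]
fwhm_with_chol = [4.3, 4.9, 5.3, 6.0]

u, p = mann_whitney_exact(fwhm_without_chol, fwhm_with_chol)
print(f"U = {u:.1f}, exact two-sided p = {p:.3f}")
```

      The invented values were chosen to give U = 5, for which the exact two-sided p-value is 34/70; the coarse spacing of attainable p-values is the reason a non-significant result at n = 4 per group is only weak evidence against heterogeneity.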

      Author response table 2.

      In the case of POPS lipids, clustering of POPS in EGFR-embedded nanodiscs is a recognized property of receptor-lipid interactions. Molecular dynamics simulations have shown that POPS, although constituting only 30% of the inner leaflet, accounts for ~50% of the lipids directly contacting EGFR [Arkhipov, Cell, 2013], underscoring that anionic lipids are preferentially recruited to the receptor’s immediate environment.

      For nanodiscs containing cholesterol and anionic lipids, our smFRET experiments were designed to isolate the effect of EGF binding. The nanodisc population is the same in the ± EGF conditions as EGF was introduced just prior to performing smFRET experiments, and not during nanodisc assembly. Thus, for a given lipid composition, any observed differences between ligand-free and ligand-bound states reflect conformational changes of EGFR.

      Methods, page 23, “Zeta potential measurements to quantify surface charge of nanodiscs: Data were processed using the instrument’s Malvern DTS software to obtain the mean zeta-potential value. This ensemble measurement reports the average surface charge of the nanodisc population, verifying incorporation of anionic POPS lipids.”

      Methods, page 23, “Fluorescence measurements with Laurdan to confirm cholesterol insertion into nanodiscs: The excitation spectrum was recorded by collecting the emission at 440 nm and the emission spectrum was recorded by exciting the sample at 385 nm. Laurdan fluorescence provides an ensemble readout of membrane order and confirms cholesterol incorporation into the nanodisc population. While Laurdan does not resolve the composition of individual nanodiscs, prior work has shown that POPC–cholesterol mixtures are miscible without forming cholesterol-rich domains[91,92], thus the observed ordering changes likely reflect the intended input cholesterol content at the ensemble level.”

      (91) Veatch, S. L. & Keller, S. L. Separation of liquid phases in giant vesicles of ternary mixtures of phospholipids and cholesterol. Biophysical journal, 85(5), 3074-3083 (2003).

      (92) Risselada, H. J. & Marrink, S. J. The molecular face of lipid rafts in model membranes. Proceedings of the National Academy of Sciences 105(45), 17367–17372 (2008).

      (2) Both templates have been added simultaneously, with a 100-fold excess of the EGFR template. Was this the result of optimization? How is the kinetics of protein production? As EGFR is in far excess, a significant precipitation, at least in the early period of the reaction, due to limiting nanodiscs, should be expected. How is the oligomeric form of the inserted EGFR? Have multiple insertions into one nanodisc been observed?

      We thank the reviewer for these insightful questions. Yes, the EGFR:ApoA1∆49 template ratio of 100:1 was empirically determined through optimization experiments now shown in the revised Supplementary Fig. 3. Cell-free reactions were performed across a range of EGFR:ApoA1∆49 template ratios (1:2 to 1:200) and sampled at different time points (2-19 hours). As shown in the gels, EGFR expression increased with higher template ratios and longer reaction times up to ~9 hours, while ApoA1 expression became clearly detectable only after 6 hours. Based on these results, we selected an EGFR:ApoA1∆49 ratio of 100:1 and 8-hour reaction time as the optimal condition, which yielded sufficient full-length EGFR incorporated into nanodiscs for ensemble and single-molecule experiments.

      In cell-free systems, protein yield does not scale directly with DNA template concentration, as translation efficiency is limited by factors such as ribosome availability and co-translational membrane insertion [Hunt, Chem. Rev., 2024; Blackholly, Front. Mol. Biosci., 2022]. Consistent with this, we observed that ApoA1∆49 is produced at higher levels than EGFR despite the lower DNA input (Supplementary Fig. 2b). Providing excess EGFR template prevents the reaction from becoming limited by scaffold availability and helps compensate for the fact that, as a large multi-domain receptor, EGFR expression can yield truncated as well as full-length products. This strategy ensures that sufficient full-length receptors are available for nanodisc incorporation. We have clarified this in the Methods section (see below).

      We observed little to no visible precipitation under the reported cell-free conditions, likely for the following reasons: (i) EGFR and ApoA1∆49 are co-expressed in the cell-free reaction, and ApoA1∆49 assembles into nanodiscs concurrently with receptor translation, providing an immediate membrane sink; and (ii) ApoA1∆49 is expressed at high levels, maintaining disc concentrations that keep the reaction in a soluble regime.

      The sample contains donor-labeled EGFR (snap surface 594) together with acceptor-labeled lipids (Cy5-labeled PE doped in the nanodisc). We assess the oligomerization state of EGFR in nanodiscs using single-molecule photobleaching of the donor channel. Snap surface 594 is a benzyl guanine derivative of Atto 594 that reacts with the SNAP tag with near-stoichiometric efficiency [Sun, Chembiochem, 2011]. Most molecules (~75%) exhibited a single photobleaching step, consistent with incorporation of a single EGFR per nanodisc [Srinivasan, Nat. Commun., 2022]. A minority of traces (~15%) showed two photobleaching steps and ~10% of traces showed three or more photobleaching steps, consistent with occasional multiple insertions. For all smFRET analysis, we restricted the dataset to single-step photobleaching traces, ensuring measurements were performed on monomeric EGFR.
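
      The selection rule described above amounts to tallying photobleaching steps per trace and keeping only the single-step traces. The sketch below illustrates that bookkeeping only; it assumes the step count for each trace has already been determined (step detection itself is not shown), and the function names and the 20 example counts are hypothetical, chosen to mirror the approximate fractions quoted in the response.

```python
from collections import Counter

def step_fractions(step_counts):
    """Fraction of traces showing 1, 2, or 3+ donor photobleaching steps."""
    binned = Counter(min(n, 3) for n in step_counts if n >= 1)
    total = sum(binned.values())
    return {label: binned[k] / total for k, label in ((1, "1"), (2, "2"), (3, "3+"))}

def monomer_indices(step_counts):
    """Indices of traces with exactly one step, i.e. one labeled receptor per disc."""
    return [i for i, n in enumerate(step_counts) if n == 1]

# HYPOTHETICAL per-trace step counts for 20 traces.
steps = [1] * 15 + [2] * 3 + [3, 4]

fractions = step_fractions(steps)
keep = monomer_indices(steps)
print(fractions)
print(f"{len(keep)} of {len(steps)} traces retained for smFRET analysis")
```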

      Methods, page 20, “Production of labeled, full-length EGFR nanodiscs: Briefly, the E. coli slyD lysate, in vitro protein synthesis E. coli reaction buffer, amino acids (-Methionine), Methionine, T7 Enzyme, protease inhibitor cocktail (Thermo Fisher Scientific), RNase inhibitor (Roche) and DNA plasmids (20 µg of EGFR and 0.2 µg of ApoA1∆49) were mixed with different lipid mixtures. The DNA template ratio of EGFR:ApoA1∆49 = 100:1 was empirically chosen by testing different ratios on SDS-PAGE gels and selecting the condition that maximized full-length EGFR expression in DMPC lipids (Supplementary Fig. 3).”

      (3) The IMAC purification does not discriminate between EGFR-filled and empty nanodiscs. Does the TEM study give any information about the composition of the particles (empty, EGFR monomers, or EGFR oligomers)? Normalizing the measured fluorescence, i.e., the total amount of solubilized receptor, with the total protein concentration of the samples could give some data on the stoichiometry of EGFR and nanodiscs.

      Negative-stain TEM was performed to confirm nanodisc formation and morphology, but this method does not resolve whether a given disc contains EGFR. To directly assess receptor stoichiometry, we instead relied on single-molecule photobleaching of snap surface 594-labeled EGFR (see response to Point 2). These experiments showed that the majority of nanodiscs contain a single receptor, with a minority containing two receptors. For all smFRET analyses, we restricted data to single-step photobleaching traces, ensuring measurements were performed on monomeric EGFR.

      We did not normalize EGFR fluorescence to total protein concentration because the bulk protein fraction after IMAC purification includes both receptor-loaded and empty nanodiscs. The latter contribute to ApoA1∆49 mass but do not contain receptors and including them would underestimate receptor occupancy. Importantly, the presence of empty nanodiscs does not affect our measurements as photobleaching and single-molecule FRET analyses selectively report only on receptor-containing nanodiscs. This clarification has been added to the Methods.

      Methods, page 26, “Fluorescence Spectroscopy: Traces with a single photobleaching step for the donor and acceptor were considered for further analysis. Regions of constant intensity in the traces were identified by a change-point algorithm[95]. Donor traces were assigned as FRET levels until acceptor photobleaching. The presence of empty nanodiscs does not influence these measurements, as photobleaching and single-molecule FRET analyses selectively report on receptor-containing nanodiscs.”

      (4) The authors generally assume a 100% functional folding of EGFR in all analyzed environments. While this could be the case, with some other membrane proteins, it was shown that only a fraction of the nanodisc solubilized particles are in functional conformation. Furthermore, the percentage of solubilized and folded membrane protein may change with the membrane composition of the supplied nanodiscs, while non-charged lipids mostly gave rather poor sample quality. The authors normalize the ATP binding to the total amount of detectable EGFR, and variations are interpreted as suppression of activity. Would the presence of unfolded EGFR fractions in some samples with no access to ATP binding be an alternative interpretation?

      We agree that not all nanodisc-embedded EGFR molecules may be fully functional and that the fraction of folded protein could vary with lipid composition. In our ATP-binding assay, EGFR detection relies on the C-terminal SNAP-tag fused to an intrinsically disordered region. Successful labeling requires that this segment be translated, accessible, and folded sufficiently to accommodate the SNAP reaction, which imposes an additional requirement compared to the rigid, structured kinase domain where ATP binds. Misfolded or truncated EGFR molecules would therefore likely fail to label at the C-terminus. These factors strongly imply that our assay predominantly reports on receptor molecules that are intact and well folded.

      Additionally, our molecular dynamics simulations at 0% and 30% POPS support the experimental ATP-binding measurements (Fig. 2c, d). This consistency between the experimental and simulated evidence, including at 0% POPS where reduced receptor folding might be expected, suggests that the observed lipid-dependent changes are more likely due to modulation of the functional receptor than to receptor misfolding. We have clarified these points by adding the following text:

      Results, page 7, “Role of anionic lipids in EGFR kinase activity: In the presence of EGF, increasing the anionic lipid content decreased the number of contacts from 71.8 ± 1.8 to 67.8 ± 2.4, indicating increased accessibility, again in line with the experimental findings. Because detection of EGFR relies on labeling at the C-terminus and ATP binding requires an intact kinase domain, the ATP-binding assay is for receptors that are properly folded and competent for nucleotide binding. The consistency between experimental results and MD simulations suggests that the observed lipid-dependent changes are more likely due to modulation of functional EGFR than to artifacts from misfolding.”

      Reviewer #1 (Recommendations for the authors):

      The experimental program presented here is excellent, and the results are highly interesting. My enthusiasm is dampened by the presentation in places which is confusing, especially Figure 3, which contains so many of the results. I also have some reservations about the bimodal interpretation of the lifetime data in Figure 3.

      We thank the reviewer for their positive assessment of our experimental approach and results. In the revised version, we have improved figure organization and readability by adding explicit labels for lipid composition and EGF presence/absence in all lifetime distributions, moving key supplementary tables into main text, and reorganizing the supplementary figures as Extended Data Figures following eLife’s format. Figures and tables now appear in the order in which they are referenced in the text to further improve readability.

      Regarding the bimodal interpretation of the lifetime distribution, we have performed global fits of the data with both two- and three-Gaussian models and evaluated them using the Bayesian Information Criterion (BIC) and Ashman’s D analysis, which supported the bimodal interpretation. Details of this analysis are provided in our response to comment (8) below and included in the manuscript.

      Specific comments below:

      (1) Abstract -"Identifying and investigating this contribution have been challenging owing to the complex composition of the plasma membrane" should be "has".

      We have corrected this error in the revised manuscript.

      (2) Results - p4 - some explanation of what POPC/POPS are would be helpful.

      We have added the text below discussing POPC and POPS.

      Results, page 4, “POPC is a zwitterionic phospholipid forming neutral membranes, whereas POPS carries a net negative charge and provides anionic character to the bilayer[56]. Both PC and PS lipids are common constituents of mammalian plasma membranes, with PC enriched in the outer leaflet and PS in the inner leaflet[22].”

      (22) Lorent, J. H., Levental, K. R., Ganesan, L., Rivera-Longsworth, G., Sezgin, E., Doktorova, M., Lyman, E. & Levental, I. Plasma membranes are asymmetric in lipid unsaturation, packing and protein shape. Nature Chemical Biology 16, 644–652 (2020).

      (56) Her, C., Filoti, D. I., McLean, M. A., Sligar, S. G., Ross, J. A., Steele, H. & Laue, T. M. The charge properties of phospholipid nanodiscs. Biophysical journal 111(5), 989–998 (2016).

      (3) Figure 2b - it would be easier to compare if these were plotted on top of each other. Are we at saturating ATP binding concentration or below it? Also, please put a key to say purple - absent and orange +EGF on the figure. I am also confused as to why, with no EGF, ATP binding is high with 0% POPS, but low when EGF is present, but that then reverses with physiological lipid content.

      While we agree that a direct comparison would be easier, the ATP-binding experiments for the ± EGF conditions were actually performed independently on separate SDS-PAGE gels, which unfortunately precludes such a comparison. We have added a color key to clarify the -EGF and +EGF datasets.

      The experiments were carried out at 1 µM of the fluorescently labeled ATP analogue (atto647N-γ ATP). Reported kinetic measurements for the isolated EGFR kinase domain indicate a K<sub>m</sub> of 5.2 µM, suggesting that our experimental concentration is below, but close to, the saturating range, ensuring sensitivity to changes in accessibility of the binding site rather than saturating all available receptors.

      We have revised the manuscript to clarify these details by including the following text:

      Results, page 6, “To investigate how the membrane composition impacts accessibility, we measured ATP binding levels for EGFR in membranes with different anionic lipid content. 1 µM of fluorescently-labeled ATP analogue, atto647N-γ ATP, which binds irreversibly to the active site, was added to samples of EGFR nanodiscs with 0%, 15%, 30% or 60% anionic lipid content in the absence or presence of EGF.”

      Methods, page 24, “ATP binding experiments: Full-length EGFR in different lipid environments was prepared using cell-free expression as described above. 1 µM of snap surface 488 (New England Biolabs) and atto647N labeled gamma ATP (Jena Bioscience) was added after cell-free expression and incubated at 30 °C, 300 rpm for 60 minutes. 1 µM of atto647N-γ ATP was used, corresponding to a concentration near the reported Km of 5.2 µM for ATP binding to the isolated EGFR kinase domain[93], ensuring sensitivity to lipid-dependent changes in ATP accessibility.”

      (ii) Nucleotide binding is suppressed under basal conditions, likely to ensure that the catalytic activity is promoted only upon EGF stimulation.

      The molecular dynamics simulations at 0% and 30% POPS further support this interpretation, showing that anionic lipids modulate the accessibility of the ATP-binding site in a manner consistent with experimental trends (Fig. 2c and 2d).

      We have clarified these points in the main text with the following additions:

      Results, page 6, “In the presence of EGF, ATP binding overall increased with anionic lipid content with the highest levels observed in 60% POPS bilayers. In the neutral bilayer, ligand seemed to suppress ATP binding, indicating anionic lipids are required for the regulated activation of EGFR.”

      Results, page 7, “In the absence of EGF, increasing the anionic lipid content from 0% POPS to 30% POPS increased the number of ATP-lipid contacts from 58.6 ± 0.7 to 74.4 ± 1.2, indicating reduced accessibility, consistent with the experimental results and suggesting anionic lipids are required for ligand-induced EGFR activity.”

      (93) Yun, C. H., Mengwasser, K. E., Toms, A. V., Woo, M. S., Greulich, H., Wong, K. K., Meyerson, M. & Eck, M. J. The T790M mutation in EGFR kinase causes drug resistance by increasing the affinity for ATP. PNAS 105(6), 2070–2075 (2008).

      (4) Figure 2d - how was the 16A distance arrived at?

      We thank the reviewer for pointing this out. The 16 Å cutoff was chosen based on the physical dimensions of the ATP analogue used in the experiments. Specifically, the largest radius of the atto647N-γ ATP molecule is ~16.9 Å, which defines the maximum distance at which lipid atoms could sterically obstruct access of ATP to the binding pocket. Accordingly, in the simulations, contacts were defined as pairs of coarse-grained atoms between lipid molecules and the residues forming the ATP-binding site (residues 694-703, 719, 766-769, 772-773, 817, 820, and 831) separated by less than 16 Å.

      We have rewritten the rationale for selecting the 16 Å cutoff in the Methods section to improve clarity.

      Methods, page 28, “Coarse-grained, Explicit-solvent Simulations with the MARTINI Force Field: We analyzed our simulations using WHAM[108,109] to reweight the umbrella biases and compute the average values of various metrics introduced in this manuscript. Specifically, we calculated the distance between Residue 721 and Residue 1186 (EGFR C-terminus) of the protein. To quantify the accessibility of the ATP-binding site, we calculated the number of contacts between lipid molecules and the residues forming the ATP-binding pocket (residues 694-703, 719, 766-769, 772-773, 817, 820, and 831)[110]. Close contact between the bilayer and these residues would sterically hinder ATP binding; thus, the contact number serves as a proxy for ATP-site accessibility. The cutoff distance for defining a contact was set to 16 Å, corresponding to the largest molecular radius of the fluorescent ATP analogue (atto647N-γ ATP, 16.96 Å[111]). Accordingly, we defined a contact as a pair of coarse-grained atoms, one from the lipid membrane and one from the ATP binding site, within a mutual distance of less than 16 Å.”
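      As a minimal illustration of the contact definition quoted above (a sketch, not the authors' analysis code), the snippet below counts lipid/binding-site bead pairs closer than the 16 Å cutoff in a single frame. The function name `count_contacts` and the toy coordinates are hypothetical, and the WHAM reweighting over umbrella windows described in the Methods is not reproduced here.

```python
import math

CUTOFF_ANGSTROM = 16.0  # cutoff quoted in the Methods text


def count_contacts(lipid_beads, site_beads, cutoff=CUTOFF_ANGSTROM):
    """Count lipid/binding-site bead pairs separated by less than `cutoff` (in Å)."""
    return sum(
        1
        for lipid in lipid_beads
        for site in site_beads
        if math.dist(lipid, site) < cutoff
    )


# Toy coordinates in Å (hypothetical, not taken from the simulations).
site = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
lipids = [(0.0, 0.0, 10.0), (0.0, 0.0, 30.0), (20.0, 0.0, 0.0)]

n_contacts = count_contacts(lipids, site)
print(n_contacts)
```

      A larger count means more lipid beads crowd the pocket, which is why the response reads a higher contact number as reduced ATP-site accessibility.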

      (5) Figure 2e-h - I think a bar chart/violin plot/jitter plot would make it easier to compare the peak values. The statistics in the table should just be quoted in the text as value +/- error from the 95% confidence interval. The way it is written currently is confusing, as it implies that there is no conformational change with the addition of EGF in neutral lipids, but there is ~0.4nm one from the table. I don't understand what you mean by "The larger conformational response of these important domains suggests that the intracellular conformation may play a role in downstream signaling steps, such as binding of adaptor proteins"?

      We thank the reviewer for these suggestions. For the smFRET lifetime distributions (Figure 2j, k; previously Figure 2e, f), we have now included jitter plots of the donor lifetimes in Supplementary Figure 11 to facilitate direct visual comparison of the medians and distribution widths for each lipid composition and ±EGF condition. The distance distributions for the ATP to C-terminus in Figure 2e, f (previously Figure 2g, h) were obtained from umbrella-sampling simulations that calculate free-energy profiles rather than raw, unbiased distance values. Because the sampling is guided by biasing potentials, individual distance values cannot be used to construct violin or jitter plots. We therefore present the simulation data only as probability density distributions, which best reflect the equilibrium distributions derived from them.

      We have also revised the text to report the median ± 95% confidence interval, improving clarity and consistency with the statistical table.

      Results, page 9: “In the neutral bilayer (0% POPS), the distribution in the absence of EGF peaks at 8.1 nm (95% CI: 8.0–8.2 nm) and in the presence of EGF peaks at 8.6 nm (95% CI: 8.5–8.7 nm) (Table 1, Supplementary Table 1). In the physiological regime of 30% POPS nanodiscs, the peak of the donor lifetime distribution shifts from 9.1 nm (95% CI: 8.9–9.2 nm) in the absence of EGF to 11.6 nm (95% CI: 11.1–12.6 nm) in the presence of EGF (Table 1, Supplementary Table 1), which is a larger EGF-induced conformational response than in neutral lipids.”

      Finally, we have rephrased the sentence in question for clarity. The revised text now reads:

      Results, page 9: “The larger conformational response observed in the presence of anionic lipids suggests that these lipids enhance the responsiveness of the intracellular domains to EGF, potentially ensuring interactions between C-terminal sites and adaptor proteins during downstream signaling.”

      (6) "r, highlighting that the charged lipids can enhance the conformational response even for protein regions far away from the plasma membrane" - is it not that the neutral membrane is just very weird and not physiological that EGFR and other proteins don't function properly?

      We agree with the reviewer that completely neutral (0% POPS) membranes are not physiological and likely do not support the native organization or activity of EGFR. We have revised the text to clarify that the 30% POPS condition represents a more native-like lipid environment that restores or stabilizes the expected conformational response, rather than "enhancing" it. The revised sentence now reads:

      Results, page 10: “Both experimental and computational results show a larger EGF-induced conformational change in the partially anionic bilayer, consistent with the notion that a partially anionic lipid bilayer provides a more native environment that supports proper receptor activation, compared to the non-physiological neutral membrane.”

      (7) "snap surface 594 on the C-terminal tail as the donor and the fluorescently-labeled lipid (Cy5) as the acceptor (Supplementary Fig. 2, 11)." Why not refer to Figure 3a here to make it easier to read?

      We have added the reference to Figure 3a, and we thank the Reviewer for the suggestion.

      (8) Figure 3 - the bimodality in many of these plots is dubious. It's very clear in some, i.e. 0% POPS +EGF, but not others. Can anything be done to justify bimodality better?

      We agree that statistical justification is important for interpreting lifetime distributions. To address this, we performed global fits of the data with both two- and three-Gaussian models and evaluated them using the Bayesian Information Criterion (BIC), which balances the model fit with a penalty for additional parameters. The three-Gaussian model gave a substantially lower BIC, indicating statistical preference for the more complex model. However, we also assessed the separability of the Gaussian components using Ashman’s D, which quantifies whether peaks are distinct. This analysis showed that two of the Gaussians are not separable, implying they represent one broad distribution rather than two discrete states. Therefore, when all the distributions are fit globally, the data are best described as two Gaussians, one centered at ~1.3 ns and the other at ~2.7 ns, with cholesterol-dependent shifts reflecting changes in the distribution of this population rather than the emergence of a separate state. We better justified our choice of model by incorporating the results of the global two- vs three-Gaussian fits with BIC and Ashman’s D analysis in the revised manuscript.

      Methods, page 27: “Model Selection and Statistical Analysis

      Global fitting of lifetime distributions was performed across all experimental conditions using maximum likelihood estimation. Both two-Gaussian and three-Gaussian distribution models were evaluated as described previously[62]. Model performance was compared using the Bayesian Information Criterion (BIC),[101] which balances model likelihood and complexity according to

      BIC = -2 ln L + k ln n

      where L is the likelihood, k is the number of free parameters, and n is the number of single-molecule photon bunches across all experimental conditions. A lower BIC value indicates a statistically better model[101]. The separation between Gaussian components was subsequently assessed using Ashman’s D, where a score above 2 indicates good separation[102]. For two Gaussian components with means µ1, µ2 and standard deviations σ1, σ2,

      D12 = √2 |µ1 − µ2| / √(σ1² + σ2²)

      where Dij represents the distance metric between Gaussian components i and j. All fitted parameters, likelihood values, BIC scores, and Ashman’s D values are summarized in Supplementary Table 5.”
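      The two model-selection criteria described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' fitting code: it uses the BIC expression given in the Methods and the standard definition of Ashman's D (Ashman et al., 1994). The component means are the values quoted in the response; the widths, log-likelihoods, parameter counts and sample size are hypothetical placeholders.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: -2 ln L + k ln n (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def ashman_d(mu1, sigma1, mu2, sigma2):
    """Ashman's D for two Gaussian components; D > 2 suggests clean separation."""
    return math.sqrt(2.0) * abs(mu1 - mu2) / math.sqrt(sigma1**2 + sigma2**2)


# Hypothetical fit summaries for a single distribution:
# a two-Gaussian mixture has 5 free parameters, a three-Gaussian mixture has 8.
bic_two = bic(log_likelihood=-1000.0, n_params=5, n_obs=500)
bic_three = bic(log_likelihood=-980.0, n_params=8, n_obs=500)

# Means (ns) as quoted in the response; widths (ns) are placeholders.
d_short_vs_long = ashman_d(1.3, 0.4, 2.7, 0.5)
d_long_pair = ashman_d(2.6, 0.5, 3.4, 0.5)

print(f"BIC two = {bic_two:.1f}, BIC three = {bic_three:.1f}")
print(f"D(short, long) = {d_short_vs_long:.2f}, D(long pair) = {d_long_pair:.2f}")
```

      With these placeholder values the three-Gaussian model has the lower BIC, yet two of its components fall below D = 2, which mirrors the logic used in the response to prefer the two-state description.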

      (101) Schwarz, G. Estimating the dimension of a model. The Annals of Statistics, 461–464 (1978).

      (102) Ashman, K. M., Bird, C. M. & Zepf, S. E. Detecting bimodality in astronomical datasets. The Astronomical Journal 108(6), 2348–2361 (1994).

      (9) Figure 3c - can you better label the POPS/POPC on here?

      We thank the reviewer for this suggestion. In the revised manuscript, Figure 3b (previously Figure 3c) has been updated to label the lipid composition corresponding to each smFRET distribution to make the comparison across conditions easier to follow.

      (10) Figure 3g - it looks like cholesterol causes a shift in both the peaks, such that the previous open and closed states are not the same, but that there are 2 new states. This is key as the authors state: "Remarkably, high anionic lipids and cholesterol content produce the same EGFR conformations but with opposite effects on signaling-suppression or enhancement." But this is only true if there really are the same conformational states for all lipid/cholesterol conditions. Again, the bimodal models used for all conditions need to be justified.

      We appreciate the reviewer’s insightful comment. We agree that the interpretation of the lifetime distributions depends on whether cholesterol and anionic lipids modulate existing conformational states or create new ones. To test this, we performed global fits of all distributions using the two- and three-Gaussian models and compared them using the Bayesian Information Criterion (BIC) and Ashman’s D, the results of which are described in detail in response to (8) above.

      Both fitting models, two- and three-Gaussian, identified the same short lifetime component (µ = 1.3 ns), suggesting this reflects a well separated conformation. While the three-Gaussian model gave a lower BIC, Ashman’s D analysis indicated that two of the three components (µ = 2.6 ns and 3.4 ns) are not statistically separable, suggesting they represent a single broad conformational population rather than distinct states. If instead these two components reflected distinct states present under different conditions, Ashman’s D analysis would have found the opposite result. This supports our interpretation that high cholesterol and high anionic lipid content produce similar conformational ensembles with opposite effects on signaling output.

      Finally, we acknowledge that additional conformations may exist, but based on this analysis a bimodal model describes the populations captured in our data and so we limit ourselves to this simplest framework. We have clarified this rationale in the revised manuscript and added the results of the BIC and Ashman’s D analysis to support this interpretation.

      (11) Why are we jumping about between figures in the text? Figure 1d is mentioned after Figure 2. Also, DMPC is shown in the figures way before it is described in the text. It is very confusing. Figure 3 is so compact. I think it should be spread out and only shown in the order presented in the text. Different parts of the figure are referred to seemingly at random in the text. Why is DMPC first in the figure, when it is referred to last in the text?

      Following the Reviewer’s comment, we have revised the figure order and layout to improve readability and ensure consistency with the text. The previous Figures 1d-f which introduce the single-molecule fluorescence setup are now Figure 2g-i, positioned immediately before the first single-molecule FRET experiments (Fig 2j, k). The DMPC distribution in Figure 3 has been moved to the Supplementary Information (Supplementary Fig. 17), where it is shown alongside POPC, as these datasets are compared in the section “Mechanism of cholesterol inhibition of EGFR transmembrane conformational response”. The smFRET distributions in Figure 3 are now presented in the same sequence as they are discussed in the text, and the figure has been spread out for better clarity.

      (12) Throughout, I find the presentation of numerical results, their associated error, and whether they are statistically significantly different from each other confusing. A lot of this is in supplementary tables, but I think these need to go in the main text.

      To improve clarity and ensure that key quantitative results are easily accessible, we have moved the relevant supplementary tables to the main text. Specifically, the following tables have been incorporated into the main manuscript:

      (i) Median distance between the ATP binding site and the EGFR C-terminus, or between the membrane and the EGFR C-terminus, from smFRET measurements (previously Supplementary Table 1, now main Table 1)

      (ii) Median distance between the membrane and the EGFR C-terminus in different anionic lipid environments (previously Supplementary Table 4, now main Table 2)

      (iii) Median distance between the membrane and the EGFR C-terminus in different cholesterol environments (previously Supplementary Tables 8 and 12, now combined as main Table 3)

      (13) Supplementary figures - in general, there is a need to consider how to combine or simplify these for eLife, as they will have to become extended data figures.

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we have reorganized the supplementary figures into extended data figures in accordance with eLife’s format. Specifically:

      - Supplementary Figs. 1–7 are now grouped as Extended Data Figures for Figure 1 in the main text. They are now Figure 1 - figure supplements 1–7.

      - Supplementary Figs. 8–11 are now grouped as Extended Data Figures for Figure 2. They are now Figure 2 - figure supplements 1–4.

      - Supplementary Figs. 12–17 are now grouped as Extended Data Figures for Figure 3. They are now Figure 3 - figure supplements 1–6.

      (14) Supplementary Figure 2 - label what the two bands are in the EGFR and pEGFR sets at the bottom of panel c.

      We thank the reviewer for this comment. The two bands shown in the EGFR and pEGFR blots in Supplementary Fig. 2d (previously Supplementary Fig. 2c) correspond to replicate samples under identical conditions. We have labeled the lanes as “Rep 1” and “Rep 2” in the revised figure and clarified this in the figure legend.

      Supplementary Figure 2, page 31: “(d) Western blots were performed on labelled EGFR in nanodiscs. Anti-EGFR Western blots (left) and anti-phosphotyrosine Western blots (right) tested the presence of EGFR and its ability to undergo tyrosine phosphorylation, respectively, consistent with previous experiments on similar preparations[18, 54, 55]. The two lanes in each blot correspond to replicate samples under identical conditions.”

      (15) Supplementary Figures 3+4 - a bar chart/boxplot or similar would be easier for comparison here.

      In the revised version, we have replaced the histograms with jitter plots showing the nanodisc size distributions for each condition in supplementary figures 4 and 5 (previously supplementary figures 3 and 4). The plots display individual measurements with a horizontal line indicating the mean size (mean ± standard deviation values provided in the caption).

      (16) Supplementary Figures 10, 12, 13, 15, 16 - I would jitter these.

      We have incorporated jitter plots for the relevant datasets in Supplementary Figures 11, 13, 15, 16 and 17 (previously Supplementary Figures 10, 12, 13, 15 and 16) to provide a clearer visualization of the data distributions and median values.

      Reviewer #2 (Recommendations for the authors):

      (1) Reactions were performed in 250 µL volumes. What is the average yield of solubilized EGFR in those reactions? Are there differences in the EGFR solubilization with the various lipid mixtures?

      The amount of solubilized EGFR produced in each 250 µL cell-free reaction was below the reliable detection limit for quantitative absorbance assays. At these protein levels, little to no EGFR precipitation was observed for all lipid compositions. Although exact yields could not be determined, fluorescence-based detection confirmed the presence of functional, nanodisc-incorporated EGFR suitable for smFRET and ensemble fluorescence experiments. We observed variability in total yield between independent reactions within the same lipid composition, which is common for cell-free systems, but no consistent trend attributable to lipid composition.

      (2) Figure S2: It would be better to have a larger overview of the particles on a grid to get a better impression of sample homogeneity.

      TEM images showing a larger field of view have been added for each lipid composition in Supplementary Figures 4 and 5.

      (3) Figure 2b: It appears that there is some variation in the stoichiometry of ApoA1 and EGFR within the samples. Have equal amounts of each sample been analyzed? Are there, in addition, some precipitates of EGFR? It would further be good to have a negative control without expression to get more information about the additional bands in Figure S2b. As they do not appear in the fluorescent gel, it is unlikely that they represent premature terminations of EGFR.

      The fluorescence intensity from the bound ATP analogue (Atto 647N-ATP) and from the snap surface 488 label, which binds stoichiometrically to the SNAP tag at the EGFR C-terminus, was measured for each sample. The relative amount of ATP binding was quantified for each sample by normalizing to the EGFR content (Figure 2b). This normalization accounts for the different amounts of EGFR produced in each condition.
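      The normalization described above amounts to a per-sample ratio of the two fluorescence signals. The sketch below is an assumption about how such a ratio could be computed, not the authors' quantification script; the function name and the band intensities are hypothetical.

```python
def normalized_atp_binding(atp_intensity, egfr_intensity):
    """Ratio of ATP-analogue fluorescence to SNAP-label (EGFR) fluorescence for one sample."""
    if egfr_intensity <= 0:
        raise ValueError("EGFR band intensity must be positive")
    return atp_intensity / egfr_intensity


# Hypothetical band intensities (arbitrary units): (ATP analogue signal, SNAP label signal).
samples = {"sample A": (1200.0, 4000.0), "sample B": (900.0, 1500.0)}

binding = {
    name: normalized_atp_binding(atp, egfr)
    for name, (atp, egfr) in samples.items()
}
print(binding)
```

      In this toy case sample A has the larger raw ATP signal but the smaller normalized value, which is the situation the normalization is meant to correct for when EGFR yield differs between reactions.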

      We did not observe any visible precipitation under the reported cell-free conditions, likely due to the following reasons:

      (i) EGFR and ApoA1 are co-expressed in the cell-free reaction, and ApoA1 assembles into nanodiscs concurrently with receptor translation, providing an immediate membrane sink

      (ii) ApoA1 is expressed at high levels, maintaining disc concentrations that keep the reaction in a soluble regime.

      A control cell-free reaction containing only ApoA1∆49 (1 µg) and no EGFR template, analyzed after affinity purification, showed a single prominent band at ~25 kDa (gel image below), corresponding to ApoA1, along with faint background bands typical of Ni-NTA purification from cell lysates. These weak, non-specific bands likely arise from co-purification of endogenous E. coli proteins.

      The ApoA1∆49-only control gel has now been included as part of the supplementary figure 2.

      (4) Figure S2c: It would be better to show the whole lanes to document the specificity of the antibodies. Anti-Phosphor antibodies are frequently of poor selectivity. In that case, a negative control with corresponding tyrosine mutations would be helpful.

      We have updated Figure S2d (previously Figure S2c) to include the full gel lanes to better illustrate the specificity of both the total EGFR and phospho-EGFR (Y1068) antibodies. The results show a single clear band at the expected molecular weight for EGFR, confirming antibody specificity.

      (5) The Results section already contains quite some discussion. I would thus recommend combining both sections.

      We thank the reviewer for the suggestion. We have now created a results and discussion section to better reflect the content of these paragraphs, with the previous discussion section now a subsection focused on implications of these results.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Significance:

      While most MAVEs measure overall function (which is a complex integration of biochemical properties, including stability), VAMP-seq-type measurements more strongly isolate stability effects in a cellular context. This work seeks to create a simple model for predicting the effect of a mutation on the "abundance" measurement of VAMP-seq.

      We thank the reviewer for their evaluation of our work and for their comments and feedback below.

      Of course, there is always another layer of the onion: VAMP-seq measures contributions from isolated thermodynamic stability, stability conferred by binding partners (small molecule and protein), synthesis/degradation balance (especially important in "degron" motifs), etc. Here the authors' goal is to create simple models that can act as a baseline for two main reasons:

      (1) how to tell when adding more information would be helpful for a global model;

      (2) how to detect when a residue/mutation has an unusual profile indicative of an unbalanced contribution from one of the factors listed above.

      As such, the authors state that this manuscript is not intended to be a state-of-the-art method in variant effect prediction, but rather a direction towards considering static structural information for the VAMP-seq effects. At its core, the method is a fairly traditional asymmetric substitution matrix (I was surprised not to see a comparison to BLOSUM in the manuscript) - and shows that a subdivision by burial makes the model much more predictive. Despite only having 6 datasets, they show predictive power even when the matrices are based on a smaller number. Another success is rationalizing the VAMP-seq results on relevant oligomeric states.

      We thank the reviewer for their summary of the main points of our work. Based on the suggestion by the reviewer, we have added a comparison to predictions with BLOSUM62 to our revised manuscript, noting that we have previously compared the BLOSUM62 matrix to a broader and more heterogeneous set of scores generated by MAVEs (Høie et al, 2022).
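      The burial-split, asymmetric substitution matrix summarized by the reviewer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the relative-accessibility cutoff of 0.2 and the toy scores are all assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean


def build_matrices(variants, burial_cutoff=0.2):
    """Average abundance score per (environment, wild type, mutant) entry.

    `variants` holds (wt, mut, relative_accessibility, score) tuples.
    Entries are direction-specific, so L->D and D->L are stored separately.
    """
    buckets = defaultdict(list)
    for wt, mut, rsa, score in variants:
        env = "buried" if rsa < burial_cutoff else "exposed"
        buckets[(env, wt, mut)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Toy training scores (hypothetical): 1 = wild-type-like, 0 = nonsense-like abundance.
training = [
    ("L", "D", 0.05, 0.1),
    ("L", "D", 0.10, 0.3),
    ("L", "D", 0.60, 0.9),
    ("D", "L", 0.70, 0.8),
]

matrices = build_matrices(training)
print(matrices[("buried", "L", "D")], matrices[("exposed", "L", "D")])
```

      A prediction for a new variant is then a lookup of its (environment, wild type, mutant) entry, which shows why splitting by burial can separate a damaging buried L-to-D substitution from a tolerated exposed one.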

      Specific Feedback:

      Major points:

      The authors spend a good amount of space discussing how the six datasets have different distributions in abundance scores. After the development of their model is there more to say about why? Is there something that can be leveraged here to design maximally informative experiments?

      We believe that these effects arise from a combination of intrinsic differences between the systems and assay-specific effects. For example, biophysical differences between the systems, such as differences in absolute folding stabilities or melting temperatures, will play a role, as will the fact that some proteins contain multiple domains.

      Also, the sequencing-based score for an individual variant in a sort-seq experiment (such as VAMP-seq) depends both on the properties of that variant and on the composition of the entire FACS-sorted cell library. This is because cells are sorted into bins depending on the composition of the entire library, which means that library-to-library composition differences can contribute to the differences between VAMP-seq score distributions. 

      From our developed models and outliers in predictions from these, it is difficult to tell which of the several possible underlying reasons cause the differences. We have briefly expanded the discussion of these points in the manuscript, and we have moreover elaborated on this in subsequent work (Schulze et al., 2025).
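      The library-composition effect described above can be made concrete with a toy sort-seq calculation. The sketch assumes four sorting bins gated at the quartiles of the library's fluorescence and a weighted bin-occupancy score with weights 0.25, 0.5, 0.75 and 1; these choices, the function name and all fluorescence values are illustrative assumptions, and the rescaling to wild-type and nonsense variants used in real pipelines is omitted.

```python
from bisect import bisect_right
from statistics import quantiles

BIN_WEIGHTS = (0.25, 0.5, 0.75, 1.0)  # assumed weights for the four sorting bins


def sort_seq_score(variant_cells, library_cells):
    """Weighted bin-occupancy score for one variant, with gates at the library quartiles."""
    gates = quantiles(library_cells, n=4)  # three cut points define four bins
    counts = [0, 0, 0, 0]
    for signal in variant_cells:
        counts[bisect_right(gates, signal)] += 1
    total = len(variant_cells)
    return sum(w * c / total for w, c in zip(BIN_WEIGHTS, counts))


# Fixed fluorescence values for one variant (arbitrary units, hypothetical).
variant = [4.0, 5.0, 6.0]

# Two hypothetical libraries: one dominated by bright cells, one by dim cells.
bright_library = [float(x) for x in range(1, 21)]
dim_library = [x / 4 for x in range(1, 21)]

score_in_bright = sort_seq_score(variant, bright_library)
score_in_dim = sort_seq_score(variant, dim_library)
print(score_in_bright, score_in_dim)
```

      The variant's cells are identical in both runs, yet the score changes because the gates move with the library, which is the assay-specific contribution the response refers to.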

      They compare to one more "sophisticated model" - RosettaddG - which should be more correlated with thermodynamic stability than other factors measured by VAMP-seq. However, the direct head-to-head comparison between their matrices and ddG is underdeveloped. How can this be used to dissect cases where thermodynamics are not contributing to specific substitution patterns OR in specific residues/regions that are predicted by one method better than the other? This would naturally dovetail into whether there is orthogonal information between these two that could be leveraged to create better predictions.

      We thank the reviewer for this suggestion and indeed had spent substantial effort trying to gain additional biological insights from variants for which MAVE scores or MAVE predictions do not match predicted ∆∆G values. One major caveat in this analysis is that the experimental MAVE scores, MAVE predictions and the predicted ∆∆G values are rather noisy, making it difficult to draw conclusions based on individual variants or even small subsets of variants.

      In our revised manuscript, we have added an analysis to discover residue substitution profiles that are predicted most accurately either by a ∆∆G model or by our substitution matrix model, thereby avoiding analysis of individual variant effect scores. 

      We find that many substitution profiles are predicted equally well by the two model types, but also that there are residues for which one method predicts substitution effects better than the other method. We have added an analysis of the characteristics of the residues and variants for which either the ∆∆G model or the substitution matrix model is most useful to rank variants. Since we only find relatively few residues for which this is the case, we do not expect a model that leverages predicted scores from both methods to perform better than ThermoMPNN across variants. 

      Perhaps beyond the scope of this baseline method, there is also ThermoMPNN and the work from Gabe Rocklin to consider as other approaches that should be more correlated only with thermodynamics.

      We acknowledge that there are other approaches to predict ∆∆G beyond Rosetta, including for example ThermoMPNN and our own method called RaSP (Blaabjerg et al, eLife, 2023), and we have added comparisons to ThermoMPNN and RaSP in the revised manuscript. We are unsure how one would use the data from Rocklin and colleagues directly, but we note that e.g. RaSP has been benchmarked on this data and other methods have been trained on this data. We originally used Rosetta since the Rosetta model is known to be relatively robust and because it has never seen large databases during training (though we do not think that training of ThermoMPNN and RaSP would be biased towards the VAMP-seq data). We note also that we have previously compared both Rosetta calculations and RaSP with VAMP-seq data for TPMT, PTEN and NUDT15 (Blaabjerg et al, eLife, 2023).

      I find myself drawn to the hints of a larger idea that outliers to this model can be helpful in identifying specific aspects of proteostasis. The discussion of S109 is great in this respect, but I can't help but feel there is more to be mined from Figure S9 or other analyses of outlier higher than predicted abundance along linear or tertiary motifs.

      We agree with these points and have previously spent substantial time trying to make sense of outliers in Figure S9 and Figure S18 (Figure S8 and Figure S18 of revised manuscript). The outlier analysis was challenging, in part due to the relatively high noise levels in both experimental data and predictions, and we did not find any clear signals. Some outliers in e.g. Figure S9 are very likely the result of dataset-specific abundance score distributions, which further complicates the outlier analysis. We now note this in the revised paper and hope others will use the data to gain additional insights on proteostasis-specific effects.  

      Reviewer #2 (Public review):

      Summary:

      This study analyzes protein abundance data from six VAMP-seq experiments, comprising over 31,000 single amino acid substitutions, to understand how different amino acids contribute to maintaining cellular protein levels. The authors develop substitution matrices that capture the average effect of amino acid changes on protein abundance in different structural contexts (buried vs. exposed residues). Their key finding is that these simple structure-based matrices can predict mutational effects on abundance with accuracy comparable to more complex physics-based stability calculations (ΔΔG).

      Major strengths:

      (1) The analysis focuses on a single molecular phenotype (abundance) measured using the same experimental approach (VAMP-seq), avoiding confounding factors present when combining data from different phenotypes (e.g., mixing stability, activity, and fitness data) or different experimental methods.

      (2) The demonstration that simple structural features (particularly solvent accessibility) can capture a significant portion of mutational effects on abundance.

      (3) The practical utility of the matrices for analyzing protein interfaces and identifying functionally important surface residues.

      We thank the reviewer for the comments above and the detailed assessment of our work.

      Major weaknesses:

      (1) The statistical rigor of the analysis could be improved. For example, when comparing exposed vs. buried classification of interface residues, or when assessing whether differences between prediction methods are significant.

      We agree with the reviewer that it is useful to determine if interface residues (or any of the residues in the six proteins) can confidently be classified as buried- or exposed-like in terms of their substitution profiles. Thus, we have expanded our approach to compare individual substitution profiles to the average profiles of buried and exposed residues to now account for the noise in the VAMP-seq data. In our updated approach, we resample the abundance score substitution profile for every residue several thousand times based on the experimental VAMP-seq scores and score standard deviations, and we then compare every resampled profile to the average profiles for buried and exposed residues, thereby obtaining residue-specific distributions of RMSD<sub>buried</sub> and RMSD<sub>exposed</sub> values. These RMSD distributions are typically narrow, since many variants in several datasets have small standard deviations. In the revised manuscript, we report a residue to have e.g. a buried-like substitution profile if RMSD<sub>buried</sub> < RMSD<sub>exposed</sub> for at least 95% of the resampled profiles. We do not recalculate average scores in substitution matrices for this analysis.

      Moreover, to illustrate potential overlap in predictive performance between prediction methods more clearly than in our preprint, we have added confidence intervals in Fig. 2 and Fig. 3 of the revised manuscript. We note that the analysis in Fig. 2 is performed using a leave-one-protein-out approach, which we believe provides the cleanest assessment of how well the different models perform.
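
      The leave-one-protein-out scheme and the confidence intervals mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the response does not say how the confidence intervals were computed, so a percentile bootstrap is assumed here, and the Pearson correlation stands in for whichever performance metric the manuscript reports. The function names are hypothetical.

```python
import numpy as np

def leave_one_protein_out(proteins):
    """Yield (held_out, training) pairs so that each protein is predicted
    only from data on the remaining proteins."""
    for held_out in proteins:
        yield held_out, [p for p in proteins if p != held_out]

def bootstrap_ci(pred, obs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Pearson correlation between
    predicted and observed scores (assumed procedure, not the authors')."""
    rng = np.random.default_rng(seed)
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = pred.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample variants with replacement
        stats[b] = np.corrcoef(pred[idx], obs[idx])[0, 1]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

      Overlapping intervals for two prediction methods on the same held-out protein would then suggest that the data do not clearly separate their performance.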

      (2) The mechanistic connection between stability and abundance is assumed rather than explained or investigated. For instance, destabilizing mutations might decrease abundance through protein quality control, but other mechanisms like degron exposure could also be at play.

      We agree that we have not provided much description of the relation between stability and abundance in our original preprint. In the revised manuscript, we provide some more detail as well as references to previous literature explaining the ways in which destabilising mutations can cause degradation. We have moreover performed and added additional analyses of the relationship between thermodynamic stability and abundance through comparisons of stability predictions and predictions performed with our substitution matrix models.

      (3) The similar performance of simple matrix-based and complex physics-based predictions calls for deeper analysis. A systematic comparison of where these approaches agree or differ could illuminate the relationship between stability and abundance. For instance, buried sites showing exposed-like behavior might indicate regions of structural plasticity, while the link between destabilization and degradation might involve partial unfolding exposing typically buried residues. The authors have all the necessary data for such analysis but don't fully exploit this opportunity.

      This is similar to a point made by reviewer 1, and our answer is similar. We were indeed hoping that our analyses would have revealed clearer differences between effects on thermodynamic protein stability and cellular abundance and have tried to find clear signals. One major caveat in performing the suggested analysis is that the experimental MAVE scores, the ∆∆G predictions and our simple matrix-based predictions are all rather noisy, making it difficult to draw conclusions based on individual variants or even small subsets of variants.

      To address this point, we have added an analysis to discover residue substitution profiles that are predicted most accurately either by a ∆∆G model or by our substitution matrix model, thereby avoiding analysis of individual variant effect scores. We find that many substitution profiles are predicted equally well by the two model types, but we also, in particular, find solvent-exposed residues for which the substitution matrix model is the better predictor. These residues are often aspartate, glutamate and proline, suggesting that surface-level substitutions of these amino acid types can often have effects that are not captured well by a thermodynamic model, either because this model does not describe thermodynamic effects perfectly, or because in-cell effects must be accounted for to describe abundance accurately.

      (4) The pooling of data across proteins to construct the matrices needs better justification, given the observed differences in score distributions between proteins (for example, PTEN's distribution is shifted towards high abundance scores while ASPA and PRKN show more binary distributions).

      We agree with the reviewer that the differences between the score distributions are important to investigate further and keep in mind when analysing e.g. prediction outliers. However, our results show that the pooling of VAMP-seq scores across proteins does result in substitution matrices that make sense biochemically and can identify outlier residues with proteostatic functions. As we also respond to a related point by reviewer 1, the differences in score distributions likely have complex origins. In that sense, we also hope that our results can inspire experimentalists to design methods to generate data that are more comparable across proteins.

      For example, biophysical differences between the systems, such as differences in absolute folding stabilities or melting temperatures, will play a role, as will the fact that some proteins contain multiple domains. Also, the sequencing-based score for an individual variant in a sort-seq experiment (such as VAMP-seq) depends both on the properties of that variant and on the composition of the entire FACS-sorted cell library. This is because cells are sorted into bins depending on the composition of the entire library, which means that library-to-library composition differences can contribute to the differences between VAMP-seq score distributions. From our developed models and outliers in predictions from these, it is difficult to tell which of the several possible underlying reasons cause the differences.

      Thus, even when experiments on different proteins are performed using the same technique (VAMP-seq), quantifying the same phenomenon (cellular abundance) and done in similar ways (saturation mutagenesis, sort-seq using four FACS bins), there can still be substantial differences in the results across different systems. An interesting side result of our work is to highlight this variation and how it makes it difficult to learn across experiments. We now elaborate on these points in the revised manuscript.

      (5) Some key methodological choices require better justification. For example, combining "to" and "from" mutation profiles for PCA despite their different behaviors, or using arbitrary thresholds (like 0.05) for residue classification.

      We hope we have explained our methodological choices more clearly in the revised paper.

      We removed the dependency on the threshold of 0.05 used for residue classification in Fig. S19 of the original manuscript; in the revised manuscript we only report a residue to have e.g. a buried-like substitution profile if RMSD<sub>buried</sub> < RMSD<sub>exposed</sub> for at least 95% of the abundance score profiles that we resampled according to VAMP-seq score noise levels, as explained above.

      With respect to combining “to” and “from” mutational profiles for PCA, we could have also chosen to analyse these two sets of profiles separately to take potentially different behaviours along the two mutational axes into account. We do not think that there should be anything wrong with concatenating the two sets of profiles in a single analysis, since the analysis on the concatenated profiles simply expresses amino acid similarities and differences in a more general manner.

      The authors largely achieve their primary aim of showing that simple structural features can predict abundance changes. However, their secondary goal of using the matrices to identify functionally important residues would benefit from more rigorous statistical validation. While the matrices provide a useful baseline for abundance prediction, the paper could offer deeper biological insights by investigating cases where simple structure-based predictions differ from physics-based stability calculations.

      This work provides a valuable resource for the protein science community in the form of easily applicable substitution matrices. The finding that such simple features can match more complex calculations is significant for the field. However, the work's impact would be enhanced by a deeper investigation of the mechanistic implications of the observed patterns, particularly in cases where abundance changes appear decoupled from stability effects.

      We agree that disentangling stability and other effects on cellular abundance is one of the goals of this work. As discussed above, it has been difficult to find clear cases where amino acid substitutions affect abundance without stability beyond for example the (rare) effects of creating surface exposed degrons. Our new analysis, in which we compare substitution matrix-based predictions to stability predictions, does offer deeper insight into the relationship between the two predictor types and hence possibly between folding stability and abundance. 

      Reviewer #3 (Public review): 

      "Effects of residue substitutions on the cellular abundance of proteins" by Schulze and Lindorff-Larsen revisits the classical concept of structure-aware protein substitution matrices through the scope of modern protein structure modelling approaches and comprehensive phenotypic readouts from multiplex assays of variant effects (MAVEs). The authors explore 6 unique protein MAVE datasets based on protein abundance (and thus stability) by utilizing structural information, specifically residue solvent accessibility and secondary structure type, to derive combinations of context-specific substitution matrices predicting variant abundance. They are clear to outline that the aim of the study is not to produce a new best abundance predictor but to showcase the degree of prediction afforded simply by utilizing information on residue accessibility. The performance of their matrices is robustly evaluated using a leave-one-out approach, where the abundance effects for a single protein are predicted using the remaining datasets. Using a simple classification of buried and solvent-exposed residues, and substitution matrices derived respectively for each residue group, the authors convincingly demonstrate that taking structural solvent accessibility contexts into account leads to more accurate performance than either a structure-unaware matrix, secondary structure-based matrix, or matrices combining both solvent accessibility and secondary structure. Interestingly, it is shown that the performance of the simple buried and exposed residue substitution matrices for predicting protein abundance is on par with Rosetta, an established and specialized protein variant stability predictor. More importantly, the authors finish off the paper by demonstrating the utility of the two matrices to identify surface residues that have buried-like substitution profiles, that are shown to correspond to protein interface residues, posttranslational modification sites, functional residues, or putative degrons.

      Strengths:

      The paper makes a strong and well-supported main point, demonstrating the utility of the authors' approach through performance comparisons with alternative substitution matrices and specialized methods alike. The matrices are rigorously evaluated without introducing bias, exploring various combinations of protein datasets. Supplemental analyses are extremely comprehensive and detailed. The applicability of the substitution matrices is explored beyond abundance prediction and could have important implications in the future for identifying functionally relevant sites.

      We thank the reviewer for the supportive comments on our work. 

      Comments:

      (1) A wider discussion of the possible reasons why matrices for certain proteins seem to correlate better than others would be extremely interesting, touching upon possible points like differences or similarities in local environments, degradation pathways, post-translational modifications, and regulation. While the initial data structure differences provide a possible explanation, Figure S17A, B correlations show a more complicated picture.

      We agree with the reviewer that biochemical and biophysical differences between the proteins might contribute to the fact that some matrices correlate better than others. We also agree that it would be very interesting to understand these differences better. While it might be possible to examine some of the suggested causes of the differences, like differences or similarities in local environments, we have generally found that noise and differences in score distributions make such analyses difficult (see also responses to reviewers 1 and 2). For now, we will defer additional analyses to future work.

      (2) The performance analysis in Figure 2D seems to show that for particular proteins "less is more" when it comes to which datasets are best to derive the matrix from (CYP2C9, ASPA, PRKN). Are there any features (direct or proxy), that would allow to group proteins to maximize accuracy? Do the authors think on top of the buried vs exposed paradigm, another grouping dimension at the protein/domain level could improve performance?

      We don’t currently know if any protein- or domain-level features could be used to further split residues into useful categories for constructing new substitution matrices, but it is an interesting suggestion. We note that every substitution matrix consists of 380 averages, and creating too many residue groupings will cause some matrix entries to be averaged over very few abundance scores, at least with the current number of scores in the pooled VAMP-seq dataset. For example, while previous work has shown different mutational effects e.g. in helices and sheets (as one would expect), we find that a model with six matrices ({buried,exposed}x{helix,sheet,other}) does not lead to improved predictions (Fig. 2C), presumably because of an unfavourable balance between parameters and data.
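
      The balance between parameters and data mentioned above can be made concrete with a short calculation. This is a rough sketch only: the figure of 31,000 scores is taken from Reviewer #2's summary ("over 31,000 single amino acid substitutions") and is used as a lower bound, and the calculation assumes scores are spread evenly over matrix entries, which they are not in practice.

```python
N_AMINO_ACIDS = 20
# Each wild-type residue can be replaced by 19 others: 20 * 19 = 380 entries.
ENTRIES_PER_MATRIX = N_AMINO_ACIDS * (N_AMINO_ACIDS - 1)

def mean_scores_per_entry(n_scores, n_matrices):
    """Average number of abundance scores behind each matrix entry,
    assuming an even spread over all entries (a simplification)."""
    return n_scores / (ENTRIES_PER_MATRIX * n_matrices)

# One pooled matrix, {buried, exposed}, and {buried, exposed} x {helix, sheet, other}
for n_matrices in (1, 2, 6):
    print(n_matrices, round(mean_scores_per_entry(31_000, n_matrices), 1))
```

      Under these assumptions, a single matrix has roughly 82 scores per entry, two matrices roughly 41, and six matrices roughly 14, before accounting for rare residue types and environments that receive far fewer.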

      (3) While the matrices and Rosetta seem to show similar degrees of correlation, do the methods both fail and succeed on the same variants? Or do they show a degree of orthogonality and could potentially be synergistic?

      These are good questions and are related to similar questions from reviewers 1 and 2. In the revised manuscript, we have added additional analyses of differences between predictions from our substitution matrix model and a stability model, and we indeed find that the two methods show a degree of orthogonality. However, since we identify only relatively few residues for which one method performs better than the other, we don’t expect a synergistic model to outperform the stability predictor across all variants in any of the six proteins.  

      Overall, this work presents a valuable contribution by creatively utilizing a simple concept through cutting-edge datasets, which could be useful in various.

      Reviewing Editor:

      As discussed in more detail below, to strengthen the assessment, the authors are encouraged to:

      (1) Include more thorough statistical analyses, such as confidence intervals or standard errors, to better validate key claims (e.g., RMSD comparisons).

      (2) Perform a deeper comparison between substitution response matrices and ΔΔG-based predictions to uncover areas of agreement or orthogonality.

      (3) Clarify the relationship between structural features, stability, and abundance to provide more mechanistic insights.

      As discussed above and below, we have added new analyses and clarifications to the revised manuscript.

      Reviewer #1 (Recommendations for the authors):

      Minor points:

      Why is a continuous version of the contact number used here, instead of a discrete count of neighbouring residues? WCN values of the residues in the core domain can be affected by residues far away (small contribution but not strictly zero; if there are many of them, it adds up).

      We have previously found WCN, which quantifies residue contact numbers in a continuous manner, to be a useful input feature for a classifier that determines whether individual residues are important for maintaining protein abundance or function (Cagiada et al, 2023). We have also found WCN and the cellular abundance of single substitution variants to correlate well in individual analyses of different proteins (Grønbæk-Thygesen et al., 2024; Gersing et al., 2024; Clausen et al., 2024).

      We have calculated the WCN as well as a contact number based on discrete counts of neighbouring residues for the six proteins in our dataset. When distances between residues are evaluated in the same way (i.e. using the shortest distance between any pair of heavy atoms in the side chains), and when the cutoff value used for the discrete count is equal to the r<sub>0</sub> of the WCN function, the continuous and discrete evaluations of residue contact numbers are highly and linearly correlated, and their rank correlations with the VAMP-seq data are very similar. We only observe minor contributions to the WCN from residues far away in the structure.
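
      The comparison between a continuous and a discrete contact number can be illustrated with a small sketch. The functional form of the WCN is not given in this response, so the smooth switching function 1/(1 + (r/r0)^6) used below is an assumption chosen for illustration, as are the function name and the toy distances.

```python
import numpy as np

def contact_numbers(dist, r0):
    """Return (continuous, discrete) contact numbers per residue.

    dist: symmetric (N, N) matrix of residue-residue distances.
    r0:   switching distance for the continuous count and hard cutoff
          for the discrete count (same units as dist).
    """
    off_diag = ~np.eye(dist.shape[0], dtype=bool)  # exclude self-contacts
    x = dist / r0
    # Assumed switching function: ~1 for r << r0, 0.5 at r = r0,
    # small but non-zero for r >> r0.
    switch = 1.0 / (1.0 + x ** 6)
    wcn = np.sum(switch * off_diag, axis=1)
    discrete = np.sum((dist < r0) & off_diag, axis=1)
    return wcn, discrete
```

      With this form, a residue whose neighbours all lie well beyond r0 has a discrete count of zero but a small positive continuous count, which is the far-residue contribution raised by the reviewer.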

      Typos in SI figure captions e.g. Figure S8-11 "All predictions were performed using using...."

      Thank you for pointing this out. We have corrected the typos in Figure S8-11 (Figure S7-S10 in the revised manuscript).

      Personally, I'd appreciate a definition of these new substitution matrices under the constraints of rASA/WCN values. It was unclear to me until I read the code but we think that the definition is averaging the substitution matrix based on the clusters they are assigned to. If so, this could be straightforwardly defined in the method section with a heaviside step function.

      We have added a definition of the “buried” and “exposed” substitution matrices as a function of rASA in the methods section (“Definitions of buried and exposed residues” and “Definition of substitution matrices”) of the manuscript, as well as a definition of how we classified residues as either buried or exposed using both rASA and WCN as input. Our final substitution matrices, as shown in e.g. Fig. 2, do not depend on the WCN; only the substitution matrix results in Figure S6 (Figure S20 in the revised manuscript) depend on both WCN and rASA.
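
      A definition of this kind can be sketched in code: each variant is assigned to the buried or exposed class by a step function of rASA, and each matrix entry is the mean abundance score over all pooled variants with that wild-type residue, mutant residue and class. This is an illustration rather than the code used for the manuscript; the cutoff value of 0.2 is a placeholder, since the actual threshold is not stated in this response, and the function name is hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def build_matrices(variants, rasa_cutoff=0.2):
    """Build (buried, exposed) 20x20 matrices of mean abundance scores.

    variants: iterable of (wild_type, mutant, rasa, score) tuples pooled
              across proteins.
    rasa_cutoff: placeholder threshold separating buried from exposed.
    """
    sums = np.zeros((2, 20, 20))
    counts = np.zeros((2, 20, 20))
    for wt, mut, rasa, score in variants:
        env = int(rasa > rasa_cutoff)  # step function: 0 = buried, 1 = exposed
        sums[env, AA_INDEX[wt], AA_INDEX[mut]] += score
        counts[env, AA_INDEX[wt], AA_INDEX[mut]] += 1
    # Entries without data (including the diagonal) become NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts
    return means[0], means[1]
```

      A prediction for a new variant is then a lookup: the entry for its wild-type and mutant residue in the matrix matching the environment of the mutated position.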

      Reviewer #2 (Recommendations for the authors):

      The following suggestions aim to strengthen the analysis and clarify the presentation of your findings:

      (1) Specific analyses to consider:

      (1.1) Analyze buried positions where the exposed matrix performs better. Understanding these cases might reveal properties of protein core regions that show unexpected mutational tolerance.

      We agree with the reviewer that a more detailed analysis of buried residues with exposed-like substitution profiles would be very interesting.

      We note that for proteins where the VAMP-seq score distribution is shifted towards high values (as it is the case for PTEN, TPMT and CYP2C9), our identification of such residues may be a result of the score distribution differences between the six datasets. To confidently identify mutationally tolerant core regions, it would be best to (a) correct for the distribution differences prior to the analysis or (b) focus the analysis on residues that fall far below the diagonal in Figure S18.

      In additional data (which can be found at https://github.com/KULL-Centre/_2024_Schulze_abundance-analysis), we provide, for each of the proteins, a list of buried residues for which RMSD<sub>exposed</sub> < RMSD<sub>buried</sub> (for more than 95% of resampled substitution profiles, as described under 1.6). We have not analysed these residues further.

      (1.2) A systematic comparison of matrix-based vs. ΔΔG-based predictions could help understand both exposed sites that behave as buried (as analyzed in the paper) and buried sites that behave as exposed (1.1), potentially revealing mechanisms underlying abundance changes.

      In our revised manuscript, we have added additional analyses to compare matrix-based and ΔΔG-based predictions, focusing on exposed sites for which one prediction method captures variant effects on abundance considerably better than the other prediction method. We have not investigated buried sites with exposed-like behaviour any further in this work.

      (1.3) Explore different normalization approaches when pooling data across proteins. In particular, consider using log(abundance score): if the experimental error in abundance measurements is multiplicative (which can be checked from the reported standard errors), then log transformation would convert this into a constant additive error, making the analysis more statistically sound.

      As we answer below to point 2.2, the abundance scores are, within each dataset, min-max normalised to nonsense and synonymous variant scores, and the score scale is thus in this way consistent across the six datasets. We have explained above and in the revised manuscript that abundance score distribution differences across datasets are likely partially a result of the FACS binning of assay-specific variant libraries. Using only the VAMP-seq scores (that is, without further information about the individual experiments), we cannot correct for the influence of the sorting strategy on the reported scores. A score normalisation across datasets that places all data points on a single scale would require inter-dataset reference variant scores, which we do not have. We note that in a subsequent manuscript (Schulze et al, bioRxiv, 2025) we have attempted to take system- and experiment-specific score distributions into account. We now refer to this work in the revised manuscript.
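
      The within-dataset min-max normalisation described above can be sketched as follows. The sketch assumes that the reference points are the medians of the nonsense and synonymous variant scores; the response does not state which summary statistic is used, and the function name is hypothetical.

```python
import numpy as np

def minmax_normalise(raw_scores, nonsense_scores, synonymous_scores):
    """Rescale raw scores so that nonsense variants map to about 0 and
    synonymous (wild-type-like) variants map to about 1.

    Using the median as the reference statistic is an assumption.
    """
    low = np.median(nonsense_scores)
    high = np.median(synonymous_scores)
    return (np.asarray(raw_scores, dtype=float) - low) / (high - low)
```

      This puts every dataset on the same nominal 0 to 1 scale, but it does not remove differences in the shape of the score distribution between those two reference points, which is the limitation discussed above.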

      (1.4) Consider using correlation coefficients between predicted and observed abundance profiles as an alternative to RMSD, which is sensitive to the absolute values of the scores.

      We agree with the reviewer that using correlation coefficients to compare substitution profiles might also be useful, in particular for datasets with relatively unique VAMP-seq score distributions, such as the ASPA dataset. To explore this idea, we have repeated the analysis presented in Fig. S18 using the Pearson correlation coefficient r rather than the RMSD.

      As in Fig. S18, we derive r<sub>buried</sub> and r<sub>exposed</sub> for every residue in the six proteins, specifically by calculating r between the abundance score substitution profile of every individual residue and the average abundance score substitution profiles of buried and exposed residues. VAMP-seq data for the protein for which r<sub>buried</sub> and r<sub>exposed</sub> are evaluated is omitted from the calculation of average abundance score substitution profiles, and we use only monomer structures to determine whether residues are buried or exposed. 

      We show the results of this analysis in Author response image 1 below. In each panel of the figure, r<sub>buried</sub> and r<sub>exposed</sub> are shown for individual residues of a single protein. Blue datapoints indicate residues that are solvent-exposed in the wild-type protein structures, and yellow datapoints indicate residues that are buried in the wild-type structures. Residues for which it is not the case that r<sub>buried</sub> < r<sub>exposed</sub> or r<sub>exposed</sub> < r<sub>buried</sub> in more than 95% of 1000 resampled residue substitution profiles (see explanation of resampling method above) are coloured grey. “Acc.” is the balanced classification accuracy, calculated using all non-grey datapoints, indicating how many buried residues have buried-like substitution profiles (r<sub>exposed</sub> < r<sub>buried</sub>) and how many solvent-exposed residues have exposed-like substitution profiles (r<sub>buried</sub> < r<sub>exposed</sub>). The classification accuracy per protein in this figure cannot be compared to the classification accuracy of the same protein in Fig. S18, since the number of datapoints used in the accuracy calculation differs between the r- and RMSD-based analyses.

      Author response image 1.

      Comparing the r-based approach to the RMSD-based approach (Fig. S18), it is clear that the r-based method is less robust than the RMSD-based method for noisy and incomplete datasets. For the noisiest and most mutationally incomplete VAMP-seq datasets (i.e., PTEN, TPMT and CYP2C9) (Fig. 1), there are relatively few residues for which we with high confidence can determine if the substitution profile is more buried- or more exposed-like. When the VAMP-seq data is less noisy and has high mutational completeness, the r-based method becomes more robust and may thus be relevant in potential future work on new VAMP-seq data with small error bars.

      In conclusion, we find that the RMSD-based approach to compare substitution profiles is more robust than an r-based approach for several of the VAMP-seq datasets that are included in our analysis. We do believe that an approach based on the correlation coefficient, or potentially several metrics, could be relevant to use, since abundance score distributions from VAMP-seq datasets can differ significantly across datasets. So as not to increase the length of the main text of our manuscript, we have not added this analysis to the revised manuscript.

      (1.5) Consider treating missing abundance scores as zero values, as they might indicate variants with very low abundance, rather than omitting them from the analysis.

      This suggestion would be most relevant for the PTEN, TPMT and CYP2C9 datasets, which all have a relatively small average mutational depth and completeness, as shown in Fig. 1B and 1C. To assess if setting missing abundance scores as zero values would be reasonable, we have compared the distributions of predicted ΔΔG values (from RaSP and ThermoMPNN) and of predicted abundance scores (from our exposure-based substitution matrices) for variants with reported and missing VAMP-seq data. We show the result in Author response image 2, with data aggregated across the six protein systems:

      Author response image 2.

      We find that variants with and without VAMP-seq data have similar ΔΔG score distributions and similar predicted abundance score distributions, and there is thus no clear enrichment of predicted loss of abundance for variants with missing VAMP-seq scores. This suggests that missing abundance scores do not necessarily indicate very low abundance. One cause of missing data might instead be problems with library generation (Matreyek et al, 2018, 2021).

      We show in Fig. S9 (Fig. S8 of the revised manuscript) that predicted scores for variants with experimental abundance scores of 0 are often overestimated for NUDT15, ASPA and PRKN, but this is not so much a problem for PTEN, TPMT and CYP2C9, the datasets with most missing scores. The lack of an enrichment of low abundance variants from the various predictors would thus still support that missing scores do not necessarily indicate low abundance.

      (1.6) Develop a proper statistical framework for comparing buried vs exposed predictions (whether using RMSD or correlations), including confidence intervals, rather than using arbitrary thresholds.

      As explained above and in the methods section of our revised manuscript, we have expanded our approach to compare the substitution profile of a residue to the average profiles of buried and exposed residues, and our method now accounts for the noise in the VAMP-seq data, making the analysis more statistically rigorous. In our expanded approach, we compare the substitution profiles of individual residues to the average profiles for buried and exposed residues 10,000 times per residue to get a residue-specific distribution of RMSD<sub>buried</sub> and RMSD<sub>exposed</sub> values. Individual RMSD<sub>buried</sub> and RMSD<sub>exposed</sub> values are calculated by resampling abundance scores from a Gaussian distribution defined by the experimentally reported abundance score and abundance score standard deviation per variant. We now only report a residue to have e.g. a buried-like substitution profile if RMSD<sub>buried</sub> < RMSD<sub>exposed</sub> in at least 95% of our samples. We do not recalculate average scores in substitution matrices for this analysis. We have updated the plots in our manuscript, e.g. in Fig. S18 and S19 of the revised version, to indicate which residues are confidently classified as buried- or exposed-like.
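
      The resampling scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the manuscript: the function names (`rmsd`, `classify_residue`), the list-based data layout and the fixed random seed are all hypothetical choices made for the example. The average buried and exposed profiles are held fixed across samples, in line with the statement that average scores are not recalculated.

```python
import math
import random


def rmsd(profile_a, profile_b):
    """Root-mean-square deviation between two equally long score profiles."""
    squared = [(a - b) ** 2 for a, b in zip(profile_a, profile_b)]
    return math.sqrt(sum(squared) / len(squared))


def classify_residue(scores, sds, avg_buried, avg_exposed,
                     n_samples=10_000, threshold=0.95, seed=0):
    """Classify one residue as buried-like, exposed-like or unclassified.

    scores, sds  : reported abundance score and standard deviation per variant
    avg_buried   : average buried-residue scores for the same substitutions
    avg_exposed  : average exposed-residue scores for the same substitutions
    (argument names and layout are assumptions for this sketch)
    """
    rng = random.Random(seed)
    buried_wins = 0
    exposed_wins = 0
    for _ in range(n_samples):
        # Resample each variant score from a Gaussian defined by its mean and SD.
        sample = [rng.gauss(mu, sd) for mu, sd in zip(scores, sds)]
        r_buried = rmsd(sample, avg_buried)
        r_exposed = rmsd(sample, avg_exposed)
        if r_buried < r_exposed:
            buried_wins += 1
        elif r_exposed < r_buried:
            exposed_wins += 1
    if buried_wins / n_samples >= threshold:
        return "buried-like"
    if exposed_wins / n_samples >= threshold:
        return "exposed-like"
    return "unclassified"
```

      With this layout, a residue whose noisy profile sits between the two average profiles is left unclassified rather than being forced into either category.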

      (2) Presentation improvements:

      (2.1) In Figure 4, consider removing the average abundance scores, which are not directly related to the RMSD comparison being shown.

      We have decided to keep the average abundance scores in Fig. 4 (now Fig. 5), as we find the average abundance scores useful for guiding interpretation of the RMSD values. For example, an unusually small average abundance score with a relatively small standard deviation may explain a case where RMSD<sub>buried</sub> and RMSD<sub>exposed</sub> are both large. This is for example the case for residue G185 in ASPA. 

      In our preprint, the error bars on the average abundance scores in Fig. 4 (now Fig. 5) indicated the standard deviation across the abundance scores that were used to calculate the average per position. We have removed these error bars in the revised manuscript, as we realised that these were not necessarily helpful to the reader.

      (2.2) I am assuming that abundance scores are defined as the ratio abundance_variant/abundance_wt throughout the analysis, but I don't think this has been explicitly defined. If this is correct, please state it explicitly. In such case, log(abundance_score) would have a simple interpretation as the difference in abundance between variant and wild-type.

      Abundance scores are defined throughout the manuscript as sequence-based scores that have been min-max normalised to the abundance of nonsense and synonymous variants, i.e. abundance_score = (abundance_variant – abundance_nonsense)/(abundance_wt – abundance_nonsense). We have described the normalisation of scores to wild-type and nonsense variant abundance in lines 164-166 of the original manuscript. We have now added additional information about the normalisation scheme in the methods section. We note that we did not ourselves apply this normalisation to the data; the scores were reported in this manner in the original publications that reported the VAMP-seq experiments for the six proteins.
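
      For concreteness, the min-max normalisation can be written as a short function. This is only a sketch of the formula as stated in this response: the function and argument names are hypothetical, and the raw values passed in stand for whatever unnormalised abundance measure the original VAMP-seq publications used.

```python
def normalise_abundance(raw_variant, raw_wt, raw_nonsense):
    """Min-max normalise a raw abundance value.

    The result is 0 for a nonsense-like variant and 1 for a wild-type-like
    variant; values outside [0, 1] are possible and are not clipped here.
    """
    span = raw_wt - raw_nonsense
    if span == 0:
        raise ValueError("wild-type and nonsense reference values must differ")
    return (raw_variant - raw_nonsense) / span
```

      Under this definition a score of 0.5 means that the variant lies halfway between the nonsense and wild-type reference values, not that it has half the wild-type abundance.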

      (2.3) Consider renaming "rASA" to the more commonly used "RSA" for relative solvent accessibility.

      We have decided to keep using “rASA” throughout the manuscript.

      (2.4) The weighted contact number function used differs from the established WCN measure (Σ1/rij²) introduced by Lin et al. (2008, Proteins). This should be acknowledged and the choice of alternative weighting scheme justified.

      As we have also responded to the first minor point of reviewer 1, we have previously found WCN, as it is defined in our manuscript, to be a useful input feature for a classifier that determines whether individual residues are important for maintaining protein abundance or function (Cagiada et al, 2023). We have also previously found this type of WCN to correlate well with variant abundance of individual proteins, as measured with VAMP-seq or protein fragment complementation assays (Grønbæk-Thygesen et al., 2024; Clausen et al., 2024; Gersing et al., 2024). We acknowledge that residue contact numbers or weighted contact numbers could also be expressed in other ways and that alternative contact number definitions would likely also produce values that correlate well with VAMP-seq data. Since the WCN, as defined in our manuscript, already correlates relatively well with abundance scores, we have not explored whether alternative definitions produce better correlations.  
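
      For reference, the inverse-square contact number that the reviewer cites (Σ1/rij²) can be computed as in the sketch below. This illustrates the reviewer's formula only and is not the WCN definition used in our manuscript; the function name and the choice of one coordinate per residue (for example the Cα atom) are assumptions made for the example.

```python
def wcn_inverse_square(coords):
    """Return WCN_i = sum over j != i of 1 / r_ij**2 for each residue.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    Two residues at identical coordinates raise ZeroDivisionError.
    """
    values = []
    for i, ci in enumerate(coords):
        total = 0.0
        for j, cj in enumerate(coords):
            if i == j:
                continue
            # Squared distance between the two representative atoms.
            r_squared = sum((a - b) ** 2 for a, b in zip(ci, cj))
            total += 1.0 / r_squared
        values.append(total)
    return values
```

      Densely packed residues receive large values and peripheral residues small ones, so any alternative distance weighting would be expected to rank residues in a broadly similar way.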

      (2.5) Replace the phrase "in the above" with specific references to sections or simply "above" where appropriate. Also, consider replacing many instances of "moreover" with simpler alternatives such as "also" or "in addition" to improve readability.

      We have changed several sentences according to this suggestion and hope that we have improved the readability of our manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) It should be explicitly confirmed earlier that complex structures are used for NUDT15 and ASPA when assessing rASA/WCN. Additionally, it would be interesting to see the effect that deriving the matrices using NUDT15 and ASPA monomers would have.

      We have commented on the use of NUDT15 and ASPA homodimer structures earlier in the revised manuscript (specifically, in the subsection “Abundance scores correlate with the degree of residue solvent-exposure”).

      When residues are classified using monomer rather than dimer structures of NUDT15 and ASPA, there is a small effect on the resulting “buried” and “exposed” substitution matrices. Entries in the substitution matrices calculated using monomer and dimer structures typically differ by less than 0.05, and only a single entry differs by more than 0.1. As expected, the “exposed” matrix tends to contain slightly larger numbers when derived from dimer structures than when derived from monomer structures, meaning that when the interface residues are included in the exposed residue category, the average abundance scores of the “exposed” matrix are lowered. For buried residues, the picture is more mixed, although the overall tendency is that the interface residues make the “buried” matrix contain smaller average abundance scores for dimer compared to monomer structures. These results generally support the use of dimer structures for the residue classification.

      We here show the differences between the substitution matrices calculated with dimer or monomer structures of NUDT15 and ASPA and using data for all six proteins in our combined VAMP-seq dataset (average_abundance_score_difference = average_abundance_score_dimers – average_abundance_score_monomers):

      Author response image 3.

      We have not explored these alternative matrices further.

      (2) While the supplemental analyses are rigorous, the abundance of various metrics being presented can be confusing, especially when they seem to differ in their result. For instance, the discussion of Figure S17 (paragraph starting 428) contains mentions of mean differences but then switches to correlations, while both are presented for all panels. The claim "The datasets thus mainly differ due to differences in substitution effects in buried environments." is well supported by the observed mean differences, but for Pearson's correlations the average panel A, B values of buried 0.421 vs exposed 0.427 are hardly different. Which of the metrics is more meaningful, and are both needed?

      We agree with the reviewer that the claim that “The datasets thus mainly differ due to differences in substitution effects in buried environments” is not well-supported by the r between the substitution matrices, and we have removed this claim from the text.

      Since some datasets share VAMP-seq score distribution features, while others do not, the absolute difference between scores or matrices may be relevant to check for some dataset pairs, while the r may be more relevant to check for other dataset pairs. Hence, we have included both metrics in Fig S17 (Fig S11 in the revised manuscript).

      (3) Lines 337-340 - does not feel like S7 is the topic, perhaps the authors meant Figure 2A, B? In general, the supplemental figure references are out of order and panel combinations are sometimes confusing.

      We have corrected the figure references and rearranged the supplemental figures so that they now occur in the correct order. We have looked through the panel combinations with clarity in mind, and hope that the current set of main and supplementary figures balances overview and detail.

      (4) Line 363 "are also are also".

      We have corrected this typo.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This is an excellent study by a superb investigator who discovered and is championing the field of migrasomes. This study contains a hidden "gem" - the induction of migrasomes by hypotonicity and how that happens. In summary, an outstanding fundamental phenomenon (migrasomes) en route to becoming translationally highly significant.

      Strengths:

      Innovative approach at several levels. Migrasomes - discovered by Dr Yu's group - are an outstanding biological phenomenon of fundamental interest and now of potentially practical value.

      Weaknesses:

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      We sincerely thank the reviewer for the encouraging and insightful comments. We fully agree that the fundamental aspects of migrasome biology are of great importance and deserve deeper exploration.

      In line with the reviewer’s suggestion, we have expanded our discussion on the basic biology of engineered migrasomes (eMigs). A recent study by the Okochi group at the Tokyo Institute of Technology demonstrated that hypoosmotic stress induces the formation of migrasome-like vesicles, involving cytoplasmic influx and requiring cholesterol for their formation (DOI: 10.1002/1873-3468.14816, February 2024). Building on this, our study provides a detailed characterization of hypoosmotic stress-induced eMig formation, and further compares the biophysical properties of natural migrasomes and eMigs. Notably, the inherent stability of eMigs makes them particularly promising as a vaccine platform.

      Finally, we would like to note that our laboratory continues to investigate multiple aspects of migrasome biology. In collaboration with our colleagues, we recently completed a study elucidating the mechanical forces involved in migrasome formation (DOI: 10.1016/j.bpj.2024.12.029), which further complements the findings presented here.

      Reviewer #2 (Public review):

      Summary:

      The authors' report describes a novel vaccine platform derived from a newly discovered organelle called a migrasome. First, the authors address a technical hurdle in using migrasomes as a vaccine platform. Natural migrasome formation occurs at low levels and is labor intensive; however, by understanding the molecular underpinning of migrasome formation, the authors have designed a method to make engineered migrasomes from cultured cells at higher yields utilizing a robust process. These engineered migrasomes behave like natural migrasomes. Next, the authors immunized mice with migrasomes that either expressed a model peptide or the SARS-CoV-2 spike protein. Antibodies against the spike protein were raised that could be boosted by a 2nd vaccination and these antibodies were functional as assessed by an in vitro pseudoviral assay. This new vaccine platform has the potential to overcome obstacles such as cold chain issues for vaccines like messenger RNA that require very stringent storage conditions.

      Strengths:

      The authors present very robust studies detailing the biology behind migrasome formation and this fundamental understanding was used to form engineered migrasomes, which makes it possible to utilize migrasomes as a vaccine platform. The characterization of engineered migrasomes is thorough and establishes comparability with naturally occurring migrasomes. The biophysical characterization of the migrasomes is well done including thermal stability and characterization of the particle size (important characterizations for a good vaccine).

      Weaknesses:

      With a new vaccine platform technology, it would be nice to compare them head-to-head against a proven technology. The authors would improve the manuscript if they made some comparisons to other vaccine platforms such as a SARS-CoV-2 mRNA vaccine or even an adjuvanted recombinant spike protein. This would demonstrate a migrasome-based vaccine could elicit responses comparable to a proven vaccine technology.

      We thank the reviewer for the thoughtful evaluation and constructive suggestions, which have helped us strengthen the manuscript. 

      Comparison with proven vaccine technologies:

      In response to the reviewer’s comment, we now include a direct comparison of the antibody responses elicited by eMig-Spike and a conventional recombinant S1 protein vaccine formulated with Alum. As shown in the revised manuscript (Author response image 1), the levels of S1-specific IgG induced by the eMig-based platform were comparable to those induced by the S1+Alum formulation. This comparison supports the potential of eMigs as a competitive alternative to established vaccine platforms. 

      Author response image 1.

      eMigrasome-based vaccination showed similar efficacy compared with adjuvanted recombinant spike protein. The amount of S1-specific IgG in mouse serum was quantified by ELISA on day 14 after immunization. Mice were either intraperitoneally (i.p.) immunized with recombinant Alum/S1 or intravenously (i.v.) immunized with eM-NC, eM-S or recombinant S1. The administered doses were 20 µg/mouse for eMigrasomes, 10 µg/mouse (i.v.) or 50 µg/mouse (i.p.) for recombinant S1 and 50 µl/mouse for Aluminium adjuvant.

      Assessment of antigen integrity on migrasomes:

      To address the reviewer’s suggestion regarding antigen integrity, we performed immunoblotting using antibodies against both S1 and mCherry. Two distinct bands were observed: one at the expected molecular weight of the S-mCherry fusion protein, and a higher molecular weight band that may represent oligomerized or higher-order forms of the Spike protein (Figure 5b in the revised manuscript).

      Furthermore, we performed confocal microscopy using a monoclonal antibody against Spike (anti-S). Co-localization analysis revealed strong overlap between the mCherry fluorescence and anti-Spike staining, confirming the proper presentation and surface localization of intact S-mCherry fusion protein on eMigs (Figure 5c in the revised manuscript). These results confirm the structural integrity and antigenic fidelity of the Spike protein expressed on eMigs.

      Recommendations for the authors

      Reviewer #1 (Recommendations For The Authors):

      I feel that the overemphasis on practical aspects (vaccine), however important, eclipses some of the fundamental aspects that may be just as important and actually more interesting. If this can be expanded, the study would be outstanding.

      I know that the reviewers always ask for more, and this is not the case here. Can the abstract and title be changed to emphasize the science behind migrasome formation, and possibly add a few more fundamental aspects on how hypotonic shock induces migrasomes?

      Alternatively, if the authors desire to maintain the emphasis on vaccines, can immunological mechanisms be somewhat expanded in order to - at least to some extent - explain why migrasomes are a better vaccine vehicle?

      One way or another, this reviewer is highly supportive of this study and it is really up to the authors and the editor to decide whether my comments are of use or not.

      My recommendation is to go ahead with publishing after some adjustments as per above.

      We’d like to thank the reviewer for the suggestion. We have changed the title of the manuscript and modified the abstract, emphasizing the fundamental science behind the development of eMigrasomes. To gain some immunological information on eMig-elicited antibody responses, we characterized the type of IgG induced by eM-OVA in mice, and compared it to that induced by Alum/OVA. The IgG response to Alum/OVA was dominated by IgG1. By contrast, eM-OVA induced an even distribution of IgG subtypes, including IgG1, IgG2b, IgG2c, and IgG3 (Figure 4i in the revised manuscript). The ratio between IgG1 and IgG2a/c indicates a Th1 or Th2 type humoral immune response. Thus, eM-OVA immunization induces a balance of Th1/Th2 immune responses.

      Reviewer #2 (Recommendations For The Authors):

      The study is a very nice exploration of a new vaccine platform. This reviewer believes that a more head-to-head comparison to the current SARS-CoV-2 vaccine platform would improve the manuscript. This comparison is done with OVA antigen, but this model antigen is not as exciting as a functional head-to-head with a SARS-CoV-2 vaccine.

      I think that two other discussion points should be included in the manuscript. First, was the host-cell protein evaluated? If not, I would include that point on how issues of host cell contamination of the migrasome could play a role in the responses and safety of a vaccine. Second, I would discuss antigen incorporation and localization into the platform. For example, the full-length spike being expressed has a native signal peptide and transmembrane domain. The authors point out that a transmembrane domain can be added to display an antigen that does not have one natively expressed, however, without a signal peptide this would not be secreted and localized properly. I would suggest adding a discussion of how a non-native signal peptide would be necessary in addition to a transmembrane domain.

      We thank the reviewer for these thoughtful suggestions and fully agree that the points raised are important for the translational development of eMig-based vaccines.

      (1) Host cell proteins and potential immunogenicity:

      We appreciate the reviewer’s suggestion to consider host cell protein contamination. Considering potential clinical application of eMigrasomes in the future, we will use human cells with low immunogenicity such as HEK-293 or embryonic stem cells (ESCs) to generate eMigrasomes. Also, we will follow a QC that meets the standard of validated EV-based vaccination techniques. 

      (2) Antigen incorporation and localization—signal peptide and transmembrane domain:

      We also agree with the reviewer’s point that proper surface display of antigens on eMigs requires both a transmembrane domain and a signal peptide for correct trafficking and membrane anchoring. For instance, in the case of full-length Spike protein, the native signal peptide and transmembrane domain ensure proper localization to the plasma membrane and subsequent incorporation into eMigs. In the case of OVA, a secretory protein that contains a native signal peptide yet lacks a transmembrane domain, an engineered transmembrane domain is required. For antigens that do not naturally contain these features, both a non-native signal peptide and an artificial transmembrane domain are necessary. We have clarified this point in the revised discussion and explicitly noted the requirement for a signal peptide when engineering antigens for surface display on migrasomes.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      (1) It might be good to further discuss potential molecular mechanisms for increasing the TF off rate (what happens at the mechanistic level). 

      This is now expanded in the Discussion.

      (2) To improve readability, it would be good to make consistent font sizes on all figures to make sure that the smallest font sizes are readable. 

      We have normalised figure text as much as is feasible.

      (3) upDARs and downDARs - these abbreviations are defined in the figure legend but not in the main text. 

      We have removed references to these terms from the text and included a definition in the figure legend. 

      (4) Figure 3B - the on-figure legend is a bit unclear; the text legend does not mention the meaning of "DEG". 

      We have removed this panel as it was confusing and did not demonstrate any robust conclusion. 

      (5) The values of apparent dissociation rates shown in Figure 5 are a bit different from values previously reported in literature (e.g., see Okamoto et al., 2023, PMC10505915). Perhaps the authors could comment on this. Also, it would be helpful to add the actual equation that was used for the curve fitting to determine these values to the Methods section.

      We have included an explanation of the curve fitting equation in the Methods as suggested.

      The apparent dissociation rate observed is a sum of multiple rates of decay: the true dissociation rate (k<sub>off</sub>), signal loss caused by photobleaching (k<sub>pb</sub>), and signal loss caused by defocusing/tracking error (k<sub>tl</sub>).

      k<sub>off</sub><sup>app</sup> = k<sub>off</sub> + k<sub>pb</sub> + k<sub>tl</sub>

      We are making conclusions about relative changes in k<sub>off</sub><sup>app</sup> upon CHD4 depletion, not about the absolute magnitude of true k<sub>off</sub> or TF residence times. Our conclusions extend to true k<sub>off</sub> on the assumption that k<sub>pb</sub> and k<sub>tl</sub> are equal across all samples imaged due to identical experimental conditions and analysis. k<sub>pb</sub> and k<sub>tl</sub> vary hugely across experimental set-ups, especially with different laser powers, so other k<sub>off</sub> or k<sub>off</sub><sup>app</sup> values reported in the literature would be expected to differ from ours. Time-lapse experiments or independent determination of k<sub>pb</sub> (and k<sub>tl</sub>) would be required to make any statements about absolute values of k<sub>off</sub>.
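
      The reasoning can be illustrated with a short numerical sketch. It assumes single-exponential decay of the surviving bound fraction, S(t) = exp(−k<sub>off</sub><sup>app</sup>·t), which may differ from the fitting equation given in the Methods. All rate values and function names are hypothetical and chosen only to show that, when k<sub>pb</sub> and k<sub>tl</sub> are shared between conditions, a difference in apparent rate equals the difference in true k<sub>off</sub>.

```python
import math


def apparent_rate(k_off, k_pb, k_tl):
    """Apparent dissociation rate as the sum of the three decay rates."""
    return k_off + k_pb + k_tl


def fit_apparent_rate(times, surviving_fraction):
    """Least-squares slope through the origin of -ln(S) against t.

    Assumes S(t) = exp(-k_app * t) with S > 0 at every time point.
    """
    numerator = sum(t * -math.log(s) for t, s in zip(times, surviving_fraction))
    denominator = sum(t * t for t in times)
    return numerator / denominator


# Hypothetical rates (per second); k_pb and k_tl are identical in both conditions.
K_PB, K_TL = 0.05, 0.02
times = [1.0, 2.0, 4.0, 8.0]
k_control = apparent_rate(0.10, K_PB, K_TL)
k_depleted = apparent_rate(0.06, K_PB, K_TL)
curve_control = [math.exp(-k_control * t) for t in times]
curve_depleted = [math.exp(-k_depleted * t) for t in times]

# The shared bleaching and tracking-loss terms cancel in the difference.
delta_apparent = fit_apparent_rate(times, curve_control) - fit_apparent_rate(times, curve_depleted)
print(delta_apparent)
```

      The absolute fitted values still contain k<sub>pb</sub> and k<sub>tl</sub>, which is why they cannot be compared directly with values obtained on a different imaging set-up.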

      (6) Regarding the discussion about the functionality of low-affinity sites/low accessibility regions, the authors may wish to mention the recent debates on this (https://www.nature.com/articles/s41586-025-08916-0; https://www.biorxiv.org/content/10.1101/2025.10.12.681120v1). 

      We have now included a discussion of this point and referenced both papers.

      (7) It may be worth expanding figure legends a bit, because the definitions of some of the terms mentioned on the figures are not very easy to find in the text. 

      We have endeavoured to define all relevant terms in the figure legends. 

      Reviewer #2 (Public review): 

      (1) Figure 2 shows heat maps of RNA-seq results following a time course of CHD4 depletion (0, 1, 2 hours...). Usually, the red/blue colour scale is used to visualise differential expression (fold-difference). Here, genes are coloured in red or blue even at the 0-hour time point. This confused me initially until I discovered that instead of fold-difference, a z-score is plotted. I do not quite understand what it means when a gene that is coloured blue at the 0-hour time point changes to red at a later time point. Does this always represent an upregulation? I think this figure requires a better explanation.

      The heatmap displays z-scores, meaning expression for each gene has been centred and scaled across the entire time course. As a result, time zero is not a true baseline; it simply shows whether the gene’s expression at that moment is above or below its own mean. A transition from blue to red therefore indicates that the gene increases relative to its overall average, which typically corresponds to upregulation, but it doesn’t directly represent fold-change from the 0-hour time point. We have now included a brief explanation of this in the figure legend to make this point clear.
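
      A minimal sketch of per-gene z-scoring makes the point concrete. The expression values below are invented for illustration, the function name is hypothetical, and the use of the sample standard deviation is an assumption about the scaling rather than a statement of what our plotting software does.

```python
import statistics


def zscore_across_timecourse(expression):
    """Centre and scale one gene's expression values across all time points."""
    mean = statistics.fmean(expression)
    sd = statistics.stdev(expression)  # sample standard deviation (n - 1)
    if sd == 0:
        return [0.0] * len(expression)
    return [(value - mean) / sd for value in expression]


# Hypothetical expression values at 0, 1, 2 and 4 hours of depletion.
upregulated = zscore_across_timecourse([10.0, 12.0, 20.0, 30.0])
downregulated = zscore_across_timecourse([100.0, 90.0, 60.0, 50.0])
print(upregulated)
print(downregulated)
```

      The upregulated gene has a negative z-score (blue) at 0 hours simply because its starting level is below its own time-course mean, and the downregulated gene starts positive (red) for the converse reason.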

      (2) Figure 5D: NANOG, SOX2 binding at the KLF4 locus. The authors state that the enhancers 68, 57, and 55 show a gain in NANOG and SOX2 enrichment "from 30 minutes of CHD4 depletion". This is not obvious to me from looking at the figure. I can see an increase in signal from "WT" (I am assuming this corresponds to the 0 hours time point) to "30m", but then the signals seem to go down again towards the 4h time point. Can this be quantified? Can the authors discuss why TF binding seems to increase only temporarily (if this is the case)? 

      We have edited the text to more accurately reflect what is going on in the screen shot. We have also replaced “WT” with “0” as this more accurately reflects the status of these cells. 

      (3) There is no real discussion of HOW CHD4/NuRD counteracts TF binding (i.e. by what molecular mechanism). I understand that the data does not really inform us on this. Still, I believe it would be worthwhile for the authors to discuss some ideas, e.g., local nucleosome sliding vs. a direct (ATP-dependent?) action on the TF itself. 

      We now include more speculation on this point in the Discussion.

      Reviewer #3 (Public review): 

      The main weakness can be summarised as relating to the fact that authors interpret all rapid changes following CHD4 degradation as being a direct effect of the loss of CHD4 activity. The possibility that rapid indirect effects arise does not appear to have been given sufficient consideration. This is especially pertinent where effects are reported at sites where CHD4 occupancy is initially low. 

      We acknowledge that we cannot definitively say any effect is a direct consequence of CHD4 depletion and have mitigated statements in the Results and Discussion. 

      Reviewing Editor Comments: 

      I am pleased to say all three experts had very complementary and complimentary comments on your paper - congratulations. Reviewer 3 does suggest toning down a few interpretations, which I suggest would help focus the manuscript on its greater strengths. I encourage a quick revision to this point, which will not go back to reviewers, before you request a version of record. I would also like to take this opportunity to thank all three reviewers for excellent feedback on this paper. 

      As advised we have mitigated the points raised by the reviewers. 

      Reviewer #2 (Recommendations for the authors): 

      p9, top: The sentence starting with "Genes increasing in expression after four hours...." is very difficult to understand and should be rephrased or broken up. 

      We agree. This has been completely re-written. 

      Reviewer #3 (Recommendations for the authors): 

      Sites of increased chromatin accessibility emerge more slowly than sites of lost chromatin accessibility. Figure 1D, a little increase in accessibility at 30min, but a more noticeable decrease at 30min. The sites of increased accessibility also have lower absolute accessibility than observed at locations where accessibility is lost. This raises the possibility that the sites of increased accessibility represent rapid but indirect changes occurring following loss of CHD4. Consistent with this, enrichment for CHD4 and MBD3 by CUT and TAG is far higher at sites of decreased accessibility. The low level of CHD4 occupancy observed at sites where accessibility increases may not be relevant to the reason these sites are affected. Such small enrichments can be observed when aligning to other genomic features. The authors interpret their findings as indicating that low occupancy of CHD4 exerts a long-lasting repressive effect at these locations. This is one possible explanation; however, an alternative is that these effects are indirect, perhaps driven by the very large increase in TF binding that is observed following CHD4 degradation and which appears to occur at many locations regardless of whether CHD4 is present.

      The reviewer is right to point out that we don’t know what is direct and what is indirect. All we know is that changes happen very rapidly upon CHD4 depletion. The changes in standard ATAC-seq signal appear greater at the sites showing decreased accessibility than those increasing, however the starting points are very different: a small increase from very low accessibility will likely be a higher fold change than a more visible decrease from very high accessibility (Fig. 1D). In contrast, Figure 6 shows a more visible increase in Tn5 integrations at sites increasing in accessibility at 30 minutes than the change in sites decreasing in accessibility at 30 minutes. We therefore disagree that the sites increasing in accessibility are more likely to be indirect targets. In further support of this, there is a rapid increase in MNase resistance at these sites upon MBD3 reintroduction (Fig. 6I), possibly indicating a direct impact of NuRD on these sites. 
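
      The point about starting levels can be shown with simple arithmetic. The signal values below are invented purely for illustration and are not taken from our ATAC-seq data; the function name and the pseudocount of 1 are likewise assumptions made for the sketch.

```python
import math


def log2_fold_change(before, after, pseudocount=1.0):
    """log2 ratio of signal after versus before, with a pseudocount so that
    sites starting at or near zero do not give infinite values."""
    return math.log2((after + pseudocount) / (before + pseudocount))


# Hypothetical site with very low starting accessibility that gains signal.
gained = log2_fold_change(before=2.0, after=8.0)
# Hypothetical site with very high starting accessibility that loses signal.
lost = log2_fold_change(before=200.0, after=120.0)
print(gained, lost)
```

      Here the gaining site changes by only 6 units of signal and the losing site by 80, yet the magnitude of the log2 fold change is larger at the gaining site, so the visually smaller change in a coverage plot is the larger relative change.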

      Substantial changes in Nanog and SOX2 binding are observed across the time course. These changes are very large, with 43k or 78k additional sites detected. How is this possible? Does the amount of these TF's present in cells change? The argument that transient occupancy of CHD4 acts to prevent TF's binding to what is likely to be many 100's of thousands of sites (if the data for Nanog and SOX2 are representative of other transcription factors such as KLF4) seems unlikely. 

      The large number of different sites identified as gaining TF binding is likely to be a reflection of the number of cells being analysed: within the 10<sup>5</sup>-10<sup>6</sup> cells used for a Cut&Run experiment we detect many sites gaining TF binding. In individual cells we agree it would be unlikely for that many sites to become bound at the same time. We detect no changes in the amounts of Nanog or Sox2 in our cells across the 4-hour CHD4 depletion time course. However, we maintain that low frequency interactions of CHD4 with a site can counteract low frequency TF binding and prevent it from stimulating opening of a cryptic enhancer.

      While increased TF binding is observed at sites of gained accessibility, the changes in TF occupancy at the lost sites do not progress continuously across the time course. In addition, the changes in occupancy are small in comparison to those observed at the gained sites. The text comments on an increase in SOX2 and Nanog occupancy at 30 min, but there is either no change or a loss by 4 hours. It's difficult to know what to conclude from this. 

      At sites losing accessibility the enrichment of both Nanog and Sox2 increases at 30 minutes. We suspect this is due to the loss of CHD4’s TF-removal activity. Thereafter the two TFs show different trends: Nanog enrichment then decreases again, probably due to the decrease in accessibility at these sites. Sox2, by contrast, does not change very much, possibly due to its higher pioneering ability. It is true that the amounts of change are very small here; however, Cut&Run was performed in triplicate and the summary graphs are plotted with standard error of the mean (which is often too small to see), demonstrating that the detected changes are highly significant. (We neglected to refer to the SEM in our figure legends: this has now been corrected.) At sites where CHD4 maintains chromatin compaction, the amount of transcription factor binding goes from zero or nearly zero to some finite number, hence the fold change is very large. In contrast, the changes at sites losing accessibility start from high enrichment, so fold changes are much smaller.

      Changes in the diffusive motion of tagged TFs are measured. The data is presented as an average of measurements of individual TFs. What might be anticipated is that subpopulations of TFs would exhibit distinct behaviours. At many locations, occupancy of these TFs is presumably unchanged. At 1 hour, many new sites are occupied, and this would represent a subpopulation with high residence. A small population of TFs would be subject to distinct effects at the sites where accessibility reduces at the one-hour time point. The analysis presented fails to distinguish populations of TFs exhibiting altered mobility consistent with the proportion of the TFs showing altered binding.

      We agree that there are likely subpopulations of TFs exhibiting distinct binding behaviours, and our modality of imaging captures this, but to distinguish subpopulations within this would require a lot more data.

      However, there is no reason to believe that TF binding at the sites newly occupied at 1 hr would differ in residence time from binding at sites already stably bound by TFs in the wildtype, i.e. that it would be subject to a different limit on residence time once bound. We do capture more stably bound trajectories per cell, but that is not what we are reporting on: we report the dissociation rate of TFs that have already bound stably at sites where TF occupancy is also detected by ChIP.

      The analysis of transcription shown in Figure 2 indicates that high-quality data has been obtained, showing progressive changes to transcription. The linkage of the differentially expressed genes to chromatin changes shown in Figure 3 is difficult to interpret. The curves showing the distance distribution for increased or decreased DARs are quite similar for up- and down-regulated genes. The frequency density for gained sites is slightly higher, but not as much higher as would be expected, given these sites are c. 6-fold more abundant than the sites with lost accessibility. The data presented do not provide a compelling link between the CHD4-induced chromatin changes and changes to transcription; the authors should consider revising to accommodate this. It is possible that much of the transcriptional response even at early time points is indirect. This is not unprecedented. For example, degradation of SOX2, a transcriptional activator, results in both repression and activation of similar numbers of genes https://pmc.ncbi.nlm.nih.gov/articles/PMC10577566/

      We agree that these figures do not provide a compelling link between the observed chromatin changes and gene expression changes. That 50K increased sites are, on average, located farther away from misregulated genes than are the 8K decreasing sites highlights that this is rarely going to be a case of direct derepression of a silenced gene, but rather distal sites could act as enhancers to spuriously activate transcription. This would certainly be a rare event, but could explain the low-level transcriptional noise seen in NuRD mutants. We have edited the wording to make this clearer.

      The model presented in Figure 7 includes distinct roles at sites that become more or less accessible following inactivation of CHD4. This is perplexing as it implies that the same enzymes perform opposing functions at some of the different sites where they are bound. 

      Our point is that it does the same thing at both kinds of sites, but the nature of the sites means that the consequences of CHD4 activity will be different. We have tried to make this clear in the text. 

      At active sites, it is clear that CHD4 is bound prior to activation of the degron and that chromatin accessibility is reduced following depletion. Changes in TF occupancy are complex, perhaps reflecting slow diffusion from less accessible chromatin and a global increase in the abundance of some pluripotency transcription factors such as SOX2 and Nanog that are competent for DNA binding. The link between sites of reduced accessibility and transcription is less clear. 

      At the inactive sites, the increase in accessibility could be driven by transcription factor binding. There is very little CHD4 present at these sites prior to activation of the degron, and TF binding may induce chromatin opening, which could be considered a rapid but indirect effect of the CHD4 degron. The link to transcription is not clear from the data presented, but it would be anticipated that in some cases it would drive activation. 

      We acknowledge these points and have indicated this possibility in the Results and the Discussion.

      No analysis is performed to identify binding sequences enriched at the locations of decreased accessibility. This could potentially define transcription factors involved in CHD4 recruitment or that cause CHD4 to function differently in different contexts.

      HOMER analyses failed to provide any unique insights. The sites going down are highly accessible in ES cells: they have TF binding sites that one would expect in ES cells. The increasing sites show an enrichment for G-rich sequences, which reflects the binding preference of CHD4.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study presents Altair-LSFM, a solid and well-documented implementation of a light-sheet fluorescence microscope (LSFM) designed for accessibility and cost reduction. While the approach offers strengths such as the use of custom-machined baseplates and detailed assembly instructions, its overall impact is limited by the lack of live-cell imaging capabilities and the absence of a clear, quantitative comparison to existing LSFM platforms. As such, although technically competent, the broader utility and uptake of this system by the community may be limited.

      We thank the editors and reviewers for their thoughtful evaluation of our work and for recognizing the technical strengths of the Altair-LSFM platform, including the custom-machined baseplates and detailed documentation provided to promote accessibility and reproducibility. Below, we provide point-by-point responses to each referee comment. In the process, we have significantly revised the manuscript to include live-cell imaging data and a quantitative evaluation of imaging speed. We now more explicitly describe the different variants of lattice light-sheet microscopy, highlighting differences in their illumination flexibility and image acquisition modes, and clarify how Altair-LSFM compares to each. We further discuss challenges associated with the 5 mm coverslip and propose practical strategies to overcome them. Additionally, we outline cost-reduction opportunities, explain the rationale behind key equipment selections, and provide guidance for implementing environmental control. Altogether, we believe these additions have strengthened the manuscript and clarified both the capabilities and limitations of Altair-LSFM.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      The article presents the details of the high-resolution light-sheet microscopy system developed by the group. In addition to presenting the technical details of the system, its resolution has been characterized and its functionality demonstrated by visualizing subcellular structures in a biological sample.

      Strengths: 

      (1) The article includes extensive supplementary material that complements the information in the main article.

      (2) However, in some sections, the information provided is somewhat superficial.

      We thank the reviewer for their thoughtful assessment and for recognizing the strengths of our manuscript, including the extensive supplementary material. Our goal was to make the supplemental content as comprehensive and useful as possible. In addition to the materials provided with the manuscript, our intention is for the online documentation (available at thedeanlab.github.io/altair) to serve as a living resource that evolves in response to user feedback. We would therefore greatly appreciate the reviewer’s guidance on which sections were perceived as superficial so that we can expand them to better support readers and builders of the system.

      Weaknesses:

      (1) Although a comparison is made with other light-sheet microscopy systems, the presented system does not represent a significant advance over existing systems. It uses high numerical aperture objectives and Gaussian beams, achieving resolution close to theoretical after deconvolution. The main advantage of the presented system is its ease of construction, thanks to the design of a perforated base plate.

      We appreciate the reviewer’s assessment and the opportunity to clarify our intent. Our primary goal was not to introduce new optical functionality beyond that of existing high-performance light-sheet systems, but rather to substantially reduce the barrier to entry for non-specialist laboratories. Many open-source implementations, such as OpenSPIM, OpenSPIN, and Benchtop mesoSPIM, similarly focused on accessibility and reproducibility rather than introducing new optical modalities, yet have had a measurable impact on the field by enabling broader community participation. Altair-LSFM follows this tradition, providing sub-cellular resolution performance comparable to advanced systems like LLSM, while emphasizing reproducibility, ease of construction through a precision-machined baseplate, and comprehensive documentation to facilitate dissemination and adoption.

      (2) Using similar objectives (Nikon 25x and Thorlabs 20x), the results obtained are similar to those of the LLSM system (using a Gaussian beam without laser modulation). However, the article does not mention the difficulties of mounting the sample in the implemented configuration.

      We appreciate the reviewer’s comment and agree that there are practical challenges associated with handling 5 mm diameter coverslips in this configuration. In the revised manuscript, we now explicitly describe these challenges and provide practical solutions. Specifically, we highlight the use of a custom-machined coverslip holder designed to simplify mounting and handling, and we direct readers to an alternative configuration using the Zeiss W Plan-Apochromat 20×/1.0 objective, which eliminates the need for small coverslips altogether.

      (3) The authors present a low-cost, open-source system. Although they provide open source code for the software (navigate), the use of proprietary electronics (ASI, NI, etc.) makes the system relatively expensive. Its low cost is not justified.

      We appreciate the reviewer’s perspective and understand the concern regarding the use of proprietary control hardware such as the ASI Tiger Controller and NI data acquisition cards. Our decision to use these components was intentional: relying on a unified, professionally supported and maintained platform minimizes complexity associated with sourcing, configuring, and integrating hardware from multiple vendors, thereby reducing non-financial barriers to entry for non-specialist users.

      Importantly, these components are not the primary cost driver of Altair-LSFM (they represent roughly 18% of the total system cost). Nonetheless, for individuals where the price is prohibitive, we also outline several viable cost-reduction options in the revised manuscript (e.g., substituting manual stages, omitting the filter wheel, or using industrial CMOS cameras), while discussing the trade-offs these substitutions introduce in performance and usability. These considerations are now summarized in Supplementary Note 1, which provides a transparent rationale for our design and cost decisions.

      Finally, we note that even with these professional-grade components, Altair-LSFM remains substantially less expensive than commercial systems offering comparable optical performance, such as LLSM implementations from Zeiss or 3i.

      (4) The fibroblast images provided are of exceptional quality. However, these are fixed samples. The system lacks the necessary elements for monitoring cells in vivo, such as temperature or pH control.

      We thank the reviewer for their positive comment regarding the quality of our data. As noted, the current manuscript focuses on validating the optical performance and resolution of the system using fixed specimens to ensure reproducibility and stability.

      We fully agree on the importance of environmental control for live-cell imaging. In the revised manuscript, we now describe in detail how temperature regulation can be achieved using a custom-designed heated sample chamber, accompanied by detailed assembly instructions on our GitHub repository and summarized in Supplementary Note 2. For pH stabilization in systems lacking a 5% CO₂ atmosphere, we recommend supplementing the imaging medium with 10–25 mM HEPES buffer. Additionally, we include new live-cell imaging data demonstrating that Altair-LSFM supports in vitro time-lapse imaging of dynamic cellular processes under controlled temperature conditions.

      Reviewer #2 (Public review): 

      Summary: 

      The authors present Altair-LSFM (Light Sheet Fluorescence Microscope), a high-resolution, open-source microscope, that is relatively easy to align and construct and achieves sub-cellular resolution. The authors developed this microscope to fill a perceived need that current open-source systems are primarily designed for large specimens and lack sub-cellular resolution or are difficult to construct and align, and are not stable. While commercial alternatives exist that offer sub-cellular resolution, they are expensive. The authors' manuscript centers around comparisons to the highly successful lattice light-sheet microscope, including the choice of detection and excitation objectives. The authors thus claim that there remains a critical need for high-resolution, economical, and easy-to-implement LSFM systems. 

      We thank the reviewer for their thoughtful summary. We agree that existing open-source systems primarily emphasize imaging of large specimens, whereas commercial systems that achieve sub-cellular resolution remain costly and complex. Our aim with Altair-LSFM was to bridge this gap—providing LLSM-level performance in a substantially more accessible and reproducible format. By combining high-NA optics with a precision-machined baseplate and open-source documentation, Altair offers a practical, high-resolution solution that can be readily adopted by non-specialist laboratories.

      Strengths: 

      The authors succeed in their goals of implementing a relatively low-cost (~ USD 150K) open-source microscope that is easy to align. The ease of alignment rests on using custom-designed baseplates with dowel pins for precise positioning of optics based on computer analysis of opto-mechanical tolerances, as well as the optical path design. They simplify the excitation optics over Lattice light-sheet microscopes by using a Gaussian beam for illumination while maintaining lateral and axial resolutions of 235 and 350 nm across a 260-µm field of view after deconvolution. In doing so they rest on foundational principles of optical microscopy that what matters for lateral resolution is the numerical aperture of the detection objective and proper sampling of the image field onto the detector, and the axial resolution depends on the thickness of the light-sheet when it is thinner than the depth of field of the detection objective. This concept has unfortunately not been completely clear to users of high-resolution light-sheet microscopes and is thus a valuable demonstration. The microscope is controlled by an open-source software, Navigate, developed by the authors, and it is thus foreseeable that different versions of this system could be implemented depending on experimental needs while maintaining easy alignment and low cost. They demonstrate system performance successfully by characterizing their sheet, point-spread function, and visualization of sub-cellular structures in mammalian cells, including microtubules, actin filaments, nuclei, and the Golgi apparatus.
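
      The resolution principles the reviewer summarizes can be sketched numerically. This is a simplified illustration, not the authors' characterization: the wavelength, numerical aperture, sheet thickness, and depth of field used below are assumed example values, the Abbe formula is one common convention for the lateral limit, and taking the minimum of sheet thickness and depth of field is a deliberate simplification of the axial behaviour.

```python
def abbe_lateral_resolution_nm(wavelength_nm, na):
    """Abbe lateral diffraction limit: lambda / (2 * NA)."""
    return wavelength_nm / (2.0 * na)

def max_nyquist_pixel_nm(resolution_nm):
    """Largest sample-plane pixel size that still samples the given
    resolution at the Nyquist rate (two pixels per resolvable period)."""
    return resolution_nm / 2.0

def effective_axial_extent_nm(sheet_thickness_nm, depth_of_field_nm):
    """Simplified model: the axial extent is set by whichever is thinner,
    the light-sheet or the detection depth of field."""
    return min(sheet_thickness_nm, depth_of_field_nm)

# Assumed example values, for illustration only.
wavelength_nm = 520.0        # hypothetical emission wavelength
detection_na = 1.1           # hypothetical detection NA
sheet_thickness_nm = 400.0   # hypothetical light-sheet thickness
depth_of_field_nm = 1000.0   # hypothetical detection depth of field

lateral = abbe_lateral_resolution_nm(wavelength_nm, detection_na)
pixel = max_nyquist_pixel_nm(lateral)
axial = effective_axial_extent_nm(sheet_thickness_nm, depth_of_field_nm)

print(f"lateral limit: {lateral:.0f} nm, pixel size must be <= {pixel:.0f} nm")
print(f"axial extent set by: {axial:.0f} nm")
```

      The sketch makes the dependence explicit: lateral resolution is governed by the detection NA and adequate pixel sampling, while the light-sheet thickness only controls the axial figure when it is thinner than the detection depth of field.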

      We thank the reviewer for their thoughtful and generous assessment of our work. We are pleased that the manuscript’s emphasis on fundamental optical principles, design rationale, and practical implementation was clearly conveyed. We agree that Altair’s modular and accessible architecture provides a strong foundation for future variants tailored to specific experimental needs. To facilitate this, we have made all Zemax simulations, CAD files, and build documentation openly available on our GitHub repository, enabling users to adapt and extend the system for diverse imaging applications.

      Weaknesses:

      There is a fixation on comparison to the first-generation lattice light-sheet microscope, which has evolved significantly since then:

      (1) The authors claim that commercial lattice light-sheet microscopes (LLSM) are "complex, expensive, and alignment intensive". I believe this sentence applies to the open-source version of LLSM, which was made available for wide dissemination. Since then, a commercial solution has been provided by 3i, which is now being used in multiple cores and labs but does require routine alignments. However, Zeiss has also released a commercial turn-key system, which, while expensive, is stable, and the complexity does not interfere with the experience of the user. In general, though, statements on ease of use and stability might be considered anecdotal and may not belong in a scientific article when unreferenced or unsupported by data.

      We thank the reviewer for this thoughtful and constructive comment. We have revised the manuscript to more clearly distinguish between the original open-source implementation of LLSM and subsequent commercial versions by 3i and ZEISS. The revised Introduction and Discussion now explicitly note that while open-source and early implementations of LLSM can require expert alignment and maintenance, commercial systems—particularly the ZEISS Lattice Lightsheet 7—are designed for automated operation and stable, turn-key use, albeit at higher cost and with limited modifiability. We have also moderated earlier language regarding usability and stability to avoid anecdotal phrasing.

      We also now provide a more objective proxy for system complexity: the number of optical elements that require precise alignment during assembly and maintenance thereafter. The original open-source LLSM setup includes approximately 29 optical components that must each be carefully positioned laterally, angularly, and coaxially along the optical path. In contrast, the first-generation Altair-LSFM system contains only nine such elements. By this metric, Altair-LSFM is considerably simpler to assemble and align, supporting our overarching goal of making high-resolution light-sheet imaging more accessible to non-specialist laboratories.

      (2) One of the major limitations of the first generation LLSM was the use of a 5 mm coverslip, which was a hindrance for many users. However, the Zeiss system elegantly solves this problem, and so does Oblique Plane Microscopy (OPM), while the Altair-LSFM retains this feature, which may dissuade widespread adoption. This limitation and how it may be overcome in future iterations is not discussed.

      We thank the reviewer for this helpful comment. We agree that the use of 5 mm diameter coverslips, while enabling high-NA imaging in the current Altair-LSFM configuration, may pose a practical limitation for some users. We now discuss this more explicitly in the revised manuscript. Specifically, we note that replacing the detection objective provides a straightforward solution to this constraint. For example, as demonstrated by Moore et al. (Lab Chip, 2021), pairing the Zeiss W Plan-Apochromat 20×/1.0 detection objective with the Thorlabs TL20X-MPL illumination objective allows imaging beyond the physical surfaces of both objectives, eliminating the need for small-format coverslips. In the revised text, we propose this modification as an accessible path toward greater compatibility with conventional sample mounting formats. We also note in the Discussion that Oblique Plane Microscopy (OPM) inherently avoids such nonstandard mounting requirements and, owing to its single-objective architecture, is fully compatible with standard environmental chambers.

      (3) Further, on the point of sample flexibility, all generations of the LLSM, and by the nature of its design, the OPM, can accommodate live-cell imaging with temperature, gas, and humidity control. It is unclear how this would be implemented with the current sample chamber. This limitation would severely limit use cases for cell biologists, for which this microscope is designed. There is no discussion on this limitation or how it may be overcome in future iterations.

      We thank the reviewer for this important observation and agree that environmental control is critical for live-cell imaging applications. It is worth noting that the original open-source LLSM design, as well as the commercial version developed by 3i, provided temperature regulation but did not include integrated control of CO₂ or humidity. Despite this limitation, these systems have been widely adopted and have generated significant biological insights. We also acknowledge that both OPM and the ZEISS implementation of LLSM offer clear advantages in this respect, providing compatibility with standard commercial environmental chambers that support full regulation of temperature, CO₂, and humidity.

      In the revised manuscript, we expand our discussion of environmental control in Supplementary Note 2, where we describe the Altair-LSFM chamber design in more detail and discuss its current implementation of temperature regulation and HEPES-based pH stabilization. Additionally, the Discussion now explicitly notes that OPM avoids the challenges associated with non-standard sample mounting and is inherently compatible with conventional environmental enclosures.

      (4) The authors' comparison to LLSM is constrained to the "square" lattice, which, as they point out, is the most used optical lattice (though this also might be considered anecdotal). The LLSM original design, however, goes far beyond the square lattice, including hexagonal lattices, the ability to do structured illumination, and greater flexibility in general in terms of light-sheet tuning for different experimental needs, as well as not being limited to just sample scanning. Thus, the Altair-LSFM cannot compare to the original LLSM in terms of versatility, even if comparisons to the resolution provided by the square lattice are fair.

      We agree that the original LLSM design offers substantially greater flexibility than what is reflected in our initial comparison, including the ability to generate multiple lattice geometries (e.g., square and hexagonal), operate in structured illumination mode, and acquire volumes using both sample- and light-sheet-scanning strategies. To address this, we now include Supplementary Note 3 that provides a detailed overview of the illumination modes and imaging flexibility afforded by the original LLSM implementation, and how these capabilities compare to both the commercial ZEISS Lattice Lightsheet 7 and our Altair-LSFM system. In addition, we have revised the discussion to explicitly acknowledge that the original LLSM could operate in alternative scan strategies beyond sample scanning, providing greater context for readers and ensuring a more balanced comparison.

      (5) There is no demonstration of the system's live-imaging capabilities or temporal resolution, which is the main advantage of existing light-sheet systems.

      In the revised manuscript, we now include a demonstration of live-cell imaging to directly validate Altair-LSFM’s suitability for dynamic biological applications. We also explicitly discuss the temporal resolution of the system in the main text (see Optoelectronic Design of Altair-LSFM), where we detail both software- and hardware-related limitations. Specifically, we evaluate the maximum imaging speed achievable with Altair-LSFM in conjunction with our open-source control software, navigate.

      For simplicity and reduced optoelectronic complexity, the current implementation powers the piezo through the ASI Tiger Controller, which modestly reduces its bandwidth. Nonetheless, for a 100 µm stroke typical of light-sheet imaging, we achieved sufficient performance to support volumetric imaging at most biologically relevant timescales. These results, along with additional discussion of the design trade-offs and performance considerations, are now included in the revised manuscript and expanded upon in the supplementary material.
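
      As a rough illustration of how a scan stroke translates into volumetric imaging rate, the following back-of-the-envelope sketch may help. It is not the authors' measurement: only the 100 µm stroke comes from the response above, while the z-step, exposure time, and per-plane overhead are hypothetical values, and the function names are invented for this example.

```python
import math

def planes_per_volume(stroke_um, z_step_um):
    """Number of image planes needed to cover the stroke, including both ends."""
    return math.ceil(stroke_um / z_step_um) + 1

def volume_time_s(n_planes, exposure_ms, overhead_ms=0.0):
    """Time per volume if every plane costs one exposure plus a fixed
    per-plane overhead (e.g. piezo settling or camera readout)."""
    return n_planes * (exposure_ms + overhead_ms) / 1000.0

stroke_um = 100.0     # stroke quoted in the author response
z_step_um = 0.5       # hypothetical axial step size
exposure_ms = 10.0    # hypothetical camera exposure per plane
overhead_ms = 2.0     # hypothetical per-plane overhead

n_planes = planes_per_volume(stroke_um, z_step_um)
t_volume = volume_time_s(n_planes, exposure_ms, overhead_ms)

print(f"{n_planes} planes per volume, {t_volume:.2f} s per volume, "
      f"{1.0 / t_volume:.2f} volumes per second")
```

      Under this simple model, any reduction in piezo bandwidth appears as a larger per-plane overhead, which lengthens the volume time in proportion to the number of planes.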

      While the microscope is well designed and completely open source, it will require experience with optics, electronics, and microscopy to implement and align properly. Experience with custom machining or soliciting a machine shop is also necessary. Thus, in my opinion, it is unlikely to be implemented by a lab that has zero prior experience with custom optics and cannot hire someone who has it. Altair-LSFM may not be as easily adaptable or implementable as the authors describe or perceive in any lab that is interested, even if they can afford it. The authors indicate they will offer "workshops," but this does not necessarily remove the barrier to entry or lower it, perhaps as significantly as the authors describe.

      We appreciate the reviewer’s perspective and agree that building any high-performance custom microscope—Altair-LSFM included—requires a basic understanding of (or willingness to learn) optics, electronics, and instrumentation. Such a barrier exists for all open-source microscopes, and our goal is not to eliminate this requirement entirely but to substantially reduce the technical and logistical challenges that typically accompany the construction of custom light-sheet systems.

      Importantly, no machining experience or in-house fabrication capabilities are required. Users can simply submit the provided CAD design files and specifications directly to commercial vendors for fabrication. We have made this process as straightforward as possible by supplying detailed build instructions, recommended materials, and vendor-ready files through our GitHub repository. Our dissemination strategy draws inspiration from other successful open-source projects such as mesoSPIM, which has seen widespread adoption—over 30 implementations worldwide—through a similar model of exhaustive documentation, open-source software, and community support via user meetings and workshops.

      We also recognize that documentation alone cannot fully replace hands-on experience. To further lower barriers to adoption, we are actively working with commercial vendors to streamline procurement and assembly, and Altair-LSFM is supported by a Biomedical Technology Development and Dissemination (BTDD) grant that provides resources for hosting workshops, offering real-time community support, and developing supplementary training materials.

      In the revised manuscript, we now expand the Discussion to explicitly acknowledge these implementation considerations and to outline our ongoing efforts to support a broad and diverse user base, ensuring that laboratories with varying levels of technical expertise can successfully adopt and maintain the Altair-LSFM platform.

      There is a claim that this design is easily adaptable. However, the requirement of custom-machined baseplates and in silico optimization of the optical path basically means that each new instrument is a new design, even if the Navigate software can be used. It is unclear how Altair-LSFM demonstrates a modular design that reduces times from conception to optimization compared to previous implementations.

      We thank the reviewer for this insightful comment and agree that our original language regarding adaptability may have overstated the degree to which Altair-LSFM can be modified without prior experience. It was not our intention to imply that the system can be easily redesigned by users with limited technical background. Meaningful adaptations of the optical or mechanical design do require expertise in optical layout, optomechanical design, and alignment.

      That said, for laboratories with such expertise, we aim to facilitate modifications by providing comprehensive resources—including detailed Zemax simulations, complete CAD models, and alignment documentation. These materials are intended to reduce the development burden for expert users seeking to tailor the system to specific experimental requirements, without necessitating a complete re-optimization of the optical path from first principles.

      In the revised manuscript, we clarify this point and temper our language regarding adaptability to better reflect the realistic scope of customization. Specifically, we now state in the Discussion: “For expert users who wish to tailor the instrument, we also provide all Zemax illumination-path simulations and CAD files, along with step-by-step optimization protocols, enabling modification and re-optimization of the optical system as needed.” This revision ensures that readers clearly understand that Altair-LSFM is designed for reproducibility and straightforward assembly in its default configuration, while still offering the flexibility for modification by experienced users.

      Reviewer #3 (Public review):

      Summary: 

      This manuscript introduces a high-resolution, open-source light-sheet fluorescence microscope optimized for sub-cellular imaging. The system is designed for ease of assembly and use, incorporating a custom-machined baseplate and in silico optimized optical paths to ensure robust alignment and performance. The authors demonstrate lateral and axial resolutions of ~235 nm and ~350 nm after deconvolution, enabling imaging of sub-diffraction structures in mammalian cells. The important feature of the microscope is the clever and elegant adaptation of simple Gaussian beams, smart beam shaping, galvo pivoting and high NA objectives to ensure a uniform thin light-sheet of around 400 nm in thickness, over a 266-micron-wide field of view, pushing the axial resolution of the system beyond the regular diffraction-limited tradeoffs of light-sheet fluorescence microscopy. Compelling validation using fluorescent beads and multicolor cellular imaging highlights the system's performance and accessibility. Moreover, a very extensive and comprehensive manual of operation is provided in the form of supplementary materials. This provides a DIY blueprint for researchers who want to implement such a system.

      We thank the reviewer for their thoughtful and positive assessment of our work. We appreciate their recognition of Altair-LSFM’s design and performance, including its ability to achieve high-resolution imaging throughout a 266-micron field of view. While Altair-LSFM approaches the practical limits of diffraction-limited performance, it does not exceed the fundamental diffraction limit; rather, it achieves near-theoretical resolution through careful optical optimization, beam shaping, and alignment. We are grateful for the reviewer’s acknowledgment of the accessibility and comprehensive documentation that make this system broadly implementable.

      Strengths:

      (1) Strong and accessible technical innovation: With an elegant combination of beam shaping and optical modelling, the authors provide a high-resolution light-sheet system that overcomes the classical light-sheet tradeoff limit of a thin light-sheet and a small field of view. In addition, the integration of in silico modelling with a custom-machined baseplate is very practical and allows for ease of alignment procedures. Combining these features with the solid and super-extensive guide provided in the supplementary information, this provides a protocol for replicating the microscope in any other lab.

      (2) Impeccable optical performance and ease of mounting of samples: The system takes advantage of the same sample-holding method seen already in other implementations, but reduces the optical complexity.

      At the same time, the authors claim to achieve similar lateral and axial resolution to lattice light-sheet microscopy (although without a direct comparison; see the "Weaknesses" section below). The optical characterization of the system is comprehensive and well-detailed. Additionally, the authors validate the system by imaging sub-cellular structures in mammalian cells.

      (3) Transparency and comprehensiveness of documentation and resources: A very detailed protocol provides detailed documentation about the setup, the optical modeling, and the total cost.

      We thank the reviewer for their thoughtful and encouraging comments. We are pleased that the technical innovation, optical performance, and accessibility of Altair-LSFM were recognized. Our goal from the outset was to develop a diffraction-limited, high-resolution light-sheet system that balances optical performance with reproducibility and ease of implementation. We are also pleased that the use of precision-machined baseplates was recognized as a practical and effective strategy for achieving performance while maintaining ease of assembly.

      Weaknesses: 

      (1) Limited quantitative comparisons: Although some qualitative comparison with previously published systems (diSPIM, lattice light-sheet) is provided throughout the manuscript, some side-by-side comparison would be of great benefit for the manuscript, even in the form of a theoretical simulation. While having a direct imaging comparison would be ideal, it's understandable that this goes beyond the interest of the paper; however, a table referencing image quality parameters (taken from the literature), such as signal-to-noise ratio, light-sheet thickness, and resolutions, would really enhance the features of the setup presented. Moreover, based also on the necessity for optical simplification, an additional comment on the importance/difference of dual objective/single objective light-sheet systems could really benefit the discussion.

      In the revised manuscript, we have significantly expanded our discussion of different light-sheet systems to provide clearer quantitative and conceptual context for Altair-LSFM. These comparisons are based on values reported in the literature, as we do not have access to many of these instruments (e.g., DaXi, diSPIM, or commercial and open-source variants of LLSM), and a direct experimental comparison is beyond the scope of this work.

      We note that while quantitative parameters such as signal-to-noise ratio are important, they are highly sample-dependent and strongly influenced by imaging conditions, including fluorophore brightness, camera characteristics, and filter bandpass selection. For this reason, we limited our comparison to more general image-quality metrics—such as light-sheet thickness, resolution, and field of view—that can be reliably compared across systems.

      Finally, per the reviewer’s recommendation, we have added additional discussion clarifying the differences between dual-objective and single-objective light-sheet architectures, outlining their respective strengths, limitations, and suitability for different experimental contexts.

      (2) Limitation to a fixed sample: In the manuscript, there is no mention of incubation temperature, CO₂ regulation, humidity control, or possible integration of commercial environmental control systems. This is a major limitation for an imaging technique that owes its popularity to fast, volumetric, live-cell imaging of biological samples.

      We fully agree that environmental control is critical for live-cell imaging applications. In the revised manuscript, we now describe the design and implementation of a temperature-regulated sample chamber in Supplementary Note 2, which maintains stable imaging conditions through the use of integrated heating elements and thermocouples. This approach enables precise temperature control while minimizing thermal gradients and optical drift. For pH stabilization, we recommend the use of 10–25 mM HEPES in place of CO₂ regulation, consistent with established practice for most light-sheet systems, including the initial variant of LLSM. Although full humidity and CO₂ control are not readily implemented in dual-objective configurations, we note that single-objective designs such as OPM are inherently compatible with commercial environmental chambers and avoid these constraints. Together, these additions clarify how environmental control can be achieved within Altair-LSFM and situate its capabilities within the broader LSFM design space.

      (3) System cost and data storage cost: While the system presented has the advantage of being open-source, it remains relatively expensive (considering the 150k without laser source and optical table, for example). The manuscript could benefit from a more direct comparison of the performance/cost ratio of existing systems, considering academic settings with budgets that most of the time would not allow for expensive architectures. Moreover, it would also be beneficial to discuss the adaptability of the system, in case a 30k objective could not be feasible. Will this system work with different optics (with the obvious limitations coming with the lower NA objective)? This could be an interesting point of discussion: the adaptability of the system in case of lower budgets or more cost-effective choices, depending on the needs.

      We agree that cost considerations are critical for adoption in academic environments. We would also like to clarify that the quoted $150k includes the optical table and laser source. In the revised manuscript, Supplementary Note 1 now includes an expanded discussion of cost–performance trade-offs and potential paths for cost reduction.

      Last, not much is said about the need for data storage. Light-sheet microscopy's bottleneck is the creation of increasingly large datasets, and it could be beneficial to discuss more about the storage needs and the quantity of data generated.

      In the revised manuscript, we now include Supplementary Note 4, which provides a high-level discussion of data storage needs, approximate costs, and practical strategies for managing large datasets generated by light-sheet microscopy. This section offers general guidance, including file-format recommendations and cost considerations, but we note that actual costs will vary by institution and contractual agreements.
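
To give a sense of scale for the storage discussion, the raw data rate of a camera can be estimated from its frame size, bit depth, and frame rate. The sketch below is illustrative only: the sensor size (2048 x 2048 pixels), 16-bit depth, frame rate, and session length are assumed placeholder values, not figures taken from the manuscript or from Supplementary Note 4.

```python
def acquisition_rate_mb_per_s(width_px, height_px, bytes_per_px, frames_per_s):
    """Raw (uncompressed) camera data rate in megabytes per second."""
    return width_px * height_px * bytes_per_px * frames_per_s / 1e6


def dataset_size_gb(rate_mb_per_s, duration_s):
    """Total raw data volume in gigabytes for a continuous acquisition."""
    return rate_mb_per_s * duration_s / 1e3


# Hypothetical acquisition settings (assumptions, not values from the manuscript).
rate = acquisition_rate_mb_per_s(width_px=2048, height_px=2048,
                                 bytes_per_px=2, frames_per_s=100)
one_hour = dataset_size_gb(rate, duration_s=3600)
print(f"{rate:.1f} MB/s, {one_hour:.0f} GB per hour of continuous imaging")
```

Real datasets are usually smaller than this upper bound because acquisitions are rarely continuous and chunked formats can be compressed.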

      Conclusion:

      Altair-LSFM represents a well-engineered and accessible light-sheet system that addresses a longstanding need for high-resolution, reproducible, and affordable sub-cellular light-sheet imaging. While some aspects (comparative benchmarking and validation, limitation to fixed samples) would benefit from further development, the manuscript makes a compelling case for Altair-LSFM as a valuable contribution to the open microscopy scientific community.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) A picture, or full CAD design of the complete instrument, should be included as a main figure.

      A complete CAD rendering of the microscope is now provided in Supplementary Figure 4.

      (2) There is no quantitative comparison of the effects of the tilting resonant galvo; only a cartoon is shown. A figure should be included.

      The cartoon was intended purely as an educational illustration to conceptually explain the role of the tilting resonant galvo in shaping and homogenizing the light sheet. To clarify this intent, we have revised both the figure legend and corresponding text in the main manuscript. For readers seeking quantitative comparisons, we now reference the original study that provides a detailed analysis of this optical approach, as well as a review on the subject.

      (3) Description of L4 is missing in the Figure 1 caption.

      Thank you for catching this omission. We have corrected it.

      (4) Please crop and enlarge the beam profiles in Figures 1c and 3a so that the profile can be appreciated. The PSFs in Figure 3c-e should similarly be enlarged and presented using a dynamic range/LUT such that any aberrations can be appreciated.

      In Figure 1c, our goal was to qualitatively illustrate the uniformity of the light-sheet across the full field of view, while Figure 1d provided the corresponding quantitative cross-section. To improve clarity, we have added an additional figure panel offering a higher-magnification, localized view of the light-sheet profile. For Figure 3c–e, we have enlarged the PSF images and adjusted the display range to better convey the underlying signal and allow subtle aberrations to be appreciated.

      (5) It is unclear why LLSM is being used as the gold standard, since in its current commercial form, available from Zeiss, it is a turn-key system designed for core facilities. The original LLSM is also a versatile instrument that provides much more than the square lattice for illumination, including structured illumination, hexagonal lattices, live-cell imaging, wide-field illumination, different scan modes, etc. These additional features are not even mentioned when compared to the Altair-LSFM. If a comparison is to be provided, it should be fair and balanced. Furthermore, as outlined in the public review, anecdotal statements on "most used", "difficult to align", or "unstable" should not be provided without data.

      In the revised manuscript, we have carefully removed anecdotal statements and, where appropriate, replaced them with quantitative or verifiable information. For instance, we now explicitly report that the square lattice was used in 16 of the 20 figure subpanels in the original LLSM publication, and we include a proxy for optical complexity based on the number of optical elements requiring alignment in each system.

      We also now clearly distinguish between the original LLSM design—which supports multiple illumination and scanning modes—and its subsequent commercial variants, including the ZEISS Lattice Lightsheet 7, which prioritizes stability and ease of use over configurational flexibility (see Supplementary Note 3).

      (6) The authors should recognize that implementing custom optics, no matter how well designed, is a big barrier to cross for most cell biology labs.

      We fully understand and now acknowledge in the main text that implementing custom optics can present a significant barrier, particularly for laboratories without prior experience in optical system assembly. However, similar challenges were encountered during the adoption of other open-source microscopy platforms, such as mesoSPIM and OpenSPIM, both of which have nonetheless achieved widespread implementation. Their success has largely been driven by exhaustive documentation, strong community support, and standardized design principles—approaches we have also prioritized in Altair-LSFM. We have therefore made all CAD files, alignment guides, and detailed build documentation publicly available and continue to develop instructional materials and community resources to further reduce the barrier to adoption.

      (7) Statements on "hands on workshops", though laudable, may not be appropriate to include in a scientific publication without some documentation of the influence they have had on implementing the microscope.

      We understand the concern. Our intention in mentioning hands-on workshops was to convey that the dissemination effort is supported by an NIH Biomedical Technology Development and Dissemination grant, which includes dedicated channels for outreach and community engagement. Nonetheless, we agree that such statements are not appropriate without formal documentation of their impact, and we have therefore removed this text from the revised manuscript.

      (8) It is claimed in the discussion that the microscope is "reliable", but no proof is given; long-term stability should be assessed and included.

      Although our experience with Altair-LSFM has been that it remains well aligned over time, especially in comparison to other light-sheet systems we have worked on over the last 11 years, we acknowledge that this assessment is anecdotal. As such, we have omitted this claim from the revised manuscript.

      (9) Due to the reliance on anecdotal statements and comparisons without proof to other systems, this paper at times reads like a brochure rather than a scientific publication. The authors should consider editing their manuscript accordingly to focus on the technical and quantifiable aspects of their work.

      We agree with the reviewer’s assessment and have revised the manuscript to remove anecdotal comparisons and subjective language. Where possible, we now provide quantitative metrics or verifiable data to support our statements.

      Reviewer #3 (Recommendations for the authors):

      Other minor points that could improve the manuscript (although some of these points are explained in the huge supplementary manual): 

      (1) The authors explain thoroughly their design, and they chose a sample-scanning method. I think that a brief discussion of the advantages and disadvantages of such a method over, for example, a laser-scanning system (with fixed sample) in the main text will be highly beneficial for the users.

      In the revised manuscript, we now include a brief discussion in the main text outlining the advantages and limitations of a sample-scanning approach relative to a light-sheet–scanning system. Specifically, we note that for thin, adherent specimens, sample scanning minimizes the optical path length through the sample, allowing the use of more tightly focused illumination beams that improve axial resolution. We also include a new supplementary figure illustrating how this configuration reduces the propagation length of the illumination light sheet, thereby enhancing axial resolution.

      (2) The authors justify selecting a 0.6 NA illumination objective over alternatives (e.g., Special Optics), but the manuscript would benefit from a more quantitative trade-off analysis (beam waist, working distance, sample compatibility) with other possibilities. Within the objective context, a comparison of the performances of this system with the new and upcoming single-objective light-sheet methods (and the ones based also on optical refocusing, e.g., DAXI) would be very interesting for the goodness of the manuscript.

      In the revised manuscript, we now provide a quantitative trade-off analysis of the illumination objectives in Supplementary Note 1, including comparisons of beam waist, working distance, and sample compatibility. This section also presents calculated point spread functions for both the 0.6 NA and 0.67 NA objectives, outlining the performance trade-offs that informed our design choice. In addition, Supplementary Note 3 now includes a broader comparison of Altair-LSFM with other light-sheet modalities, including diSPIM, ASLM, and OPM, to further contextualize the system’s capabilities within the evolving light-sheet microscopy landscape.

      (3) The modularity of the system is implied in the context of the manuscript, but not fully explained. The authors should specify more clearly, for example, if cameras could be easily changed, objectives could be easily swapped, light-sheet thickness could be tuned by changing cylindrical lens, how users might adapt the system for different samples (e.g., embryos, cleared tissue, live imaging), etc., and discuss eventual constraints or compatibility issues to these implementations.

      Altair-LSFM was explicitly designed and optimized for imaging live adherent cells, where sample scanning and short light-sheet propagation lengths provide optimal axial resolution (Supplementary Note 3). While the same platform could be used for superficial imaging in embryos, systems implementing multiview illumination and detection schemes are better suited for such specimens. Similarly, cleared tissue imaging typically requires specialized solvent-compatible objectives and approaches such as ASLM that maximize the field of view. We have now added text to the Design Principles section that explicitly states this.

      Altair-LSFM offers varying levels of modularity depending on the user’s level of expertise. For entry-level users, the illumination numerical aperture—and therefore the light-sheet thickness and propagation length—can be readily adjusted by tuning the rectangular aperture conjugate to the back pupil of the illumination objective, as described in the Design Principles section. For mid-level users, alternative configurations of Altair-LSFM, including different detection objectives, stages, filter wheels, or cameras, can be readily implemented (Supplementary Note 1). Importantly, navigate natively supports a broad range of hardware devices, and new components can be easily integrated through its modular interface. For expert users, all Zemax simulations, CAD models, and step-by-step optimization protocols are openly provided, enabling complete re-optimization of the optical design to meet specific experimental requirements.

      (4) Resolution measurements before and after deconvolution are central to the performance claim, but the deconvolution method (PetaKit5D) is only briefly mentioned in the main text, it's not referenced, and has to be clarified in more detail, coherently with the precision of the supplementary information. More specifically, PetaKit5D should be referenced in the main text, the details of the deconvolution parameters discussed in the Methods section, and the computational requirements should also be mentioned. 

      In the revised manuscript, we now provide a dedicated description of the deconvolution process in the Methods section, including the specific parameters and algorithms used. We have also explicitly referenced PetaKit5D in the main text to ensure proper attribution and clarity. Additionally, we note the computational requirements associated with this analysis in the same section for completeness.

      (5)  Image post-processing is not fully explained in the main text. Since the system is sample-scanning based, no word in the main text is spent on deskewing, which is an integral part of the post-processing to obtain a "straight" 3D stack. Since other systems implement such a post-processing algorithm (for example, single-objective architectures), it would be beneficial to have some discussion about this, and also a brief comparison to other systems in the main text in the methods section. 

      In the revised manuscript, we now explicitly describe both deskewing (shearing) and deconvolution procedures in the Alignment and Characterization section of the main text and direct readers to the Methods section. We also briefly explain why the data must be sheared to correct for the angled sample-scanning geometry for LLSM and Altair-LSFM, as well as for both sample-scanning and laser-scanning variants of OPMs.
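
As a rough illustration of what deskewing does, the sketch below shifts each plane of a stack laterally by an amount proportional to its index. It is a minimal nearest-pixel version with no interpolation and is not the authors' pipeline or the PetaKit5D implementation. The function names are hypothetical, and the shift-per-plane formula (scan step times the cosine of the scan angle, divided by the pixel size) is one commonly used convention that we assume here; the angle definition differs between instruments. The numbers in the usage line are placeholders.

```python
import math


def deskew_shift_px(step_um, angle_deg, pixel_um):
    """Lateral shift per plane in pixels (assumed convention: step * cos(angle) / pixel)."""
    return step_um * math.cos(math.radians(angle_deg)) / pixel_um


def deskew(stack, shift_px, fill=0):
    """Shear a stack indexed as stack[z][y][x] along x.

    Plane z is moved right by round(z * shift_px) pixels and every row is
    padded with `fill` so that all planes share the same output width.
    Assumes shift_px >= 0.
    """
    n_planes = len(stack)
    width = len(stack[0][0])
    max_shift = round(shift_px * (n_planes - 1))
    out = []
    for z, plane in enumerate(stack):
        offset = round(shift_px * z)
        right_pad = max_shift - offset
        out.append([[fill] * offset + list(row) + [fill] * right_pad
                    for row in plane])
        assert all(len(r) == width + max_shift for r in out[-1])
    return out


# Tiny synthetic stack: 2 planes, 1 row, 2 pixels per row.
sheared = deskew([[[1, 2]], [[3, 4]]], shift_px=1.0)
# Hypothetical acquisition parameters, for illustration only.
shift = deskew_shift_px(step_um=0.3, angle_deg=30.0, pixel_um=0.1)
```

Production tools typically interpolate sub-pixel shifts and may combine the shear with a rotation into coverslip coordinates.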

      (6) A brief discussion on comparative costs with other systems (LLSM, diSPIM, etc.) could be helpful for non-imaging expert researchers who could try to implement such an optical architecture in their lab.

      Unfortunately, the exact costs of commercial systems such as LLSM or diSPIM are typically not publicly available, as they depend on institutional agreements and vendor-specific quotations. Nonetheless, we now provide approximate cost estimates in Supplementary Note 1 to help readers and prospective users gauge the expected scale of investment relative to other advanced light-sheet microscopy systems.

      (7) The "navigate" control software is provided, but a brief discussion on its advantages compared to an already open-access system, such as Micro-Manager, could be useful for the users.

      In the revised manuscript, we now include Supplementary Note 5 that discusses the advantages and disadvantages of different open-source microscope control platforms, including navigate and Micro-Manager. In brief, navigate was designed to provide turnkey support for multiple light-sheet architectures, with pre-configured acquisition routines optimized for Altair-LSFM, integrated data management with support for multiple file formats (TIFF, HDF5, N5, and Zarr), and full interoperability with OME-compliant workflows. By contrast, while Micro-Manager offers a broader library of hardware drivers, it typically requires manual configuration and custom scripting for advanced light-sheet imaging workflows.

      (8) The cost and parts are well documented, but the time and expertise required are not crystal clear. Adding a simple time estimate (perhaps in the Supplement Section) of assembly/alignment/installation/validation and first imaging will be very beneficial for users. Also, what level of expertise is assumed (prior optics experience, for example) to be needed to install a system like this? This can help non-optics-expert users to better understand what kind of adventure they are putting themselves through.

      We thank the reviewer for this helpful suggestion. To address this, we have added Supplementary Table S5, which provides approximate time estimates for assembly, alignment, validation, and first imaging based on the user’s prior experience with optical systems. The table distinguishes between novice (no prior experience), moderate (some experience using but not assembling optical systems), and expert (experienced in building and aligning optical systems) users. This addition is intended to give prospective builders a realistic sense of the time commitment and level of expertise required to assemble and validate Altair-LSFM.

      Minor things in the main text:

      (1) Line 109: The cost is considered "excluding the laser source". But then in the table of costs, you mention L4cc as a "multicolor laser source", for 25 K. Can you explain this better? Are the costs correct with or without the laser source? 

      We acknowledge that the statement in line 109 was incorrect—the quoted ~$150k system cost does include the laser source (L4cc, listed at $25k in the cost table). We have corrected this in the revised manuscript.

      (2) Line 113: You say "lateral resolution", but then you state a 3D resolution (230 nm x 230 nm x 370 nm). This needs to be fixed.

      Thank you, we have corrected this.

      (3) Line 138: Is the light-sheet uniformity proven also with a fluorescent dye? This could be beneficial for the main text, showing the performance of the instrument in a fluorescent environment.

      The light-sheet profiles shown in the manuscript were acquired using fluorescein to visualize the beam. We have revised the main text and figure legends to clearly state this.

      (4) Line 149: This is one of the most important features of the system, defying the usual tradeoff between light-sheet thickness and field of view, with a regular Gaussian beam. I would clarify more specifically how you achieve this because this really is the most powerful takeaway of the paper.

      We thank the reviewer for this key observation. The ability of Altair-LSFM to maintain a thin light sheet across a large field of view arises from diffraction effects inherent to high NA illumination. Specifically, diffraction elongates the PSF along the beam’s propagation direction, effectively extending the region over which the light sheet remains sufficiently thin for high-resolution imaging. This phenomenon, which has been the subject of active discussion within the light-sheet microscopy community, allows Altair-LSFM to partially overcome the conventional trade-off between light-sheet thickness and propagation length. We now clarify this point in the main text and provide a more detailed discussion in Supplementary Note 3, which is explicitly referenced in the discussion of the revised manuscript.
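
For reference, the conventional trade-off mentioned above can be made concrete with the textbook paraxial Gaussian-beam relation, in which the Rayleigh length grows with the square of the beam waist, z_R = pi * n * w0^2 / lambda0. The sketch below is a baseline only: it does not model the high-NA diffraction effects that the response describes, the paraxial formula is not accurate for very tightly focused beams, and the wavelength, refractive index, and thicknesses used are assumed illustrative values rather than parameters from the manuscript.

```python
import math


def waist_from_fwhm(fwhm_um):
    """1/e^2 intensity radius w0 of a Gaussian beam, given its intensity FWHM."""
    return fwhm_um / math.sqrt(2.0 * math.log(2.0))


def rayleigh_length_um(w0_um, wavelength_um, refractive_index):
    """Paraxial Rayleigh length z_R = pi * n * w0^2 / lambda0 (vacuum wavelength)."""
    return math.pi * refractive_index * w0_um ** 2 / wavelength_um


def confocal_parameter_um(w0_um, wavelength_um, refractive_index):
    """Distance 2 * z_R over which the beam stays within sqrt(2) of its waist."""
    return 2.0 * rayleigh_length_um(w0_um, wavelength_um, refractive_index)


# Assumed values for illustration: 488 nm excitation in water (n = 1.33).
for fwhm_um in (0.4, 0.8, 1.6):
    w0 = waist_from_fwhm(fwhm_um)
    b = confocal_parameter_um(w0, wavelength_um=0.488, refractive_index=1.33)
    print(f"FWHM {fwhm_um:.1f} um -> paraxial confocal parameter {b:.2f} um")
```

Under this idealized model, doubling the sheet thickness quadruples the usable propagation length, which is the scaling that the trade-off refers to.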

      (5) Line 171: You talk about repeatable assembly...have you tried many different baseplates? Otherwise, this is a complicated statement, since this is a proof-of-concept paper. 

      We thank the reviewer for this comment. We have not yet validated the design across multiple independently assembled baseplates and therefore agree that our previous statement regarding repeatable assembly was premature. To avoid overstating the current level of validation, we have removed this statement from the revised manuscript.

      (6) Line 187: same as above. You mention "long-term stability". For how long did you try this? This should be specified in numbers (days, weeks, months, years?) Otherwise, it is a complicated statement to make, since this is a proof-of-concept paper.

      We also agree that referencing long-term stability without quantitative backing is inappropriate, and have removed this statement from the revised manuscript.

      (7) Line 198: "rapid z-stack acquisition". How rapid? Also, what is the limitation of the galvo-scanning in terms of the imaging speed of the system? This should be noted in the methods section.

      In the revised manuscript, we now clarify these points in the Optoelectronic Design section. Specifically, we explicitly note that the resonant galvo used for shadow reduction operates at 4 kHz, ensuring that it is not rate-limiting for any imaging mode. In the same section, we also evaluate the maximum acquisition speeds achievable using navigate and report the theoretical bandwidth of the sample-scanning piezo, which together define the practical limits of volumetric acquisition speed for Altair-LSFM.

      (8) Line 234: PetaKit5D is discussed in the additional documentation, but should be referenced here, as well.

      We now reference and cite PetaKit5D.

      (9) Line 256: "values are on par with LLSM", but no values are provided. Some details should also be provided in the main text.

      In the revised manuscript, we now provide the lateral and axial resolution values originally reported for LLSM in the main text to facilitate direct comparison with Altair-LSFM. Additionally, Supplementary Note 3 now includes an expanded discussion on the nuances of resolution measurement and reporting in light-sheet microscopy.

      Figures:

      (1) Figure 1 could be implemented with Figure 3. They're both discussing the validation of the system (theoretically and with simulations), and they could be together in different panels of the same figure. The experimental light-sheet seems to be shown in a transmission mode. Showing a pattern in a fluorescent dye could also be beneficial for the paper.

      In Figure 1, our goal was to guide readers through the design process—illustrating how the detection objective’s NA sets the system’s resolution, which defines the required pixel size for Nyquist sampling and, in turn, the field of view. We then use Figure 1b–c to show how the illumination beam was designed and simulated to achieve that field of view. In contrast, Figure 3 presents the experimental validation of the illumination system. To avoid confusion, we now clarify in the text that the light sheet shown in Figure 3 was visualized in a fluorescein solution and imaged in transmission mode. While we agree that Figures 1 and 3 both serve to validate the system, we prefer to keep them as separate figures to maintain focus within each panel. We believe this organization better supports the narrative structure and allows readers to digest the theoretical and experimental validations independently.
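
The design chain described above (detection NA sets resolution, resolution sets the Nyquist pixel size, and pixel size times sensor width sets the field of view) can be sketched in a few lines. This is an illustration under stated assumptions, not a reproduction of the manuscript's calculation: we use the Rayleigh criterion (0.61 * wavelength / NA) as the resolution estimate, and the NA, emission wavelength, and sensor width below are hypothetical placeholders.

```python
def lateral_resolution_um(wavelength_um, numerical_aperture):
    """Rayleigh-criterion estimate of lateral resolution (an assumed choice)."""
    return 0.61 * wavelength_um / numerical_aperture


def nyquist_pixel_um(resolution_um):
    """Largest sample-space pixel size that samples the resolution twice."""
    return resolution_um / 2.0


def field_of_view_um(pixel_um, n_pixels):
    """Sample-space field of view along one sensor axis."""
    return pixel_um * n_pixels


# Hypothetical inputs, chosen only to show how the quantities depend on each other.
res = lateral_resolution_um(wavelength_um=0.52, numerical_aperture=1.1)
pix = nyquist_pixel_um(res)
fov = field_of_view_um(pix, n_pixels=2048)
print(f"resolution {res:.3f} um, pixel {pix:.3f} um, field of view {fov:.0f} um")
```

Because the pixel size is tied to resolution, a higher detection NA yields a proportionally smaller field of view for the same sensor.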

      (2) Figure 3: Panels d and e show the same thing. Why would you expect that xz and yz profiles should be different? Is this due to the orientation of the objectives towards the sample?

      In Figure 3, we present the PSF from all three orthogonal views, as this provides the most transparent assessment of PSF quality—certain aberration modes can be obscured when only select perspectives are shown. In principle, the XZ and YZ projections should be equivalent in a well-aligned system. However, as seen in the XZ projection, a small degree of coma is present that is not evident in the YZ view. We now explicitly note this observation in the revised figure caption to clarify the difference between these panels.

      (3) Figure 4's single boxes lack a scale bar, and some of the Supplementary Figures (e.g. Figure 5) lack detailed axis labels or scale bars. Also, in the detailed documentation, some figures are referred to as Figure 5. Figure 7 or, for example, figure 6. Figure 8, and this makes the cross-references very complicated to follow

      In the revised manuscript, we have corrected these issues. All figures and supplementary figures now include appropriate scale bars, axis labels, and consistent formatting. We have also carefully reviewed and standardized all cross-references throughout the main text and supplementary documentation to ensure that figure numbering is accurate and easy to follow.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)): The key conclusions are solid. All the claims are supported by quality data. The content is rich, and no additional experiment is needed. The data and methods are properly presented for reproduction. The experiments are adequately replicated. One comment on statistical analysis is listed below.

      Summary: This manuscript investigates how Drosophila immune pathways contribute to defense against a range of filamentous fungi with distinct ecological strategies. The work provides novel insights into Toll pathway activation through pattern recognition receptors and danger signals, relative roles of melanization, phagocytosis, and effects of antimicrobial peptides, and particularly the immune evasion strategy of E. muscae via protoplast formation. These findings are of broad relevance to insect immunology, host-pathogen interactions, and evolutionary biology. The study is well designed, the experiments are carefully executed, and the manuscript is clearly written. It is novel to demonstrate that E. muscae evades immune recognition via protoplast formation. However, some aspects of clarity and discussion of limitations could be improved before publication.

      We thank the reviewer for the positive assessment of our manuscript.

      Major comments: 1) The Abstract is informative but a bit too long. Consider condensing some sentences and highlighting the novel contributions (e.g., role of protoplasts in immune evasion).

      Good points. We have reduced the abstract. The sentence is 'Our study also reveals that the fly-specific obligate fungus Entomophthora muscae employs a vegetative development strategy, protoplasts, to hide from the host immune response.'

      We believe that the role of protoplasts is already mentioned in the abstract.

      2) The Results may use more mechanistic links. For instance, the section on E. muscae immune evasion could more explicitly connect the morphological findings (protoplasts, lack of cell wall) with specific immune recognition failures.

      Our article compares Drosophila host defense against fungi with various lifestyles, which inevitably complicates the presentation of the results. We have made every effort to explain our data clearly. We believe that having two successive sections entitled 'Natural infection with E. muscae barely induces the Toll pathway' and 'Entomophthora muscae hides from the host immune response using a vegetative development strategy' conveys well the idea that E. muscae has a specific hiding strategy. We did not change this part.

      3) Please clarify statistical analyses used for survival data (e.g., log-rank tests, multiple testing corrections).

      We have clarified the statistical analysis in the Methods section. The sentence is 'Statistical significance of survival data was calculated with a log-rank test (Mantel-Cox test) comparing each genotype to w1118 flies'.
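For readers unfamiliar with the test named in this reply, the following is a minimal sketch of a two-group log-rank (Mantel-Cox) comparison, written in plain Python. It is not the authors' analysis code; the function name `logrank_test` and the toy survival times are hypothetical, and the sketch makes no attempt at the multiple-testing correction the reviewer also asked about.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank (Mantel-Cox) test; returns (chi2, p_value).

    times_*  : day of death or censoring for each fly (hypothetical data)
    events_* : 1 if the fly died at that time, 0 if it was censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # observed minus expected deaths in group A
    variance = 0.0       # hypergeometric variance summed over event times
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    chi2 = obs_minus_exp ** 2 / variance if variance > 0 else 0.0
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Toy example: a mutant genotype dying earlier than the control genotype.
mutant_days, control_days = [1, 2, 3, 4, 5], [6, 7, 8, 9, 10]
chi2, p = logrank_test(mutant_days, [1] * 5, control_days, [1] * 5)
```

When several genotypes are each compared with the same control, as in the quoted sentence, the function would simply be called once per genotype.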

      __Minor comments:__ Abstract: 1) "The infection outcome depends on the complex interplay between insect immune defenses and fungal adaptive strategies." could be simplified to: "Infection outcomes depend on the interplay between insect immunity and fungal adaptation." 2) Replace "our study uncovers" with "we show" for more concise phrasing. Reduce phrases like "our study reveals" or "we conclude" in other parts of the manuscript. Results: p. 5: phrase "survival upon natural infection... reveals the major contribution" → reword to avoid passive tone. p. 10: clarify "vesicles push the membrane outwards" with more precise terminology (e.g., budding, extrusion). Discussion: p. 20: streamline sentence beginning "These observations provide a mechanistic basis..." (currently too dense).

      We have taken all these comments into consideration. Note that, in the revised version, we removed the sentence "The infection outcome depends on the complex interplay between insect immune defenses and fungal adaptive strategies." To shorten the abstract, we have also removed the sentence 'These observations provide a mechanistic basis for future exploration.'

      **Referee cross-commenting**

      I agree with the comments of the other two reviewers.

      __Reviewer #1 (Significance (Required)):__

      This manuscript investigates how Drosophila immune pathways contribute to defense against a range of filamentous fungi with distinct ecological strategies (generalists, specialists, opportunists). By leveraging a comprehensive panel of genetically defined fly lines and standardized infections, the authors provide a demonstration that the Toll pathway is the predominant systemic antifungal defense, extending classical findings into a comparative framework across fungal lifestyles. The work provides novel insights into Toll pathway activation through GNBP3 and fungal proteases sensed by Psh, while also dissecting the relative contributions of melanization, phagocytosis, and antimicrobial peptides to host protection. Of particular note is the compelling demonstration that the fly specialist E. muscae can evade immune recognition through protoplast-like vegetative forms, minimizing cell-wall exposure and thereby escaping Toll activation.* *

      My expertise and limitations: Insect biochemistry and molecular biology, with particular focus on innate immunity, serine protease cascades, melanization, and host-pathogen interactions. I also have experience with genetic, biochemical, and functional approaches to dissecting immune signaling pathways in model insects. However, I do not have sufficient expertise to critically evaluate advanced statistical analyses.

      __Reviewer #2 (Evidence, reproducibility and clarity (Required)):__

      In this work the authors describe the contribution of distinct immune responses in Drosophila melanogaster to systemic and natural infections with 5 fungal species with different lifestyles some being generalists infecting a broad range of insects while others being more specialists or opportunistic. The authors used several well characterized Drosophila mutants of the Toll, Imd, phagocytosis and melanization responses to address this question. They show that Toll pathway is the key player in anti-fungal resistance in both natural and septic infections, whereas melanization plays a minor role mainly during natural infections possibly to limit fungal invasion through the cuticle. The authors show elegantly using different combinations of mutants for antimicrobial peptides genes with antifungal activities that Bomanins and Daisho (1 and 2) are the main Toll effectors mediating resistance to fungi but the authors did not find specific fungus-by-gene interaction, but rather antifungal peptides seem to act in a more general fashion against the fungi tested with significant redundancies between certain classes. Interestingly the authors show that while generalists like Beauveria and Metarhizium strongly activate the Toll pathway, the specialist E. muscae weakly activates the pathway and the opportunistic A. fumigatus does not activate the pathway, indicating that certain fungal species are able to evade sensing by immune pathways. In the context of the Toll activation, the sensor protease Psh and not GNBP3 seem to be the main trigger of the pathway.* *

      __Minor comments__ This is an interesting work that compares the contributions of different arms of the fly immune response to 5 fungal species with diverse lifestyles. The use of different lines with different combinations of mutant genes is a strength to highlight the relative contribution of each immune response. Some of the data obtained is intriguing and warrants more future investigations such as the distinct phenotypes of ModSp and GNBP3 mutants in E. muscae infections. The methodology is robust and the conclusions are supported with good experimental evidence. I do not see any major concerns with the work. I just have some minor comments listed below.

      We thank the reviewer for the positive comments on our manuscript.

      1- Statistical significance should be indicated on Figures 1 and 2, although it appears in the legend.

      We have added statistical significance on Figures 1 and 2.

      2- It is not very accurate to use the term resistance of the different mutants to infections with the diverse fungal species in Figures 1 and 2 especially that the authors have reported only survival data in these figures and have not measured fungal proliferation in infected flies (although they did that in later figures). It is more accurate to mention that the mutant flies have different levels of tolerance rather than resistance to fungal infections.

      We agree that we cannot use the term 'resistance' in Figures 1 and 2, since this term now has a more restricted meaning in the community. We have replaced the term 'resistance' with 'host defense' or 'surviving' throughout the text to avoid confusion, except where the fungal load was monitored.

      3- The authors show that Toll is over-activated in PPO1/PPO2 double mutant possibly through a negative feedback mechanism. However, there could be another explanation for this observation: For instance, the increased fungal proliferation in the PPO double mutant results in increased protease secretion by fungi enhancing Psh activation! Also, how can fungi manage to proliferate in this double mutant if Toll is overactivated? Could it be that Toll overactivation is triggering a fitness cost?

      The reviewer raises a good point. It is difficult to reconcile the susceptibility of PPO1/2 mutants to fungi with their higher Toll activation. The higher activation of Toll could be deleterious. We clearly observed higher Toll pathway activation in PPO1/2 flies upon clean injury (Fig. S9C) or injection of dead spores (data not shown). Thus, this higher expression cannot be explained solely as a consequence of higher fungal growth.

      4- In Lines 654-655, it is not accurate to say that E. muscae protoplasts are not detected by the immune response since E. muscae natural infections triggers Drs expression at 24 hpi and there is possibly some melanization taking place since PPO1 and PPO2 are required for defense against this fungus. A more accurate explanation is that this fungus is possibly more resistant to the effectors of the host immune response than the other fungi. I think a major point that the authors might have missed to consider in the discussion of their data is that the different fungi used herein may exhibit different levels of resilience to the effector reactions of the host such as AMPs and melanin deposition.

      The observation that injection of E. muscae protoplasts does not trigger an immune response above the level of clean injury is a strong argument that supports our view that E. muscae protoplasts are not immunogenic. The reviewer is correct in underlining the small but significant induction of Drs at 24 h post natural infection. We hypothesize that this could be due to mechanical injury associated with the entry of E. muscae. We have added a sentence to underline the possibility raised by the reviewer: 'Although we cannot rule out that the high pathogenicity of E. muscae may be partly due to the fungus's increased resilience, we favor the interpretation that it is instead mainly driven by its capacity to evade immune detection.'

      __Reviewer #2 (Significance (Required)):__

      Although the importance of Toll pathway and melanization in antifungal immunity is not new per se, this work adds to this knowledge by showing that Toll has the upper hand in anti-fungal immunity and that the strength of Toll pathway activation and its effector capacity may vary depending on the type of invading fungus. The work also highlights that certain fungi may employ a delayed switch to hyphal growth to reduce the presence of cell wall sugars as a mechanism to evade immune recognition. Overall, this work significantly adds to the knowledge of Drosophila immunity and raises some interesting questions related to the evolution of host-pathogen interactions and to the complex functions of serine protease cascades regulating Toll and melanization. This work will be of interest to a broad audience in the field of host-pathogen interactions *

      __Reviewer #3 (Evidence, reproducibility and clarity (Required)):__

      This is a clearly written manuscript on the immune effector mechanisms regulating Drosophila melanogaster host defense against a broad range of fungal pathogens, including entomopathogenic and saprophytic filamentous fungi. The authors systematically dissect the contribution of major arms of Drosophila immunity, including cellular and humoral responses and melanization and potential mechanisms of cross talk using genetic tools and reporter lines. They also go into detail to characterize the contribution of upstream activators of these responses by fungal PAMPs and the role of antimicrobial effectors (AMPs) in fly susceptibility. They conclude that phagocytosis has no important role in host defense. Instead, they find important contributions of the Toll pathway mainly through the detection of fungal proteases by Persephone rather than β-glucan detection by GNBP3. They also demonstrate that Toll activation is proportional to the virulence of the fungal pathogen, showing little activation of this response by Aspergillus fumigatus. Finally, they identify melanization as another line of host defense that restricts pathogen dissemination and protects the fly from invasive fungal disease. A very interesting part of this study is the identification of a virulence strategy of the obligate fungus Entomophthora muscae, which employs a vegetative development strategy, making protoplasts that mask immunostimulatory cell wall molecules and thereby avoid immune recognition by the Toll pathway until the very last stage of invasive growth. Overall, this is a very interesting study on host-pathogen interplay in Drosophila, shedding light onto novel pathogenetic mechanisms employed by entomopathogenic fungi to adapt to their hosts.

      We thank the reviewer for his positive assessment.

      __Major comments for the authors:__ 1. The use of reporter fungal strains to capture the dynamic interplay of the pathogen and the different arms of the immune system precludes firm conclusions on the contribution of various immune response to infection. This should be emphasized in the discussion.

      Unfortunately, we did not fully understand this point. Note that we monitored both survival and, when possible, fungal load (B. bassiana, E. muscae and M. anisopliae for Toll; B. bassiana and M. anisopliae for melanization), allowing us to state that Toll and melanization contribute to host defense by limiting fungal growth.

      2. The route of infection and the method employed to inject fungal spores has an impact on the effector pathways being activated. For example, pricking introduces spores less efficiently in the hemolymph compared to microinjection. The inoculum size in case of microinjection also has profound impact in understanding the role of cellular and humoral immunity during the infection course. For example, the lack of Toll activation in the natural infection with A. fumigatus does not mean that this pathway is not important in host defense against this pathogen.

      We fully agree and have sought to clarify the different outcomes of septic injury and natural infection. In the case of A. fumigatus, we confirm that Toll is important upon systemic infection but not natural infection because this fungus has a limited ability to penetrate insects by the natural route. We have clarified this in the text by adding the sentence: 'The low Toll pathway activation by A. fumigatus is likely due the weak ability of this fungus to penetrate insect by the natural route.'

      3. The use of total KO strains does not preclude the cross talk of cellular and humoral immunity and consequently potential defects in cellular immunity upon deletion of a master regulator of the Toll pathway or even its downstream effectors

      The observation that Toll-deficient mutants are almost as susceptible as mutant flies lacking all four immune modules (ΔITPM) to the five fungal pathogens points to a major role of this pathway. In a previous study (Ryckebusch et al., eLife 2025), we showed that the four immune pathways largely work independently, as phagocytosis was still observed in Toll-deficient mutants.

      4. Did the authors validate that NimC11; Eater1 flies are not able to phagocytose fungal spores?

      In the first version of this manuscript, we assumed but did not validate that NimC1;eater flies are also deficient in phagocytosis of fungal spores. To address the reviewer's comment, we have extended our study to better characterize the role of the cellular immune response to fungal infection (see new Figure S1).

      Our new results show that NimC1;eater deficient flies have a defect in binding to M. anisopliae GFP spores (new Supplementary Figure S1E,F). We did not see clear evidence of internalization. Thus, we conclude that the use of NimC1;eater flies is adequate to study the role of the cellular response. We have also monitored the survival of hemoless flies, which lack nearly all plasmatocytes due to the over-expression of the proapoptotic gene Bax, following natural infection and septic injury with B. bassiana and M. anisopliae. These new data (described in new Supplementary Figure S1A-D) show that hemoless flies display wild-type survival to B. bassiana and a mild susceptibility to M. anisopliae, consistent with our previous statement that the cellular response is less important than the humoral response. In the revised version, we have added these new data and nuanced our statement on the role of the cellular response to fungal infection.

      5. Is it possible that entomopathogenic fungi bypass phagocytosis as a virulence strategy by inducing large size germinating cells, which are not phagocytosed?

      Indeed, several studies have shown that entomopathogenic fungi have evolved sophisticated strategies to evade or survive phagocytosis.

      • Once fungal spores (conidia) germinate, penetrate the host tegument and reach the hemocoel, fungi exist within the hemocoel in the form of blastospores with thinner cell walls than conidia (M. anisopliae, M. rileyi, B. bassiana) or cell wall-free protoplasts (E. muscae). Wang and St Leger (2006) demonstrated that host hemocytes can recognize and ingest conidia of M. anisopliae, but that this capacity is lost upon production of blastospores, which avoid detection through the cell surface hydrophobic protein gene Mcl1, expressed within 20 min of the fungal pathogen contacting hemolymph.
      • Other studies have shown that blastospores of B. bassiana and M. anisopliae can be phagocytosed at the early stages of infection but manage to emerge from host cells and continue to propagate. Growing hyphal bodies can deform the plasmatocyte cell membrane (Gillespie et al., 2000; Hung and Boucias, 1992; Vilcinskas et al., 1997). Studies have also shown that the hemocyte count gradually decreases during infection of insects by entomopathogenic fungi. For instance, during the infection of Thitarodes xiaojinensis by Ophiocordyceps sinensis, blastospores are the initial cell type present in the host hemocoel and remain for 5 months or more before transforming into hyphae, which finally leads to host death; the increase in blastospore quantity coincides with a decline in hemocyte count (Liu et al., 2019; Li et al., 2020).

      In a new set of experiments, we tested the ability of plasmatocytes to phagocytose M. anisopliae-GFP spores. We observed that plasmatocytes bind to the spores, but we did not obtain clear evidence of internalization (new Figure S1E,F). However, this assay was not sufficient to conclusively determine whether plasmatocytes internalize M. anisopliae spores, as GFP fluorescence may be quenched in acidic intracellular compartments. Because entomopathogenic fungi can affect hemocyte abundance, we also monitored the expression level of Hml, a hemocyte-specific marker, in flies following natural infection with B. bassiana, M. anisopliae, M. rileyi, and E. muscae at 2, 3, and 5 days post-infection (see figure below). We did not observe a reduction in hemocyte levels for any of these fungi except M. anisopliae. This suggests that M. anisopliae may reduce hemocyte numbers as a strategy to circumvent the cellular immune response. These results, although promising, were not included in the revised version of the manuscript, as a thorough analysis of the cellular immune response would require a dedicated study of its own.

      Figure: Expression of Hml by RT-qPCR upon natural infection with entomopathogenic fungi (figure not included in the revised manuscript)

      6. Is it possible that fungal toxins kill phagocytes during germination?

      There is indeed evidence that the fungal toxins destruxins (DTXs) induce ultrastructural alterations of circulating plasmatocytes and sessile haemocytes of Galleria mellonella larvae. DTXs contribute to the fungal infection process through a true immune-inhibitory effect. This is evidenced by two key findings: first, the germination rate of injected Aspergillus niger spores was slightly but significantly enhanced; second, during incubation, the fungus demonstrated a greater ability to escape from the haemocyte-formed granuloma envelope (Vilcinskas et al., 1997; Vey et al., 2002). In Drosophila, however, Destruxin does not appear to affect cellular immune responses in vivo. Phagocytosis of E. coli bacterial particles in Destruxin-injected flies appeared to be the same as that seen in PBS-injected flies. The proliferation of bacteria in the Destruxin-injected flies was due to lower expression of antimicrobial peptide genes, suggesting that Destruxin A specifically suppresses the humoral immune response in Drosophila (Pal et al., 2007), which is consistent with the major role of antimicrobial peptides in survival to fungi. This point is now addressed in the discussion, in a new section on the cellular response to fungal infection.

      __Reviewer #3 (Significance (Required)):__

      This is an important work that provides new information on virulence mechanisms of entomopathogenic fungi and the host immune responses that mediate host protection. The authors should address my comments in the discussion and provide some additional evidence by using reporter fungal strains for hemocytes on whether these fungal pathogens completely bypass phagocytosis to invade the host. Therefore, rather than claiming that phagocytosis is not important it should be clarified whether phagocytes are directly involved in host defense or whether the fungus changes its cell wall surface to avoid this line of host defense. My expertise is on phagocyte biology and host-fungal interaction on human fungal pathogens.

      We have added more information showing that plasmatocytes of NimC1;eater larvae fail to bind to spores of M. anisopliae, suggesting that this line provides an appropriate tool to assess phagocytosis. We have also analyzed the survival of flies depleted of plasmatocytes via the over-expression of Bax, which revealed a mild role for plasmatocytes in defense against M. anisopliae but not B. bassiana. By performing additional experiments, we realized that analyzing the role of cellular immunity in host defense against these five fungi would require much more work and is beyond the scope of this study. We have, however, added a paragraph on the cellular response to the discussion in the revised version.

    1. Note that I may use homework as an example assignment in class. Write a note at the top of your assignment if there is a particular reason you would like an assignment not to be shared

      I like this because it can give students a very good outline for what an assignment should look like. I think this is especially good for an online class since we do not see our professor in person to ask questions.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Response to Reviewer 1:

      The authors introduce G2PT, a hierarchical graph transformer model that integrates genetic variants (SNPs), gene annotations, and multigenic systems (Gene Ontology) to predict and interpret complex traits.

      We thank the reviewer for this accurate summary of our approach and contributions.

      Major Comments:

      Comment 1-1. Insufficient Specification of Model Architecture: The description of the "hierarchical graph transformer" lacks technical depth. Key implementation details are missing: how node embeddings are initialized for SNPs, genes, and systems; how graph connectivity is defined at each level (e.g., adjacency matrices used in Equations 5-9, the sparsity); justification for the choice of embedding dimension and number of attention heads, including any sensitivity analysis; and the architecture of the feed-forward neural networks (e.g., number of layers, activation functions, and hidden dimensions).

      Reply 1-1. As requested, we have expanded the technical description of the model architecture, including the hierarchical graph transformer (HiGT), in the Materials and Methods section. Details regarding node initialization and hierarchical connectivity are now included in the new paragraph "Model Initialization and Graph Construction." Specifically, all node embeddings corresponding to SNPs, genes, and ontology-defined systems are initialized using uniform Xavier initialization (Glorot and Bengio, 2010).
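As an illustration of the initialization scheme named in this reply, the sketch below draws an embedding table from the Xavier (Glorot) uniform distribution, U(-a, a) with a = sqrt(6 / (fan_in + fan_out)). It is written in plain Python and is not taken from the G2PT code base; the function name `xavier_uniform`, the table sizes, and the choice to treat the two table dimensions as fan-in and fan-out are assumptions made for illustration.

```python
import math
import random

def xavier_uniform(n_rows, n_cols, rng):
    """Return an n_rows x n_cols table drawn from U(-a, a),
    where a = sqrt(6 / (n_rows + n_cols)) (Glorot and Bengio, 2010).

    Assumption: the two table dimensions play the roles of fan-in and
    fan-out; the reply does not say how G2PT maps them.
    """
    bound = math.sqrt(6.0 / (n_rows + n_cols))
    return [[rng.uniform(-bound, bound) for _ in range(n_cols)]
            for _ in range(n_rows)]

rng = random.Random(0)
# Hypothetical sizes: 10 nodes (SNPs, genes or systems), embedding dimension 4.
node_embeddings = xavier_uniform(10, 4, rng)
```

The bound shrinks as the table grows, which keeps the variance of the initial embeddings roughly comparable across node types of very different counts.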

      We have also clarified our hyperparameter optimization strategy. Learning rate, weight decay, hidden (embedding) dimension, and the number of attention heads were selected via grid search, as summarized in new Supplementary Fig. 8, reproduced below. Based on both performance and computational efficiency, we adopted four attention heads, consistent with the configuration commonly used in academic transformer models (Vaswani et al., 2017) (the original Transformer used eight).

      Regarding the feed-forward neural network, we follow the standard Transformer architecture consisting of two position-wise layers with a hidden dimension four times the node embedding size and a GELU nonlinear activation function (Hendrycks and Gimpel, 2016). This configuration is widely established in the literature and functions as an intermediate processing step following attention; therefore, it is not a focus of hyperparameter tuning. All corresponding updates have been incorporated into the revised Methods section for clarity and completeness.
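The feed-forward block described here can be sketched in a few lines of plain Python: two linear layers applied to each node embedding independently, with a hidden width four times the embedding size and a GELU activation in between. This is an illustrative sketch, not the G2PT implementation; the helper names (`gelu`, `linear`, `init_ffn`, `feed_forward`) and the small Gaussian weight initialization are assumptions.

```python
import math
import random

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vec, weights, bias):
    """weights[j] holds the input weights for output unit j."""
    return [sum(v * w for v, w in zip(vec, row)) + b
            for row, b in zip(weights, bias)]

def init_ffn(d_model, rng, expansion=4):
    """Hypothetical initializer: hidden width = expansion * d_model."""
    d_hidden = expansion * d_model
    w1 = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_hidden)]
    b1 = [0.0] * d_hidden
    w2 = [[rng.gauss(0.0, 0.02) for _ in range(d_hidden)] for _ in range(d_model)]
    b2 = [0.0] * d_model
    return w1, b1, w2, b2

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply GELU, project back to d_model."""
    hidden = [gelu(h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

rng = random.Random(0)
params = init_ffn(8, rng)                     # embedding size 8, hidden size 32
output = feed_forward([0.1] * 8, *params)     # same length as the input
```

Because the same weights are applied to every node embedding separately, the block adds no interaction between nodes; all mixing across SNPs, genes, and systems happens in the attention step that precedes it.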

      Comment 1-2. No Simulation Studies to Validate Epistasis Detection: The ground truth epistasis interaction should use the ones that have been manually validated by literature. The central claim of discovering epistatic interactions relies heavily on the model's attention mechanism and downstream statistical filtering. However, no simulation studies are presented to validate that G2PT can reliably detect epistasis when ground-truth interactions are known. Demonstrating robust detection of non-additive interactions under varying genetic architectures and noise levels in simulated genotype-phenotype datasets is essential to substantiate the method's core capability.

      Reply 1-2. We agree that a simulation of epistasis detection using the G2PT model is a worthy addition to the manuscript. Accordingly, we have now incorporated a new section in the Results titled "Validation of Epistasis through Simulation Studies", which includes two new figures reproduced below (Supplementary Fig. 6 and Fig. 5). We have also added a new Methods section to describe this simulation study under the heading "Epistasis Simulation". These simulation studies show that G2PT recovers epistatic gene pairs with high fidelity when these pairs are coherent with the systems ontology (cf. 'ontology coherence' in Supplementary Fig. 6, which reflects the probability that both SNPs are assigned to the same leaf system). Furthermore, G2PT outcompetes previous tools, such as PLINK-epistasis, which do not use knowledge of the systems hierarchy in the same way (Supplementary Fig. 6b-d). Using simulation parameters consistent with current genome-wide association studies (n = 400,000) and understanding of heritability (h² = 0.3 to 0.5) (Bloom et al. 2015; Speed and Evans 2023), we find that approximately 10% of all epistatic SNP pairs can be recovered at a precision of 50% (Fig. 5). We have provided the source code for this simulation study in our GitHub repository (https://github.com/idekerlab/G2PT/blob/master/Epistasis_simulation.ipynb).
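To make the quantities in this reply concrete, here is a toy sketch of how a phenotype with known epistatic pairs and a target heritability might be simulated, and how precision and recall of recovered pairs can be scored. It is not the authors' notebook: the function names, the additive-plus-product interaction model, the allele frequency, and the sample sizes are all assumptions chosen for brevity.

```python
import random
import statistics

def simulate_phenotypes(n_individuals, n_snps, epistatic_pairs, h2, rng, maf=0.3):
    """Simulate genotypes (0/1/2 allele counts) and phenotypes in which
    additive effects plus pairwise products explain a fraction h2 of variance."""
    genotypes = [[sum(rng.random() < maf for _ in range(2)) for _ in range(n_snps)]
                 for _ in range(n_individuals)]
    additive = [rng.gauss(0.0, 1.0) for _ in range(n_snps)]
    genetic = []
    for g in genotypes:
        value = sum(b * x for b, x in zip(additive, g))
        value += sum(g[i] * g[j] for i, j in epistatic_pairs)  # ground-truth epistasis
        genetic.append(value)
    var_g = statistics.pvariance(genetic)
    # Choose the noise variance so that var_g / (var_g + var_noise) = h2.
    noise_sd = (var_g * (1.0 - h2) / h2) ** 0.5
    phenotypes = [v + rng.gauss(0.0, noise_sd) for v in genetic]
    return genotypes, phenotypes

def precision_recall(predicted_pairs, true_pairs):
    """Score unordered SNP pairs against the simulated ground truth."""
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

rng = random.Random(1)
genotypes, phenotypes = simulate_phenotypes(50, 6, [(0, 1)], h2=0.4, rng=rng)
```

In these terms, the operating point quoted in the reply corresponds to a recall of about 0.1 at a precision of 0.5: half of the reported pairs are true, and they cover roughly a tenth of all simulated interactions.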

      Comment 1-3. Lack of Justification for Model Complexity and Missing Ablation Insights: While Supplementary Figure 2 presents ablation studies, the manuscript needs to justify the high computational cost (168 GPU hours using 4×A30 GPUs) of the full model. It remains unclear how much performance gain is specifically due to reverse propagation (Equations 8-9), which is claimed to capture biological context. The benefit of using a full Gene Ontology hierarchy versus a flat system list is not quantified. There is also no comparison between bidirectional versus unidirectional propagation. Overall, the added complexity is not empirically shown to be necessary

      Reply 1-3. We thank the reviewer for prompting a clearer justification of complexity and ablations. We have now revised the Results to (i) quantify the specific value of the ontology and reverse propagation, and (ii) explain why a flat SNP→system model is computationally and biologically sub-optimal. We have added new ablation results to compare bidirectional (forward+reverse) versus forward-only propagation. Reverse propagation has little effect when epistatic pairs are within one system (ontology coherence ρ=1.0) but substantially improves retrieval when interactions span related systems (e.g., ρ≈0.8) (Figure reproduced below). A flat design scores a dense genes×systems map, ignoring known sparsity (sparse SNP→gene assignments; sparse ontology edges) and losing multi-scale context; our hierarchical formulation restricts computation to observed edges (SNP→gene→system) and aggregates signals across levels, yielding better efficiency and biological fidelity.

      Comment 1-4. Non-Equivalent Benchmarking Against PRS Methods: Figure 2 compares G2PT to polygenic risk score (PRS) methods such as LDpred2 and Lassosum, but G2PT is run only on SNPs pre-filtered by marginal association (p-values between 10⁻⁵ and 10⁻⁸), while the PRS methods use genome-wide SNPs. This introduces a strong bias in G2PT's favor by effectively removing noise. A fair comparison would require: (a) running LDpred2 and Lassosum on the same pre-filtered SNP sets as G2PT, or (b) running G2PT on genome-wide or LD-pruned SNP sets. The reported superior performance of G2PT may be driven primarily by this input filtering, not the model architecture.

      Reply 1-4. We appreciate the reviewer's concern regarding benchmarking equivalence. In response, we have extended our analyses to include PRS-CS (Ge et al., 2019) and SBayesRC (Zheng et al., 2024), two state-of-the-art Bayesian shrinkage methods comparable to LDpred2 and Lassosum. Although we initially attempted to run LDpred2 and Lassosum under all SNP-filtering conditions, their computational requirements at UK Biobank scale proved prohibitively time consuming. We therefore focused on PRS-CS and SBayesRC, which offer similar modeling principles with greater computational tractability. These methods have now been run at matched SNP-filtering conditions to our original study. The new results demonstrate that G2PT consistently outperforms PRS-CS and SBayesRC (new Fig. 2, reproduced below), indicating that its performance advantage is not solely attributable to SNP pre-filtering but also to its hierarchical attention-based architecture.

      Comment 1-5: No Details on Hyperparameter Optimization: Although the manuscript mentions grid search for hyperparameter tuning, it provides no information about which parameters were optimized (e.g., learning rate, dropout rate, weight decay, attention dropout, FFNN dimensions), what search space was explored, or what final values were selected. There is also no assessment of how sensitive the model's performance is to these choices. Better transparency would help facilitate reproducibility

      Reply 1-5. We agree with the reviewer and have expanded the manuscript to include full details of hyperparameter optimization. As described in the revised Methods section, we performed a grid search over learning rate {10⁻³, 10⁻⁴, 10⁻⁵}, hidden dimension {64, 128}, and weight decay {0, 10⁻⁵, 10⁻³}. The results, summarized in Supplementary Fig. 8 (reproduced above), show that model performance is most sensitive to the learning rate, while hidden dimension and weight decay exert more moderate effects. Based on these findings, we selected a learning rate of 10⁻⁵, hidden dimension of 64, and weight decay of 10⁻³ for all subsequent experiments. Although a hidden dimension of 128 slightly improved performance, we adopted 64 to balance predictive accuracy with computational efficiency.
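The search space listed in this reply is small enough to enumerate exhaustively: 3 learning rates × 2 hidden dimensions × 3 weight decays gives 18 configurations. The sketch below shows that enumeration in Python. The grid values come from the reply; the names `SEARCH_SPACE`, `grid`, and `grid_search`, and in particular the `toy_score` function, are hypothetical stand-ins for a real train-and-validate run.

```python
import itertools

# Grid values as stated in the reply.
SEARCH_SPACE = {
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "hidden_dim": [64, 128],
    "weight_decay": [0.0, 1e-5, 1e-3],
}

def grid(space):
    """Expand a dict of value lists into a list of configuration dicts."""
    names = list(space)
    return [dict(zip(names, values))
            for values in itertools.product(*(space[n] for n in names))]

def grid_search(space, evaluate):
    """Score every configuration and return the best (config, score) pair."""
    scored = [(evaluate(cfg), cfg) for cfg in grid(space)]
    best_score, best_cfg = max(scored, key=lambda item: item[0])
    return best_cfg, best_score

def toy_score(cfg):
    """Placeholder only: a real run would train the model with cfg and
    return its validation performance."""
    return -abs(cfg["learning_rate"] - 1e-5) - 1e-6 * cfg["hidden_dim"]

configs = grid(SEARCH_SPACE)          # 18 configurations
best_cfg, best_score = grid_search(SEARCH_SPACE, toy_score)
```

Keeping every (score, configuration) pair rather than only the best one is what makes a sensitivity summary possible, since performance can then be grouped by each hyperparameter in turn.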

      Comment 1-6. Absence of Control for Key Confounders: In interpreting attention scores as reflecting genetic relevance (e.g., the role of the immunoglobulin system), the model includes only age, sex, and genetic principal components as covariates. Important confounders such as BMI, alcohol use, or medication (e.g., statins) have not been controlled for. Since TG/HDL levels are strongly influenced by environment and lifestyle, it is entirely plausible that some high-attention features reflect environmental tagging, not biological causality.

      Reply 1-6. In the current framework, we included age, sex, and genetic principal components to account for demographic and population-structure effects, focusing on genetic contributions within a controlled baseline. We acknowledge that non-genetic covariates can influence downstream biological states and may indirectly shape attention at the gene or system level. Accurately modeling such effects requires an extended framework where environmental variables directly modulate gene and system embeddings rather than being implicitly absorbed by the attention mechanism. We have clarified these limitations in the Discussion along with plans to incorporate explicit confounder modeling in future extensions of G2PT.

      Comment 1-7. Oversimplified Treatment of SNP-to-Gene Mapping: The SNP-to-gene mapping strategy combines cS2G, eQTL, and nearest-gene annotations, but the limitations of this approach are not adequately addressed. The manuscript does not specify how conflicts between methods are resolved or what fraction of SNPs map ambiguously to multiple genes. Supplementary Figure 2 shows model performance degrades when using only nearest-gene mapping, but there is no systematic analysis of how mapping uncertainties propagate through the hierarchy and affect attention or interpretation.

      Reply 1-7. In the revision (Results), we have clarified how conflicts between cS2G, eQTL, and nearest-gene annotations are resolved, and we have reported the proportion of SNPs that map to multiple genes across these three annotation approaches. We note that the hierarchical attention mechanism enables the model to prioritize among alternative gene mappings in a data-driven manner, and this is a major strength of the approach. As shown in Fig. 3 (Results, reproduced below), SNP-to-gene attention weights reveal dominant linkages, reducing the impact of mapping uncertainty on interpretation. We now explicitly describe this mechanism and acknowledge that further work in probabilistic mapping and fine-mapping approaches is a valuable future direction for improving resolution and interpretability.

      "For SNPs with several potential SNP-to-gene mappings (Methods), we found that G2PT often prioritized one of these genes in particular due to its membership in a high-attention system. For example, the chr11q23.3 locus contains multiple genes including the APOA1/C3/A4/A5 gene cluster (Fig. 3c) which is well-known to govern lipid transport, an important system for G2PT predictions (Fig. 3a). Due to high linkage disequilibrium in the region, all of its associated SNPs had multiple alternative gene mappings available. For example, SNP rs1145189 mapped not only to APOA5 but to the more proximal BUD13, a gene functioning in spliceosomal assembly (a system receiving substantially lower G2PT attention). Here, the relevant information flow learned by G2PT was from rs1145189 to APOA5 to lipid transport and protein-lipid complex remodeling (Fig. 3c; and conversely, deprioritizing BUD13 as an effector gene for TG/HDL). We found that this particular genetic flow was corroborated by exome sequencing, which implicates APOA5 but not BUD13 in regulation of TG/HDL, using data that were not available to G2PT. Similarly, two other SNPs at this locus - rs518547 and rs11216169 - had potential mappings to their closest gene SIK3, where they reside within an intron, but also to regulatory elements for the more distant lipid transport genes APOC3 and APOA4. Here, G2PT preferentially weighted the mappings to APOC3 and APOA4 rather than to SIK3 (Fig. 3c)."

      Comment 1-8. Naive Scoring of System Importance: The method used to quantify the biological relevance of systems (i.e., correlating attention scores with predicted phenotype values) risks circular reasoning. Since the model is trained to optimize prediction, systems that contribute strongly to prediction will naturally show high correlation, even if they are not biologically causal. No comparison is made with established gene set enrichment methods applied to GWAS summary statistics. The approach lacks an independent benchmark to validate that the "important" systems are biologically meaningful.

      Reply 1-8. As requested, we compared G2PT's system-level importance scores with results from MAGMA competitive gene-set analysis, an established enrichment approach. This analysis indeed shows significant correlation between the systems identified by the two approaches (ρ = 0.26, p .01; Supplementary Table 2), reflecting a shared emphasis on canonical lipid processes. We also observed systems detected by G2PT but not strongly detected by MAGMA's linear enrichment model, for example, the lipopolysaccharide-mediated signaling pathway (Kalita et al. 2022).
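
      A comparison of this kind can be sketched as a rank correlation between two per-system score lists. The reply does not state which correlation coefficient ρ denotes; the sketch below assumes a Spearman rank correlation, and the function names are illustrative rather than part of G2PT or MAGMA.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

      Here `x` would hold one importance score per system from the model and `y` the corresponding enrichment statistic for the same systems, in the same order.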

      Comment 1-9. No External Validation to Assess Generalizability. All evaluations are performed using cross-validation within the UK Biobank. There is no assessment of generalizability to independent cohorts or diverse ancestries. Given population structure, genotyping platform, and phenotype measurement variability, external validation is essential before claiming the method is suitable for broader use in polygenic risk assessment.

      Reply 1-9. External validation of the G2PT model requires individual-level genotype data with paired TG/HDL measurements, a sample size at the scale of the UK Biobank, and GPU access to these data. We therefore approached the All of Us (AoU) program, a large and diverse cohort with individual-level data and T2D conditions with HbA1C measurements. We first processed the All of Us genotype and phenotype data as we had processed the UKBB data (Methods), resulting in 41,849 participants with T2D and 80,491 without T2D across various ethnicities. We then transferred the trained T2D G2PT model to the AoU Workbench and evaluated its performance. The model demonstrated robust discriminative capability with an explained variance of 0.025, as shown in the new Fig. 2d (reproduced above).

      Comment 1-10. Computational Burden and Scalability Are Not Addressed: The paper notes that training the model requires 168 GPU hours on 4×A30 GPUs for just ~5,000 SNPs. However, there is no discussion of whether G2PT can scale to larger SNP sets (e.g., genome-wide imputed data) or more complex biological hierarchies (e.g., Reactome pathways). Without addressing scalability, the model's applicability to real-world, large-scale genomic datasets remains unclear.

      Reply 1-10. We have addressed scalability with both engineering optimizations and new scalability experiments. First, we refactored the model to use the xFormers memory-efficient attention for the hierarchical graph transformer (Lefaudeux et al., 2022), which also helps to fully parallelize training and reduces bottlenecks. Second, we added a scaling study with progressively increasing SNP count. On 4×A30 GPUs, end-to-end training time for the 5k-SNP setting decreased from 4000 to 400 min. (approximately 7 GPU-hours, ×10). These new results are given in Supplementary Fig. 7, reproduced below.

      Minor Comment:

      Comment 1-11. Attention Weights as Mechanistic Insight: The paper equates high attention scores with biological importance, for example in highlighting the immunoglobulin system. There is no causal validation showing that altering the highlighted SNPs, genes, or systems has an actual effect on TG/HDL. Attention weights in transformer models are known to sometimes reflect spurious correlations, especially in high-dimensional settings. The correlation between attention scores and predictions (Supplementary Fig. 3a,b) does not constitute biological evidence. The interpretability claims can be restated without supporting functional or causal validation.

      Reply 1-11. We thank the reviewer for this thoughtful comment. We agree that attention weights are not causal evidence. In the revision, we (1) reframe attention-based findings as hypothesis-generating rather than mechanistic, and (2) add an explicit limitation noting that correlations between attention scores and predictions do not constitute biological validation.

      Response to Reviewer 2:

      This manuscript describes the introduction of the Genotype-to-Phenotype Transformer (G2PT), described by the authors as "a framework for modeling hierarchical information flow among variants, genes, multigenic systems, and phenotypes." The authors used the ratio TG/HDL as a trait for proof of concept of this tool.

      This is a potentially interesting computational tool of interest to bioinformaticians, computational genomicists, and biologists.

      We thank the reviewer for their overall positive assessment of our study.

      Comment 2-1. The rationale for choosing the TG/HDL ratio for this proof of concept analysis is not well justified beyond it being a marker for insulin resistance. Overall the use of a ratio may be problematic (see below). Analyses of TG and HDL separately as individual quantitative traits would be of interest. And an analysis of a dichotomous clinical trait (T2DM or CAD) would also be of great interest.

      Reply 2-1. We thank the reviewer for this suggestion. In the revised manuscript, we have expanded our analyses beyond the TG/HDL ratio to include TG and HDL as individual quantitative traits (Fig. 2, reproduced below). These additional analyses demonstrate that G2PT captures predictive signals robustly across each lipid component, not solely through their ratio. Furthermore, to address the reviewer's interest in clinical outcomes, we incorporated an analysis of type 2 diabetes (T2D) as a dichotomous trait of direct clinical relevance. Collectively, these results strengthen the rationale for our chosen phenotype and show that the G2PT framework generalizes effectively across quantitative and binary traits, consistently outperforming advanced PRS and machine learning benchmarks.

      Comment 2-2. The approach to mapping SNPs to genes does not incorporate the most advanced approaches. This should be described in more detail.

      Reply 2-2. We agree that the choice of SNP-to-gene mapping materially affects both performance and interpretability; indeed, our epistasis simulations suggest that more accurate mappings can improve recovery and localization. In this proof-of-concept work we use a straightforward, modular mapping sufficient to demonstrate the modeling framework, and we have clarified this in the Methods. The architecture is designed so that alternative SNP-to-gene maps (e.g., eQTL/colocalization-based assignments, promoter-capture Hi-C) can be plugged in directly. A dedicated follow-up study will systematically compare these alternatives and quantify their impact on attribution and downstream discovery.

      Comment 2-3. The example of gene prioritization at the A1/C3/A4/A5 gene locus is not particularly illuminating, as the prioritized genes are already well-known to influence TG and HDL-C levels and the TG/HDL ratio. Can the authors provide an example where G2PT prioritized a gene at a locus that is not already a well-known regulator of TG and HDL metabolism?

      Reply 2-3. We thank the reviewer for this suggestion. We have revised the manuscript to de-emphasize the well-established APOA1 locus and instead highlight the less expected "Positive regulation of immunoglobulin production" system (Figure 3a,b, Discussion). Here our model prioritizes the gene TNFSF13 based on specific variants that have not previously been associated with TG or HDL (e.g., rs5030405, rs1858406, shown in blue). This finding points to an intriguing, non-canonical link between B-cell regulation and lipid metabolism. While full exploration of this finding is beyond the scope of the present methods paper, this example demonstrates G2PT's ability to identify novel, high-priority candidates in atypical systems.

      Comment 2-4. The identification of epistatic interactions is a potentially interesting application of G2PT. However, suppl table 1 shows a very limited number of such interactions with even fewer genes, and most of these are well established biological interactions (such as LPL/apoA5). The TGFB1 and FKBP1A interaction is interesting and should be discussed. What is needed for increasing the number of potential interactions, greater power?

      Reply 2-4. We are glad the reviewer appreciates the use of the G2PT model to identify epistatic interactions. We have now discussed a potential mechanism of epistasis between TGFB1 and FKBP1A in the protein dephosphorylation system (Discussion). In addition, we have addressed the reviewer's question about statistical power through extensive epistasis simulations (Fig. 5 and Supplementary Fig. 6), which show that G2PT's detection ability scales strongly with sample size: 1,000 samples are insufficient, performance improves at 5,000, and power becomes reliable at 100,000. Realistic simulations (Fig. 5b-d) further demonstrate that under biologically plausible architectures, G2PT can robustly recover specific interactions even within complex genetic backgrounds.

      Comment 2-5. Furthermore, the use of the TG/HDL ratio for the assessment of epistatic interactions may be problematic. For example, if one SNP affected only TG and the other only HDL-C, it would appear to be an epistatic interaction with regard to the ratio, although the biological epistasis may be limited to non-existent.

      Reply 2-5. We have greatly expanded the example phenotypes modeled in our study. Please see our Reply 2-1 above.
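
      The concern raised in Comment 2-5 can be illustrated with a small numerical sketch. It assumes two hypothetical binary variants, A shifting only TG and B shifting only HDL, with effect sizes invented purely for illustration. The 2×2 interaction contrast is zero for each trait on its own, non-zero for the raw ratio, and zero again for the log ratio, because log(TG/HDL) = log TG − log HDL separates into one term per variant.

```python
import math


def interaction_contrast(f):
    """2x2 interaction contrast f(1,1) - f(1,0) - f(0,1) + f(0,0).

    It is zero whenever f is additive in its two binary arguments.
    """
    return f(1, 1) - f(1, 0) - f(0, 1) + f(0, 0)


# Hypothetical effect sizes, for illustration only (not from the study).
def tg(a, b):
    return 1.5 + 0.5 * a  # variant A shifts TG; variant B has no effect


def hdl(a, b):
    return 1.0 + 0.25 * b  # variant B shifts HDL; variant A has no effect


def ratio(a, b):
    return tg(a, b) / hdl(a, b)


def log_ratio(a, b):
    return math.log(ratio(a, b))


contrasts = {
    "TG": interaction_contrast(tg),
    "HDL": interaction_contrast(hdl),
    "TG/HDL": interaction_contrast(ratio),
    "log(TG/HDL)": interaction_contrast(log_ratio),
}
```

      In this toy setting an apparent interaction on the ratio arises purely from the choice of scale, which is why analyzing TG and HDL as separate traits is informative.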

      Response to Reviewer 3:

      This manuscript by Lee et al provides a sensible and powerful approach to polygenic score prediction. The model aggregates information from SNPs to genes to systems, using a transformer based architecture, which appears to increase predictive performance, produce interpretable outputs of genes and systems that underlie risk, and identify candidates for epistasis tests.

      I think the manuscript is clear and well written, and conducted via state-of-the-art approaches. I don't have any concerns regarding the claims that are made.

      We thank the reviewer for their very positive assessment of our study.

      Major comments:

      Comment 3-1. Specifically, lipid based traits are perhaps the most well-powered and the most biologically coherent; they are also very well-studied biologically and thus overrepresented in the gene ontology. It is unclear whether this approach will work as well for a trait like Schizophrenia for which the underlying pathways are not as well captured in existing ontologies. The authors anticipate this in their limitations section, and I am not expecting them to solve every issue with this, but it would be nice to expand the testing a little bit beyond only this one trait.

      Reply 3-1. We appreciate the reviewer's suggestion to expand beyond a single lipid trait. In the revised manuscript, we have included analyses of additional phenotypes, including low-density lipoprotein (LDL) and T2D (Fig. 2). These additions demonstrate the broader applicability of our framework beyond a single trait class.

      Comment 3-2. It also seems like the authors have not compared their method to the truly latest PRS methods, such as PRS-CSx and SBayesR. I would suggest adding some of the methods shown to be the best from this recent paper: https://www.nature.com/articles/s41598-025-02903-1

      Reply 3-2. We agree these are important comparators. Accordingly, we have extended our comparison to include PRS‑CS (Ge et al., 2019) and SBayesRC (Zheng et al., 2024), following its strong performance demonstrated in recent benchmarking studies (see Figure 2 above). We confirmed that G2PT outperforms advanced PRS methods for all three phenotypes: TG/HDL ratio, LDL, and T2D.

      Comment 3-3. Another major comment regards whether this method could be applied to traits with just GWAS summary statistics, rather than individual level data. This would not enable identification of specific methods underlying an individual, but it could still learn SNP based weights that could be mapped to genes and systems that could help explain risk when the model is applied to individuals (kind of like a pretraining step?)

      Reply 3-3. We appreciate this suggestion. While SNP weights from GWAS summary statistics could, in principle, serve as informative priors for attention values, incorporating them would require a sophisticated mathematical formulation that is beyond the scope of this study. Our current framework also relies on individual-level genotype and phenotype data to capture multilevel information flow and individual-specific variation.

      Minor comments:

      Comment 3-4. Why the need to constrain to a small number of SNPs? Is it just computational cost? If so, what would happen as power increases and more SNPs exceed the thresholds used?

      Reply 3-4. Yes, the constraint reflects computational cost, but we have now modified the code for improved computational efficiency. First, we refactored the model to use the xFormers memory-efficient attention for the hierarchical graph transformer (Lefaudeux et al., 2022), which also helps to fully parallelize training and reduces bottleneck effects. Second, we added a scaling study of the impact of varying SNP count. On 4×A30 GPUs, end-to-end training time for the 5k-SNP setting decreased from 65 hours to 7 GPU-hours (×9). Based on Fig. 2 (reproduced above), we expect that performance may increase if more SNPs are provided to the model. With the optimized implementation, users can raise SNP thresholds as power increases; the expected behavior is improved accuracy up to a plateau, while hierarchical sparsity maintains training tractability and ensures well-regularized results.

      Comment 3-5. What type of sample size/power does this method require to work well? If others were to use it, how many SNPs/samples would be needed to obtain good performance?

      Reply 3-5. To address this comment, we quantified performance as a function of training size by subsampling the cohort and retraining G2PT with identical architecture and SNP set. New Supplementary Fig. 3 (reproduced below) shows monotonic gains with sample size across three representative phenotypes. We found that stable performance is reached by ~100k samples. These trends hold for continuous traits (TG/HDL, LDL) and more modestly for a binary trait (T2D), consistent with lower per-sample information for case-control settings.

    1. English language note: As you may notice here, ‘ethics’ is, by convention, a singular word. An ‘ethics’ is a way of describing how people think about something. There is also a word, ‘ethic’, but that has different usage. So for example, someone’s ‘work ethic’ is different from the ‘ethics of work’ to which they might subscribe. On a related note, some people will tell you that ‘data’ and ‘media’ are both plural. These words come from Latin, and those word forms are indeed plural in Latin! But we are using English, and conventions vary as to whether these terms should be treated as grammatically plural or singular. You will see variation in how people use these forms in your studies (and perhaps even in this book!), but it should not alarm you. The rule of thumb is to be consistent across a document or project in how you treat such things, so we have tried to be consistent in this book, with the exception of where we are quoting someone else’s words. TODO: decide whether we will treat media and data as plural or singular, and ensure compliance

      This note illustrates how conventions in language shape our perception of ethical concepts. In pointing out that “ethics” is, by convention, a singular noun, it signals that ethics is a system of thinking or a framework rather than a set of several distinct principles. Words like “data” and “media” show that language is a product of society and is not bound by its Latin roots; what matters is consistency within a given document rather than a single standard form. Likewise, ethics calls for understanding rather than simply applying a set of principles. In short, ethics is not simply a consideration of principles, but rather a consideration of language.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study presents valuable findings that advance our understanding of mural cell dynamics and vascular pathology in a zebrafish model of cerebral small vessel disease. The authors provide compelling evidence that partial loss of foxf2 function leads to progressive, cell-intrinsic defects in pericytes and associated endothelial abnormalities across the lifespan, leveraging powerful in vivo imaging and genetic tools. The strength of evidence could be further improved by additional mechanistic insight and quantitative or lineage-tracing analyses to clarify how pericyte number and identity are affected in the mutant model.

      Thank you to the reviewers for insightful comments and for the time spent reviewing the manuscript. We have strengthened the data through responding to the comments.

      Public Reviews:

      Reviewer #1 (Public review):

      The paper by Graff et al. investigates the function of foxf2 in zebrafish to understand the progression of cerebral small vessel disease. The authors use a partial loss of foxf2 (zebrafish possess two foxf2 genes, foxf2a and foxf2b, and the authors mainly analyze homozygous mutants in foxf2a) to investigate the role of foxf2 signaling in regulating pericyte biology. They find that the number of pericytes is reduced in foxf2a mutants and that the remaining pericytes display alterations in their morphologies. The authors further find that mutant animals can develop to adulthood, but that in adult animals, both endothelial and pericyte morphologies are affected. They also show that mutant pericytes can partially repopulate the brain after genetic ablation.

      (1) Weaknesses: The results are mainly descriptive, and it is not clear how they will advance the field at their current state, given that a publication on mice has already examined the loss of foxf2 phenotype on pericyte biology (Reyahi, 2015, Dev. Cell).

      The Reyahi paper was the earliest report of foxf2 mutant brain pericytes and remains illuminating. The work was technically very well executed. Our manuscript expands on and, at times, contradicts their findings. We realized that we had not fully addressed this in our Discussion, which has now been updated. The biggest difference between the two studies is in the direction of change in pericytes after foxf2 knockout, a major finding in both papers. This is where it is important to understand the differences in methods. Reyahi et al. used a conditional knockout under Wnt1:Cre, which will ablate foxf2 in pericytes derived from neural crest, but not in those derived from mesoderm, nor will it affect foxf2 expression in endothelial cells. Our model is a full constitutive knockout of the gene in all brain pericytes and endothelial cells. For gain of function (GOF), Reyahi et al. used a transgenic model with a human FOXF2 BAC integrated into the mouse germline.

      Both studies are important. We do not know enough about human phenotypes in patients with stroke-associated human FOXF2 SNVs to know the direction of change in pericyte numbers. We showed that the SNVs reduce FOXF2 gene expression in vitro (Ryu, 2022). Here we demonstrate dosage sensitivity in fish (showing phenotypes when 1 of 4 foxf2a + foxf2b alleles is lost, Figure 1F), supporting that slight reductions of FOXF2 in humans could lead to severe brain vessel phenotypes. For this reason, our work is complementary to the previously published work and suggests that future studies should focus on understanding the role of dosage, cell autonomy, and human pericyte phenotypes with respect to FOXF2. While some experiments are parallel in mouse and fish, we go further to look at cell death and regeneration, and to understand the consequences on the whole brain vasculature.

      (2) Reyahi et al. showed that loss of foxf2 in mice leads to a marked downregulation of pdgfrb expression in perivascular cells. In contrast to expectation, perivascular cell numbers were higher in mutant animals, but these cells did not differentiate properly. The authors use a transgenic driver line expressing gal4 under the control of the pdgfrb promoter and observe a reduction in pericyte (pdgfrb-expressing) cells in foxf2a mutants. In light of the mouse data, this result might be due to a similar downregulation of pdgfrb expression in fish, which would lead to a downregulation of gal4 expression and hence reduced labelling of pericytes. The authors show a reduction of pdgfrb expression also in zebrafish in foxf2b mutants (Chauhan et al., The Lancet Neurology 2016).

      Reyahi detected more pericytes in the Wnt1:Cre mouse, while we detected fewer in the foxf2a (and foxf2a;foxf2b) mutants. This may be because of different methods. For instance, because the mouse knockout is not a constitutive Foxf2 knockout, the observed increase in pericytes may be because mesodermal-derived pericytes proliferate more highly when the neural crest-derived pericytes are absent. Or does endothelial foxf2 activate pericyte proliferation when foxf2 is lost in some pericytes? It is also possible that mouse foxf2 has a different role from its fish ortholog. Despite these differences, there are common conclusions from both models. For instance, both mouse and fish show foxf2 controls capillary pericyte numbers, albeit in different directions. Both show hemorrhage and loss of vascular stability as a result. Both papers identify the developmental window as critical for setting up the correct numbers of pericytes.  

      As the reviewer suggested, it was important to test whether pdgfrb is downregulated in fish as it is in mice. To do this, we measured pdgfrb expression in foxf2 mutants using hybridization chain reaction (HCR). The results show no change in pdgfrb mRNA in foxf2a mutants in two independent experiments (Fig S3). Independently, we integrated pdgfrb transgene intensity (using a single allele of the transgene so there are no dose effects) in foxf2a mutants vs. wildtype. We found no difference (Fig S3), suggesting that the pdgfrb transgene is a reliable reporter for counting pericytes in the foxf2a knockout. The reviewer is correct that we previously showed downregulation of pdgfrb in foxf2b mutants at 4 dpf using colorimetric ISH. foxf2a and foxf2b are unlinked, independent genes (~400 million years apart in evolution) and may have different regulation.

      (3) It would be important to clarify whether, also in zebrafish, foxf2a/foxf2b mutants have reduced or augmented numbers of perivascular cells and how this compares to the data in the mouse.  

      We discuss methodological differences between Reyahi et al. and our work in point (1) above. The reduction in pericytes in foxf2a;foxf2b mutants has been previously published (Ryu, 2022, Supplemental Figure 1) and is shown again here (Supplemental Figure 2). Numbers are reduced in double mutants up to 10 dpf, suggesting no recovery. Further, in response to reviewer comments, we have quantified pericytes in the whole fish brain (Figure 3E-G) and show reduced pericytes in the adult, reduced vessel network length, and, importantly, reduced pericyte density. In aggregate, our data show pericyte reduction at 5 developmental stages from embryo through adult. The reason for the different results from the mouse is unknown and may reflect a technical difference (constitutive vs Wnt1:Cre) or a species difference.

      (4) The authors should perform additional characterization of perivascular cells using marker gene expression (for a list of markers, see e.g., Shih et al. Development 2021) and/or genetic lineage tracing.

      This is a good point. We have added HCR analysis of additional markers. Results show co-expression of foxf2a, foxf2b, nduf4la2 and pdgfrb in brain pericytes (Fig 2, Fig S3).

      (5) The authors motivate using foxf2a mutants as a model of reduced foxf2 dosage, "similar to human heterozygous loss of FOXF2". However, it is not clear how the different foxf2 genes in zebrafish interact with each other transcriptionally. Is there upregulation of foxf2b in foxf2a mutants and vice versa? This is important to consider, as Reyahi et al. showed that foxf2 gene dosage in mice appears to be important, with an increase in foxf2 gene dosage (through transgene expression) leading to a reduction in perivascular cell numbers.

      We agree that dosage is a very important concept and show phenotypes in foxf2a heterozygotes (Fig 1F). To test the potential compensation from foxf2b, we have added qPCR for foxf2b in foxf2a mutants as well as HCR of foxf2b in foxf2a mutants (Fig S3C,D). There is no change in foxf2b expression in foxf2a mutants. We discuss dosage in our discussion.

      (6) Figures 3 and 4 lack data quantification. The authors describe the existence of vascular defects in adult fish, but no quantifiable parameters or quantifications are provided. This needs to be added.

      This query was technically challenging to address, but very worthwhile. We have not seen published methods for quantifying brain pericytes along with the vascular network (certainly not in zebrafish adults), so we developed new methods of analyzing whole brain vascular parameters of cleared adult brains (Figure S6) using a combination of segmentation methods for pericytes, endothelium and smooth muscle. We have added another author (David Elliott), as he was instrumental in designing these methods. We find a significant decrease in vessel network length in foxf2a mutants at 3 months and 6 months (Figures 3F and 4G). Similarly, we show a lower number of brain pericytes in foxf2a mutants (Figure 3E). Finally, we added whole brain analysis of smooth muscle coverage (Figure 4) and show no change in vSMC number or coverage of vessels at 5 and 10 dpf or adult, respectively, pointing to pericytes being the cells most affected. Thank you, this query pushed us in a very productive direction. These methods will be extremely useful in the future!

      (7) The analysis of pericyte phenotypes and morphologies is not clear. On page 6, the authors state: "In the wildtype brain, adult pericytes have a clear oblong cell body with long, slender primary processes that extend from the cytoplasm with secondary processes that wrap around the circumference of the blood vessel." Further down on the same page, the authors note: "In wildtype adult brains, we identified three subtypes of pericytes, ensheathing, mesh and thin-strand, previously characterized in murine models." In conclusion, not all pericytes have long, slender primary processes, but there are at least three different sub-types? Did the authors analyze how they might be distributed along different branch orders of the vasculature, as they are in the mouse?

      We have reworded the text on page 5/6 to be clearer that embryonic pericytes are thin strand only. Additional pericyte subtypes develop later and are seen in the mature vasculature of the adult. We could not find a way to accurately analyze pericyte subtypes in the adult brain. The imaging analysis to count pericytes used the soma, as machine learning algorithms have been developed to count nuclei but not to analyze processes.

      (8) Which type of pericyte is affected in foxf2a mutant animals? Can the authors identify the branch order of the vasculature for both wildtype and mutant animals and compare which subtype of pericyte might be most affected? Are all subtypes of pericytes similarly affected in mutant animals? There also seems to be a reduction in smooth muscle cell coverage.

      Please see the response to (7) about pericyte subtypes. In response to the reviewer’s query, we have now analyzed vSMCs in the embryonic and adult brain. In the embryonic brain we see no statistical differences in vSMC number at 5 and 10 dpf (Figure 4). In the adult, vSMC length (total length of vSMCs in a brain) and vSMC coverage (proportion of brain vessels with vSMCs) are not significantly different. This data is important because it suggests that foxf2a has a more important role in pericytes than in vSMCs.
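
      The whole-brain measures referred to in these responses can be related by simple ratios, sketched below. The responses do not give the exact formulas, so the sketch assumes that pericyte density is the pericyte count per unit vessel network length and that vSMC coverage is the vSMC-covered vessel length divided by total vessel length. The function names, units, and example values are hypothetical.

```python
def pericyte_density(pericyte_count, vessel_length_mm):
    """Pericytes per mm of vessel network (assumed definition)."""
    if vessel_length_mm <= 0:
        raise ValueError("vessel length must be positive")
    return pericyte_count / vessel_length_mm


def vsmc_coverage(vsmc_covered_length_mm, vessel_length_mm):
    """Fraction of vessel network length covered by vSMCs (assumed definition)."""
    if vessel_length_mm <= 0:
        raise ValueError("vessel length must be positive")
    return vsmc_covered_length_mm / vessel_length_mm


# Hypothetical numbers, for illustration only.
wildtype_density = pericyte_density(1000, 500.0)
mutant_density = pericyte_density(600, 400.0)
```

      Reporting density alongside raw counts matters because a shorter vessel network alone would lower the pericyte count; a drop in density indicates that pericytes are lost out of proportion to the loss of vessel length.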

      (9) Regarding pericyte regeneration data (Figure 7): Are the values in Figure 7D not significantly different from each other (no significance given)?

      Comparisons shown without significance bars were not significant; the bars were left off for clarity. We have stated this in the statistical methods.

      (10) In the discussion, the authors state that "pericyte processes have not been studied in zebrafish".

      Ando et al. (Development 2016) studied pericyte processes in early zebrafish embryos, and Leonard et al. (Development 2022) studied zebrafish pericytes and their processes in the developing fin. We apologize; the sentence was not meant to say that pericyte processes had not been studied before, and we have reworded it to make its intent clear. We were trying to emphasize that we are the first to quantify processes at different stages, especially in foxf2 mutants. Processes change morphology over development, especially after 5 dpf, something that our data captures. Our images are of stages that have not been previously characterized. We added a reference to Mae et al., who found similar process length changes in a mouse knockout of a different gene, and to Leonard et al., who previously showed overlap of processes in a different context in fish.

      Reviewer #2 (Public review):

      Summary:

      This study investigates the developmental and lifelong consequences of reduced foxf2 dosage in zebrafish, a gene associated with human stroke risk and cerebral small vessel disease (CSVD). The authors show that a ~50% reduction in foxf2 function through homozygous loss of foxf2a leads to a significant decrease in brain pericyte number, along with striking abnormalities in pericyte morphology, including enlarged soma and extended processes, during larval stages. These defects are not corrected over time but instead persist and worsen with age, ultimately affecting the surrounding endothelium. The study also makes an important contribution by characterizing pericyte behavior in wild-type zebrafish using a clever pericyte-specific Brainbow approach, revealing novel interactions such as pericyte process overlap not previously reported in mammals.

      Strengths:

      This work provides mechanistic insight into how subtle, developmental changes in mural cell biology and coverage of the vasculature can drive long-term vascular pathology. The authors make strong use of zebrafish imaging tools, including longitudinal analysis in transgenic lines to follow pericyte number and morphology over larval development, and then applied tissue clearing and whole brain imaging at 3 and 11 months to further dissect the longitudinal effects of foxf2a loss. The ability to track individual pericytes in vivo reveals cell-intrinsic defects and process degeneration with high spatiotemporal resolution. Their use of a pericyte-specific Zebrabow line also allows, for the first time, detailed visualization of pericyte-pericyte interactions in the developing brain, highlighting structural features and behaviors that challenge existing models based on mouse studies. Together, these findings make the zebrafish a valuable model for studying the cellular dynamics of CSVD.

      Weaknesses:

      (11) While the findings are compelling, several aspects could be strengthened. First, quantifying pericyte coverage across distinct brain regions (forebrain, midbrain, hindbrain) would clarify whether foxf2a loss differentially impacts specific pericyte lineages, given known regional differences in developmental origin, with forebrain pericytes being neural crest-derived and hindbrain pericytes being mesoderm-derived.

      In recently published work from our lab, we reported that both neural crest and mesodermal cells contribute to pericytes in both the midbrain and hindbrain, and we could not confirm earlier work suggesting more rigid compartmental origins (Ahuja, 2024). In that paper we noted that lineage experiments are often limited by sample size (n), which may be why this had not been discovered before. This makes us skeptical that counting different regions would allow us to interpret data about neural crest and mesoderm. Further, Ahuja 2024 shows by single-cell sequencing that pericyte intermediate progenitors from both mesoderm and neural crest are indistinguishable at 30 hpf and have converged on a common phenotype.

      (12) Second, measuring foxf2b expression in foxf2a mutants would better support the interpretation that total FOXF2 dosage is reduced in a graded fashion in heterozygote and homozygote foxf2a mutants.

      We have done both qPCR for foxf2b in foxf2a mutants and HCR (quantitative ISH). This is now reported in Fig S3. 

      (13) Finally, quantifying vascular density in adult mutants would help determine whether observed endothelial changes are a downstream consequence of prolonged pericyte loss. Correlating these vascular changes with local pericyte depletion would also help clarify causality.

      We have added this data to Figures 3 and 4. Please also see response (6).

      Reviewer #3 (Public review):

      Summary:

      The goal of the work by Graff et al. is to model CSVD in the zebrafish using foxf2a mutants. The mutants show loss of cerebral pericyte coverage that persists through adulthood, but it seems foxf2a does not regulate the regenerative capacity of these cells. The findings are interesting and build on previous work from the group. Limitations of the work include little mechanistic insight into how foxf2a alters pericyte recruitment/differentiation/survival/proliferation in this context, and the overlap of these studies with previous work in foxf2a/b double mutants. However, the data analysis is clean and compelling, and the findings will contribute to the field.

      (14) Please make Figures 5C and 5E red-green colorblind friendly.

      Thank you. We have changed the colors to light blue and yellow to be colorblind friendly.

      Reviewer #3 (Recommendations for the authors):

      (15) I'm not sure this reviewer totally agrees with the assessment that foxf2a loss of function, while foxf2b remains normal, is the same as FOXF2 heterozygous loss of function in humans. The discussion of the gene dosage needs to be better framed, and the authors should carry out qPCR to show that foxf2b levels are not altered in the foxf2a mutant background.

      We have added data on foxf2b expression in foxf2a mutants to Fig S3. We have updated the results.

      (16) Figure 4/SF7- is the aneurysm phenotype derived from the ECs or pericytes? Cell-type-specific rescues would be interesting to determine if phenotypes are rescued, especially the developmental phenotypes (it is appreciated that carrying out rescue experiments until adulthood is complex). When is the earliest time point that aneurysm-like structures are seen?

      This is a fascinating question, especially as we show that endothelial cells (vessel network length) are affected in the adult mutants. The foxf2a mutants that we work with here are constitutive knockouts. While a strategy to rescue foxf2a in specific lineages is being developed in the laboratory, it will require a multi-generation breeding effort to get drivers, transgenes and mutants on the same background, and these fish are not currently available. Thank you for this comment; it is something we want to follow up on.

      (17) Figure 5 - This is very nice analysis.

      Thank you! We think it is informative too.

      (18) Figure 6 - needs to contain control images

      We have added wildtype images to figure 6A.

      (19) Figure 7- vessel images should be shown to demonstrate the specificity of NTR treatment to the pericytes.

      We have added the vessel images to Figure 7. We apologize for the omission.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      One possible remaining conceptual concern that might require future work is determining whether STN primarily mediates higher-level cognitive avoidance or if its activation primarily modulates motor tone.

      Our results using viral and electrolytic lesions (Fig. 11) and optogenetic inhibition of STN neurons (Fig. 10) show that signaled active avoidance is virtually abolished, and this effect is reproduced when we selectively inhibit STN fibers in the midbrain (Fig. 12). Inhibition of STN projections in either the substantia nigra pars reticulata (SNr) or the midbrain reticular tegmentum (mRt) eliminates cued avoidance responses while leaving escape responses intact. Importantly, mice continue to escape during US presentation after lesions or during photoinhibition, demonstrating that basic motor capabilities and the ability to generate rapid defensive actions are preserved.

      These findings argue against the idea that STN’s role in avoidance reflects a nonspecific suppression or facilitation of motor tone, even if the STN also contributes to general movement control. Instead, they show that STN output is required for generating “cognitively” guided cued actions that depend on interpreting sensory information and applying learned contingencies to decide when to act. Thus, while STN activity can modulate movement parameters, the loss-of-function results point to a more selective role in supporting cued, goal-directed avoidance behavior rather than a general adjustment of motor tone.

      Reviewer #2 (Public review):

      All previous weaknesses have been addressed. The authors should explain how inhibition of the STN impairing active avoidance is consistent with the STN encoding cautious action. If 'caution' is related to avoid latency, why does STN lesion or inhibition increase avoid latency, and therefore increase caution? Wouldn't the opposite be more consistent with the statement that the STN 'encodes cautious action'?

      The reviewer’s interpretation treats any increase in avoidance latency as evidence of “more caution,” but this holds only when animals are performing the avoidance behavior normally. In our intact animals, avoidance rates remain high across AA1 → AA2 → AA3, and the active avoidance trials (CS1) used to measure latency are identical across tasks (e.g., in AA2 the only change is that intertrial crossings are punished). Under these conditions, changes in latency genuinely reflect adjustments in caution, because the behavior itself is intact, actions remain tightly coupled to the cue, and the trials are identical.

      This logic does not apply when STN function is disrupted. STN inhibition or lesions reduce avoidance to near chance levels; the few crossings that do occur are poorly aligned to the CS and many likely reflect random movement rather than a cued avoidance response. Once performance collapses, latency can no longer be assumed to reflect the same cognitive process. Thus, interpreting longer latencies during STN inactivation as “more caution” would be erroneous, and we never make that claim.

      A simple analogy may help clarify this distinction. Consider a pedestrian deciding when to cross the street after a green light. If the road is deserted (like AA1), the person may step off the curb quickly. If the road is busy with many cars that could cause harm (like AA2), they may wait longer to ensure that all cars have stopped. This extra hesitation reflects caution, not an inability to cross. However, if the pedestrian is impaired (e.g., cannot clearly see the light, struggles to coordinate movements, or cannot reliably make decisions), a delayed crossing would not indicate greater caution—it would reflect a breakdown in the ability to perform the behavior itself. The same principle applies to our data: we interpret latency as “caution” only when animals are performing the active avoidance behavior normally, success rates remain high, and the trial rules are identical. Under STN inhibition or lesion, when active avoidance collapses, the latency of the few crossings that still occur can no longer be interpreted as reflecting caution. We have added these points to the Discussion.

      Reviewer #3 (Public review):

      Original Weaknesses:

      I found the experimental design and presentation convoluted and some of the results over-interpreted.

      We appreciate the reviewer’s comment, but the concern as stated is too general for us to address in a concrete way. The revised manuscript has been substantially reorganized, with simplified terminology, streamlined figures, and removal of an entire set of experiments to avoid over-interpretation. We are confident that the experimental design and results are now presented clearly and without extrapolation beyond the data. If there are specific points the reviewer finds convoluted or over-interpreted, we would be happy to address them directly.

      As presented, I don't understand this idea that delayed movement is necessarily indicative of cautious movements. Is the distribution of responses multi-modal in a way that might support this idea; or do the authors simply take a normal distribution and assert that the slower responses represent 'caution'? Even if responses are multi-modal and clearly distinguished by 'type', why should readers think this that delayed responses imply cautious responding instead of say: habituation or sensitization to cue/shock, variability in attention, motivation, or stress; or merely uncertainty which seems plausible given what I understand of the task design where the same mice are repeatedly tested in changing conditions. This relates to a major claim (i.e., in the title).

      We appreciate the reviewer’s question and address each component directly.

      (1) What we mean by “caution” and how it is operationalized

      In our study, caution is defined operationally as a systematic increase in avoidance latency when the behavioral demand becomes higher, while the trial structure and required response remain unchanged. Specifically, CS1 trials are identical in AA1, AA2, and AA3. Thus, when mice take longer to initiate the same action under more demanding contexts, the added time reflects additional evaluation before acting, consistent with long-established interpretations of latency shifts in cognitive psychology (see papers by Donders, Sternberg, Posner) and interpretations of deliberation time in the speed-accuracy tradeoff literature.

      (2) Why this interpretation does not rely on multi-modal response distributions

      We do not claim that “cautious” responses form a separate mode in the latency distribution. The distributions are unimodal, and caution is inferred from condition-dependent shifts in these distributions across identical trials, not from the existence of multiple peaks (see Zhou et al, 2022). Latency shifts across conditions with identical trial structure are widely used as behavioral indices of deliberation or caution.

      (3) Why alternative explanations (habituation/sensitization, motivation, attention, stress, uncertainty) do not account for these latency changes

      Importantly, nothing changes in CS1 trials between AA1 and AA2 with respect to the cue, shock, or required response. Therefore:

      - Habituation/sensitization to the cue or shock cannot explain the latency shift (the stimuli and trial type are unchanged). We have previously examined cue-evoked orienting responses and their habituation in detail (Zhou et al., 2023), and those measurements are dissociable from the latency effects described here.

      - Motivation or attention are unlikely to change selectively for identical CS1 trials when the task manipulation only adds a contingency to intertrial crossings.

      - Uncertainty also does not increase for CS1 trials; they remain fully predictable and unchanged between conditions.

      - Stress is too broad a construct to be meaningful unless clearly operationalized; moreover, any stress differences that arise from task structure would covary with caution rather than replace the interpretation.

      (4) Clarifying “types” of responses

      The reviewer’s question about “response types” appears to conflate behavioral latencies with the neuronal response “types” defined in the manuscript. The term “type” in this paper refers to neuronal activation derived from movement-based clustering, not to distinct behavioral categories of avoidance, which we term modes.

      In sum, we interpret increased CS1 latency as “caution” only when performance remains intact and trial structure is identical between conditions; under those criteria, latency reliably reflects additional cognitive evaluation before acting, rather than nonspecific changes in sensory processing, motivation, etc.
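      The two criteria above can be written as a simple rule. The following is only an illustrative sketch, not the analysis used in the manuscript: the function name, the 0.8 avoidance-rate cutoff, and the latency values are all hypothetical. It returns a latency shift only when avoidance performance is intact in both conditions, and otherwise declines to interpret latency at all.

```python
from statistics import median

def caution_shift(lat_base, lat_demand, rate_base, rate_demand, min_rate=0.8):
    """Median CS1 latency shift (demanding minus baseline task), in seconds.

    Returns None when the avoidance rate in either condition falls below
    min_rate (a hypothetical cutoff), because latency is then not treated
    as an index of caution.
    """
    if min(rate_base, rate_demand) < min_rate:
        return None
    return median(lat_demand) - median(lat_base)

# Made-up latencies (s) for identical CS1 trials in two task contexts.
aa1 = [2.0, 2.5, 3.0]
aa2 = [3.0, 3.5, 4.0]

shift_intact = caution_shift(aa1, aa2, rate_base=0.95, rate_demand=0.92)
shift_collapsed = caution_shift(aa1, [5.0, 6.0], rate_base=0.95, rate_demand=0.20)
```

      In the second call the avoidance rate has collapsed, so the longer latencies are deliberately left uninterpreted rather than read as greater caution.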

      Related to the last, I'm struggling to understand the rationale for dividing cells into 'types' based their physiological responses in some experiments.

      There is longstanding precedent in systems neuroscience for classifying neurons by their physiological response patterns, because neurons that respond similarly often play similar functional roles. For example, place cells, grid cells, direction cells, in vivo, and regular spiking, burst firing, and tonic firing in vitro are all defined by characteristic activity patterns in response to stimuli rather than anatomy or genetics alone. In the same spirit, our classifications simply reflect clusters of neurons that exhibit similar ΔF/F dynamics around behaviorally relevant events, such as movement sensitivity or avoidance modes. This is a standard analytic approach used in many studies. Thus, our rationale is not arbitrary: the “classes” and “types” arise from data-driven clustering of physiological responses, consistent with widespread practice, and they help reveal functional distinctions within the STN that would otherwise remain obscured.

      In several figures the number of subjects used was not described. This is necessary. Also necessary is some assessment of the variability across subjects.

      All the results described include the number of animals. To eliminate uncertainty, we now also include this information in figure legends.

      The only measure of error shown in many figures relates trial-to-trial or event variability, which is minimal because in many cases it appears that hundreds of trials may have been averaged per animal, but this doesn't provide a strong view of biological variability (i.e., are results consistent across animals?).

      The concern appears to stem from a misunderstanding of what the mixed-effects models quantify. Although the figure panels often show session-averaged traces for clarity, all statistical inferences in the paper are made at the level of animals, not trials. Mixed-effects modeling is explicitly designed for hierarchical datasets such as ours, where many trials are nested within sessions, which are themselves nested within animals.

      In our models, animal is the clustering (random) factor, and sessions are nested within animals, so variability across animals is directly estimated and used to compute the population-level effects. This approach is not only appropriate but is the most stringent and widely recommended method for analyzing behavioral and neural data with repeated measures. In other words, the significance tests and confidence intervals already fully incorporate biological variability across animals.

      Thus, although hundreds of trials per animal may be illustrated for visualization, the inferences reflect between-animal consistency, not within-animal trial repetition. The fact that the mixed-effects results are robust across animals supports the biological reliability of the findings.
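      The distinction between trial-level and animal-level variability can be illustrated with a toy calculation. This is not the mixed-effects model itself, only a simplified sketch with invented numbers and hypothetical names: it shows that an error estimate computed from pooled trials can be much smaller than one computed from per-animal means, which is why the level at which inference is made matters.

```python
from math import sqrt
from statistics import mean, stdev

def sem(values):
    """Standard error of the mean for a list of values."""
    return stdev(values) / sqrt(len(values))

# Invented data: each animal is internally consistent,
# but the animals differ from one another.
trials_by_animal = {
    "m1": [1.0, 1.1, 0.9, 1.0],
    "m2": [2.0, 2.1, 1.9, 2.0],
    "m3": [3.0, 3.1, 2.9, 3.0],
}

# Treating every trial as independent ignores the nesting within animals.
pooled = [x for trials in trials_by_animal.values() for x in trials]
sem_pooled = sem(pooled)

# Collapsing to one value per animal keeps the animal as the unit of inference.
animal_means = [mean(trials) for trials in trials_by_animal.values()]
sem_animals = sem(animal_means)
```

      A mixed-effects model goes further than either summary by estimating the between-animal variance explicitly while still using every trial.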

      It is not clear if or how spread of expression outside of target STN was evaluated, and if or how or how many mice were excluded due to spread or fiber placements. Inadequate histological validation is presented and neighboring regions that would be difficult to completely avoid, such as paraSTN may be contributing to some of the effects.

      The STN is a compact structure with clear anatomical boundaries, and our injections were rigorously validated to ensure targeting specificity. As detailed in the Methods, every mouse underwent histological verification, and injections were quantified using the Brain Atlas Analyzer app (available on OriginLab), which we developed to align serial sections to the Allen Brain Atlas. This approach provides precise, slice-by-slice confirmation of viral spread. We have performed thousands of AAV injections and probe implants in our lab, incorporating over the years highly reliable stereotaxic procedures with multiple depth and angle checks and tools. For this study specifically, fewer than 10% of mice were excluded due to off-target expression or fiber/lesion placement. None of the included cases showed spread into adjacent structures.

      Regarding paraSTN: anatomically, paraSTN is a very small extension contiguous with STN. Our study did not attempt to dissociate subregions within STN, and the viral expression patterns we report fall within the accepted boundaries of STN. Importantly, none of our photometry probes or miniscope lenses sampled paraSTN, so contributions from that region are extremely unlikely to account for any of our neural activity results.

      Finally, our paper employs five independent loss-of-function approaches—optogenetic inhibition of STN neurons, selective inhibition of STN projections to the midbrain (in two sites: SNr and mRt), and STN lesions (electrolytic and viral). All methods converge on the same conclusion, providing strong evidence that the effects we report arise from manipulation of STN itself rather than from neighboring regions.

      Raw example traces are not provided.

      We do not think raw traces are useful here. All figures contain average traces to reflect the average activity of the estimated populations, which are already clustered per classes and types.

      The timeline of the spontaneous movement and avoidance sessions were not clear, nor the number of events or sessions per animal and how this was set. It is not clear if there was pre-training or habituation, if many or variable sessions were combined per animal, or what the time gaps between sessions was, or if or how any of these parameters might influence interpretation of the results.

      As noted, we have enhanced the description of the sessions, including the number of animals and sessions; sessions are daily and their number is always equal across animals within each group of experiments. The sessions are part of the random effects in the model. In addition, we now include schematics to facilitate understanding of the procedures.

      Comments on revised version:

      The authors removed the optogenetic stimulation experiments, but then also added a lot of new analyses. Overall the scope of their conclusions are essentially unchanged. Part of the eLife model is to leave it to the authors discretion how they choose to present their work. But my overall view of it is unchanged. There are elements that I found clear, well executed, and compelling. But other elements that I found difficult to understand and where I could not follow or concur with their conclusions.

      We respectfully disagree with the assertion that the scope of our conclusions remains unchanged. The revised manuscript differs in several fundamental ways:

      (1) Removal of all optogenetic excitation experiments

      These experiments were a substantial portion of the original manuscript, and their removal eliminated an entire set of claims regarding the causal control of cautious responding by STN excitation. The revised manuscript no longer makes these claims.

      (2) Addition of analyses that directly address the reviewers’ central concerns

      The new analyses using mixed-effects modeling, window-specific covariates, and movement/baseline controls were added precisely because reviewers requested clearer dissociation of sensory, motor, and task-related contributions. These additions changed not only the presentation but the interpretation of the neural signals. We now conclude that STN encodes movement, caution, and aversive signals in separable ways, not that it exclusively or causally regulates caution.

      (3) Clear narrowing of conclusions

      Our current conclusions are more circumscribed and data-driven than in the original submission. For example, we removed all claims that STN activation “controls caution,” relying instead on loss-of-function data showing that STN is necessary for performing cued avoidance—not for generating cautious latency shifts. This is a substantial conceptual refinement resulting directly from the review process.

      (4) Reorganization to improve clarity

      Nearly every section has been restructured, including terminology (mode/type/class), figure organization, and explanations of behavioral windows. These revisions were implemented to ensure that readers can follow the logic of the analyses.

      We appreciate the reviewer’s recognition that several elements were clear and compelling. For the remaining points they found difficult to understand, we have addressed each one in detail in the response and revised the manuscript accordingly. If there are still aspects that remain unclear, we would welcome explicit identification of those points so that we can clarify them further.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Show individual data points on bar plots

      - partially addressed. Individual data points are still not shown.

      Wherever feasible, we display individual data points (e.g., Figures 1 and 2) to convey variability directly. However, in cases where figures depict hundreds of paired (repeated-measures) data points, showing all points without connecting them would not be appropriate, while linking them would make the figures visually cluttered and uninterpretable. All plots and traces include measures of variability (SEM), and the raw data will be shared on Dryad. When error bars are not visible, they are smaller than the trace thickness or bar line; for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Also, to minimize visual clutter, only a subset of relevant comparisons is highlighted with asterisks, whereas all relevant statistical results, comparisons, and mouse/session numbers are fully reported in the Results section, with statistical analyses accounting for the clustering of data within subjects and sessions.

      (2) The active avoidance experiments are confusing when they are introduced in the results section. More explanation of what paradigms were used and what each CS means at the time these are introduced would add clarity. For example AA1, AA2 etc are explained only with references to other papers, but a brief description of each protocol and a schematic figure would really help.

      - partially addressed. A schematic figure showing the timeline would still be helpful.

      As suggested, we have added an additional panel to Fig. 5A with a schematic describing the AA1-3 tasks. In addition, the avoidance protocols are described briefly but clearly in the Results section (second paragraph of “STN neurons activate during goal-directed avoidance contingencies”) and in greater detail in the Methods section. As stated, these tasks were conducted sequentially, and mice underwent the same number of sessions per procedure, which are indicated. All relevant procedural information has been included in these sections. Mice underwent daily sessions and learned these tasks within 1-2 sessions, progressing sequentially across tasks with an equal number of sessions per task (7 per task), and the resulting data were combined and clustered by mouse/session in the statistical models.

      (3) How do the Class 1, 2, 3 avoids relate to Class 1 , 2, 3 neural types established in Figure 3? It seems like they are not related, and if that is the case they should be named something different from each other to avoid confusion.

      -not sufficiently addressed. The new naming system of neural 'classes' and 'types' helps with understanding that these are completely different ways of separating subpopulations within the STN. However, it is still unclear why the authors re-type the neurons based on their relation to avoids, when they classify the neurons based on their relationship to speed earlier. And it is unclear whether these neural classes and neural types have anything to do with each other. Are the neural Types related to the neural classes in any way? and what is the overlap between neural types vs classes? Which separation method is more useful for functionally defining STN populations?

      The remaining confusion stems from treating several independent analyses as if they were different versions of the same classification. In reality, each analysis asks a distinct question, and the resulting groupings are not expected to overlap or correspond. We clarify this explicitly below.

      - Movement onset neuron classes (Class A, B, C; Fig. 3):

      These classes categorize neurons based on how their ΔF/F changes around spontaneous movement onset. This analysis identifies which neurons encode the initiation and direction of movement. For instance, Class B neurons (15.9%) were inhibited as movement slowed before onset but did not show sharp activation at onset, whereas Class C neurons (27.6%) displayed a pronounced activation time-locked to movement initiation. Directional analyses revealed that Class C neurons discharged strongly during contraversive turns, while Class B neurons showed a weaker ipsiversive bias. Because neurons were defined per session and many of these recordings did not include avoidance-task sessions, these movement-onset classes were not used in the avoidance analyses.

      - Movement-sensitivity neuron classes (Class 1, 2, 3, 4; Fig. 7):

      These classes categorize neurons based on the cross-correlation between ΔF/F and head speed, capturing how each neuron’s activity scales with movement features across the entire recording session. This analysis identifies neurons that are strongly speed-modulated, weakly speed-modulated, or largely insensitive to movement. These movement-sensitivity classes were then carried forward into the avoidance analyses to ask how neurons with different kinematic relationships participate during task performance; for example, whether neurons that are insensitive to movement nonetheless show strong activation during avoidance actions.

      - Avoidance modes (Mode 1, 2, 3; Fig. 8)

      Here we classify actions, not neurons. K-means clustering is applied to the movement-speed time series during CS1 active avoidance trials only, which allows us to identify distinct action modes or variants, such as fast-onset versus delayed avoidance responses. This action-based classification ensures that we compare neural activity across identical movements, eliminating a major confound in studies that do not explicitly separate action variants. First, we examine how population activity differs across these avoidance modes, reflecting neural encoding of the distinct actions themselves. Second, within each mode, we then classify neurons into “types,” which simply describe how different neurons activate during that specific avoidance action (as noted next).

      - Neuron activation types within each mode (Type a, b, c; Fig. 9)

      This analysis extends the mode-based approach by classifying neuronal activation patterns only within each specific avoidance mode. For each mode, we apply k-means clustering to the ΔF/F time series to identify three activation types, e.g., neurons showing little or no response, neurons showing moderate activation, and neurons showing strong or sharply timed activation. Because all trials within a mode have identical movement profiles, these activation types capture the variability of neural responses to the same avoidance behavior. Importantly, these activation “types” (a, b, c) are not global neuron categories. They do not correspond to, nor are they intended to map onto, the movement-based neuron classes defined earlier. Instead, they describe how neurons differ in their activation during a particular behavioral mode, that is, within a specific set of behaviorally matched trials. Because modes are defined at the trial level, the neurons contributing to each mode can differ: some neurons have trials belonging to one mode, others to two or all three. Thus, Type a/b/c groupings are not fixed properties of neurons. To prevent confusion, we refer to them explicitly as neuronal activation types, emphasizing that they characterize mode-specific response patterns rather than global cell identities.
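      The trial-level clustering that defines avoidance modes can be sketched as follows. This is a minimal illustration, not the implementation used for the paper: the speed traces are invented, the evenly spaced initialization is an assumption made so the toy example is deterministic, and the function names are hypothetical. Each row is the speed time series of one CS1 avoidance trial, and k-means groups trials with similar movement profiles.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length traces."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(series, k, n_iter=20):
    """Plain k-means on a list of equal-length time series.

    Initial centroids are evenly spaced rows of the input; this is an
    assumption for reproducibility, not a recommendation.
    """
    step = len(series) // k
    centroids = [list(series[i * step]) for i in range(k)]
    labels = [0] * len(series)
    for _ in range(n_iter):
        # Assign each trial to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(s, centroids[c]))
            for s in series
        ]
        # Recompute each centroid as the mean trace of its members.
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Invented speed traces (arbitrary units), five time bins after CS1 onset.
speed_traces = [
    [0, 5, 5, 1, 0], [0, 6, 4, 1, 0],   # fast-onset crossings
    [0, 0, 5, 5, 0], [0, 1, 6, 4, 0],   # intermediate crossings
    [0, 0, 0, 5, 5], [0, 0, 1, 6, 4],   # delayed crossings
]
mode_labels, mode_centroids = kmeans(speed_traces, k=3)
```

      Because the labels attach to trials rather than to neurons, a given neuron can contribute trials to one, two, or all three modes, which is why the within-mode activation types are not fixed cell identities.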

      In conclusion, the categorizations serve entirely different analytical purposes and should not be interpreted as competing classifications. The mode-specific “types” do not reclassify or replace the movement-sensitivity classes; they capture how neurons differ within a single, well-defined avoidance action, while the movement classes reflect how neurons relate to movements in general. Each classification relates to a different set of questions, and overlap between them is not expected.

      To make this as clear as possible we added the following paragraph to the Results:  

      “To avoid confusion between analyses, it is important to note that the movement-sensitivity classes defined here (Class 1–4; Fig. 7) are conceptually distinct from both the movement-onset classes (Class A–C; Fig. 3) and the neuronal activation “types” introduced later in the avoidance-mode analysis. The Class 1–4 grouping reflects how neurons relate to movement across the entire session, based on their cross-correlation with speed. The onset classes A–C capture neural activity specifically around spontaneous movement initiation during general exploration. In contrast, the later activation “types” are derived within each avoidance mode and describe how neurons differ in their activation patterns during identical CS1 avoidance responses. These classifications answer different questions about STN function and are not intended to correspond to one another.”

      (4) Similarly having 3 different cell types (a,b,c) in the active avoidance seems unrelated to the original classification of cell types (1,2,3), and these are different for each class of avoid. This is very confusing and it is unclear how any of these types relate to each other. Presumably the same mouse has all three classes of avoids, so there are recordings from each cell during each type of avoid. So the authors could compare one cell during each avoid and determine whether it relates to movement or sound or something else. It is interesting that types a,b,c have the exact same proportions in each class of avoid, and really makes it important to investigate if these are the exact same cells or not. Also, these mice could be recorded during open field so the original neural classification (class 1, 2,3) could be applied to these same cells and then the authors can see whether each cell type defined in the open field has different response to the different avoid types. As it stands, the paper simply finds that during movement and during avoidance behaviors different cells in the STN do different things. - Similarly, the authors somewhat addressed the neural types issue, but figure 9 still has 9 different neural types and it is unclear whether the same cells that are type 'a' in mode 1 avoids are also type 'a' in mode 2 avoids, or do some switch to type b? Is there consistency between cell types across avoid modes? The authors show that type 'c' neurons are differentially elevated in mode 3 vs 2, but also describe neurons as type '2c' and statistically compare them to type '1c' neurons. Are these the same neurons? or are type 2c neurons different cells vs type 1c neurons? This is still unclear and requires clarification to be interpretable.

      We believe the remaining confusion arises from treating the different classification schemes as if they were alternative labels applied to the same neurons, when in fact they serve entirely separate analytical purposes and may not include the same neurons (see previous point). Because these classifications answer different questions, they are not expected to overlap, nor is overlap required for the interpretations we draw. It is therefore not appropriate to compare a neuron’s “type” in one avoidance mode to its movement class, or to ask whether types a/b/c across different modes are “the same cells,” since modes are defined by trial-level movement clustering rather than by neuron identity. Importantly, Types a/b/c are not intended as a new global classification of neurons; they simply summarize the variability of neuronal responses within each behaviorally matched mode. We agree that future studies could expand our findings, but that is beyond the already wide scope of the present paper. Our current analyses demonstrate a key conceptual point: when movement is held constant (via modes), STN neurons still show heterogeneous, outcome- and caution-related patterns, indicating encoding that cannot be reduced to movement alone.

      Relatedly, was the association with speed used to define each neural "class" done in the active avoidance context or in a separate (e.g. open field) experiment? This is not clear in the text.

      The cross-correlation classes were derived from the entire recording session, which included open-field and avoidance task recordings. The tasks include long intertrial periods with spontaneous movements. We found no difference in classes when we included only a portion of the session, such as the open field, or when we excluded the avoidance interval where actions occur.

      Finally, in figure 7, why is there a separate avoid trace for each neural class? With the GRIN lens, the authors are presumably getting a sample of all cell types during each avoid, so why do the avoids differ depending on the cell type recorded?

      The entire STN population is not recorded within a single session; each session contributes only a subset of neurons to the dataset. Consequently, each neural class is composed of neurons drawn from partially non-overlapping sets of sessions, each with its own movement traces. For this reason, we plot avoidance traces separately for each neural class to maintain strict within-session correspondence between neural activity and the behavior collected in the same sessions. This prevents mixing behavioral data across sessions that did not contribute neurons to that class and ensures that all neural–behavioral comparisons remain appropriately matched. We have clarified this rationale in the revised manuscript. We note that averaging movement across classes, as is often done, would obscure these distinctions and would not preserve the necessary correspondence between neural activity and behavior. This is also clarified in Results.

      (5) The use of the same colors to mean two different things in figure 9 is confusing. AA1 vs AA2 shouldn't be the same colors as light-naïve vs light signaling CS.

      -addressed, but the authors still sometimes use the same colors to mean different things in adjacent figures (e.g. the red, blue, black colors in figure 1 and figure 2 mean totally different things) and use different colors within the same figure to represent the same thing (Figure 9AB vs Figure 9CD). This is suboptimal.

      Following the reviewer’s suggestion, in Figure 2, we changed the colors, so readers do not assume they are related to Fig. 1.

      In Figure 9, we changed the colors in C,D to match the colors in A,B.

      (6) The exact timeline of the optogenetics experiments should be presented as a schematic for understandability. It is not clear which conditions each mouse experienced in which order. This is critical to the interpretation of figure 9 and the reduction of passive avoids during STN stimulation. Did these mice have the CS1+STN stimulation pairing or the STN+US pairing prior to this experiment? If they did, the stimulation of the STN could be strongly associated with either punishment or with the CS1 that predicts punishment. If that is the case, stimulating the STN during CS2 could be like presenting CS1+CS2 at the same time and could be confusing. The authors should make it clear whether the mice were naïve during this passive avoid experiment or whether they had experienced STN stimulation paired with anything prior to this experiment.

      -addressed

      (7) Similarly, the duration of the STN stimulation should be made clear on the plots that show behavior over time (e.g. Figure 9E).

      -addressed

      (8) There is just so much data and so many conditions for each experiment here. The paper is dense and difficult to read. It would really benefit readability if the authors put only the key experiments and key figure panels in the main text and moved much of the repetitive figure panels to supplemental figures. The addition of schematic drawings for behavioral experiment timing and for the different AA1, AA2, AA3 conditions would also really improve clarity.

      -partially addressed. The paper is still dense and difficult to read. No experimental schematics were added.

      As suggested, we now added the schematic to Fig. 5A.  

      New Comments:

      (9) Description of the animals used and institutional approval are missing from the methods.

      The information on animal strains and institutional approval is already included in the manuscript. The first paragraph of the Methods section states:

      “… All procedures were reviewed and approved by the institutional animal care and use committee and conducted in adult (>8 weeks) male and female mice. …”

      Additionally, the next subsection, “Strains and Adeno-Associated Viruses (AAVs),” fully specifies all mouse lines used. We therefore believe that the required descriptions of animals and institutional approval are already present and meet standard reporting.

    1. Author response:

      The following is the authors’ response to the latest reviews:

      "One remaining question is the interpretation of matching variants with very low stable posterior probabilities (~0), which the authors have analyzed in detail but without fully conclusive findings. I agree with the authors that this event is relatively rare and the current sample size is limited but this might be something to keep in mind for future studies."

      Fine-mapping stability on matching variants with very low stable posterior probability

      We thank Reviewer 2 for encouraging us to think more about how low stable posterior probability matching variants can be interpreted. We describe a few plausible interpretations, even though – as Reviewer 2 and we have both acknowledged – our present experiments do not point to a clear and conclusive account.

      One explanation is that the locus captured by the variant might not be well-resolved, in the sense that many correlated variants exist around the locus. Thus, the variant itself is unlikely causal, but the set of variants in high LD with it may contain the true causal variant, or it's possible that the causal variant itself was not sequenced but lies in that locus. A comparison of LD patterns across ancestries at the locus would be helpful here.

      Another explanation rests on the following observation. For a variant to be matching between top and stable PICS and to also have very small stable PP, it has to have the largest PP after residualization on the ALL slice but also have positive PP with gene expression on many other slices. In other words, failing to control for potential confounders shrinks the PP. If one assumes that the matching variant is truly causal, then our observation points to an example of negative confounding (aka suppressor effect). This can occur when the confounders (PCs) are correlated with allele dosage at the causal variant in a different direction than their correlation with gene expression, so that the crude association between unresidualized gene expression and causal variant allele dosage is biased toward 0.
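      The suppressor effect described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the analysis in the manuscript: a single continuous confounder stands in for the PCs, a continuous variable stands in for allele dosage, all effect sizes are made up, and both dosage and expression are residualized on the confounder before the adjusted slope is computed.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

def residualize(v, c):
    """Residuals of v after regressing out c (with intercept)."""
    b = slope(c, v)
    mv, mc = sum(v) / len(v), sum(c) / len(c)
    return [vi - mv - b * (ci - mc) for vi, ci in zip(v, c)]

rng = random.Random(0)
n, beta = 5000, 0.5  # beta: assumed true effect of dosage on expression

# Confounder (stand-in for a genotype PC), positively correlated with dosage
# but pushing expression in the opposite direction.
pc = [rng.gauss(0, 1) for _ in range(n)]
dosage = [0.8 * c + rng.gauss(0, 0.6) for c in pc]
expr = [beta * g - 0.5 * c + rng.gauss(0, 0.5) for g, c in zip(dosage, pc)]

crude = slope(dosage, expr)  # no adjustment: biased toward 0
adjusted = slope(residualize(dosage, pc), residualize(expr, pc))

print(f"crude slope    = {crude:.3f}")
print(f"adjusted slope = {adjusted:.3f} (true effect {beta})")
```

      With these made-up parameters the expected crude slope is 0.5 − 0.5 × 0.8 = 0.1, so the unadjusted association is much weaker than the true effect, while the adjusted slope is close to 0.5.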

      Although our present study does not allow us to systematically confirm either interpretation – since we found that matching variants were depleted in causal variants in our simulations, violating the second argument, but we also found functional enrichment in analyses of GEUVADIS data though only 17 matching variants with low stable PP were reported – we believe a larger-scale study using larger cohort sizes (at least 1000 individuals per ancestry) and many more simulations (to increase yield of such cases) would be insightful.

      ———

      The following is the authors’ response to the original reviews:

      Reviewer #1:

      Major comments:

      (1) It would be interesting to see how much fine-mapping stability can improve the fine-mapping results in cross-population. One can simulate data using true genotype data and quantify the amount the fine-mapping methods improve utilizing the stability idea.

      We agree, and have performed simulation studies where we assume that causal variants are shared across populations. Specifically, by mirroring the simulation approach described in Wang et al. (2020), we generated 2,400 synthetic gene expression phenotypes across 22 autosomes, using GEUVADIS gene expression metadata (i.e., gene transcription start site) to ensure largely cis expression phenotypes were simulated. We additionally generated 1,440 synthetic gene expression phenotypes that incorporate environmental heterogeneity, to motivate our pursuit of fine-mapping stability in the first place (see Response to Reviewer 2, Comment 6). These are described in Results section “Simulation study”:

      We evaluated the performance of the PICS algorithm, specifically comparing the approach incorporating stability guidance against the residualization approach that is more commonly used — similar to our application to the real GEUVADIS data. We additionally investigated two ways of “combining” the residualization and stability guidance approaches: (1) running stability-guided PICS on residualized phenotypes; (2) prioritizing matching variants returned by both approaches. See Response to Reviewer 2, Comment 5.

      (2) I would be very interested to see how other fine-mapping methods (FINEMAP, SuSiE, and CAVIAR) perform via the stability idea.

      Thank you for this valuable comment. We ran SuSiE on the same set of simulated datasets. Specifically, we ran a version that uses residualized phenotypes (supposedly removing the effects of population structure), and also a version that incorporates stability. The second version is similar to how we incorporate stability in PICS. We investigated the performance of Stable SuSiE in a similar manner to our investigation of PICS. First we compared the performance relative to SuSiE that was run on residualized phenotypes. Motivated by our finding in PICS that prioritizing matching variants improves causal variant recovery, we did the same analysis for SuSiE. This analysis is described in Results section “Stability guidance improves causal variant recovery in SuSiE.”

      We reported overall matching frequencies and causal variant recovery rates of top and stable variants for SuSiE in Figures 2C&D.

      Frequencies with which Stable and Top SuSiE variants match, stratified by the simulation parameters, are summarized in Supplementary File 2C (reproduced for convenience in Response to Reviewer 2, Comment 3). Causal variant recovery rates split by the number of causal variants simulated, and stratified by both signal-to-noise ratio and the number of credible sets included, are reported in Figure 2—figure supplements 16-18. We reproduce Figure 2—figure supplement 18 (three causal variants scenario) below for convenience. Analogous recovery rates for matching versus non-matching top or stable variants are reported in Figure 2—figure supplements 19, 21 and 23.

      (3) I am a little bit concerned about the PICS's assumption about one causal variant. The authors mentioned this assumption as one of their method limitations. However, given the utility of existing fine-mapping methods (FINEMAP and SuSiE), it is worth exploring this domain.

      Thank you for raising this fair concern. We explored this domain, by considering simulations that include two and three causal variants (see Response to Reviewer 2, Comment 3). We looked at how well PICS recovers causal variants, and found that each potential set largely does not contain more than one causal variant (Figure 2—figure supplements 20 and 22). This can be explained by the fact that PICS potential sets are constructed from variants with a minimum linkage disequilibrium to a focal variant. On the other hand, in SuSiE, we observed multiple causal variants appearing in lower credible sets when applying stability guidance (Figure 2—figure supplements 21 and 23). A more extensive study involving more fine-mapping methods and metrics specific to violation of the one causal variant assumption could be pursued in future work.

      Reviewer #2:

      Aw et al. present a new stability-guided fine-mapping method by extending the previously proposed PICS method. They applied their stability-based method to fine-map cis-eQTLs in the GEUVADIS dataset and compared it against what they call the residualization-based method. They evaluated the performance of the proposed method using publicly available functional annotations and claimed the variants identified by their proposed stability-based method are more enriched for these functional annotations.

      While the reviewer acknowledges the contribution of the present work, there are a couple of major concerns as described below.

      Major:

      (1) It is critical to evaluate the proposed method in simulation settings, where we know which variants are truly causal. While I acknowledge their empirical approach using the functional annotations, a more unbiased, comprehensive evaluation in simulations would be necessary to assess its performance against the existing methods.

      Thank you for this point. We agree. We have performed a simulation study where we assume that causal variants are shared across populations (see response to Reviewer 1, Comment 1). Specifically, by mirroring the simulation approach described in Wang et al. (2020), we generated 2,400 synthetic gene expression phenotypes across 22 autosomes, using GEUVADIS gene expression metadata (i.e., gene transcription start site) to ensure cis expression phenotypes were simulated.

      (2) Also, simulations would be required to assess how the method is sensitive to different parameters, e.g., LD threshold, resampling number, or number of potential sets.

      Thank you for raising this point. The underlying PICS algorithm was not proposed by us, so we followed the default parameters set (LD threshold, r<sup>2</sup> = 0.5; see Taylor et al., 2021 Bioinformatics) to focus on how stability considerations will impact the existing fine-mapping algorithm. We attempted to derive the asymptotic joint distribution of the p-values, but it was too difficult. Hence, we used 500 permutations because such a large number would allow large-sample asymptotics to kick in. However, following your critical suggestion we varied the number of potential sets in our analyses of simulated data. We briefly mention this in the Results.

      “In the Supplement, we also describe findings from investigations into the impact of including more potential sets on matching frequency and causal variant recovery…”

      A detailed write-up is provided in Supplementary File 1 Section S2 (p.2):

      “The number of credible or potential sets is a parameter in many fine-mapping algorithms. Focusing on stability-guided approaches, we consider how including more potential sets for stable fine-mapping algorithms affects both causal variant recovery and matching frequency in simulations…

      Causal variant recovery. We investigate both Stable PICS and Stable SuSiE. Focusing first on simulations with one causal variant, we observe a modest gain in causal variant recovery for both Stable PICS and Stable SuSiE, most noticeably when the number of sets was increased from 1 to 2 under the lowest signal-to-noise ratio setting…”

      We observed that increasing the number of potential sets helps with recovering causal variants for Stable PICS (Figure 2—figure supplements 13-15). This observation also accounts for the comparable power that Stable PICS has with SuSiE in simulations with low signal-to-noise ratio (SNR), when we increase the number of credible sets or potential sets (Figure 2—figure supplements 10-12).

      (3) Given the previous studies have identified multiple putative causal variants in both GWAS and eQTL, I think it's better to model multiple causal variants in any modern fine-mapping methods. At least, a simulation to assess its impact would be appreciated.

      We agree. In our simulations we considered up to three causal variants in cis, and evaluated how well the top three Potential Sets recovered all causal variants (Figure 2—figure supplements 13-15; Figure 2—figure supplement 15). We also reported the frequency of variant matches between Top and Stable PICS stratified by the number of causal variants simulated in Supplementary File 2B and 2C. Note Supplementary File 2C is for results from SuSiE fine-mapping; see Response to Reviewer 1, Comment 2.

      Supplementary File 2B. Frequencies with which Stable and Top PICS have matching variants for the same potential set. For each SNR/ “No. Causal Variants” scenario, the number of matching variants is reported in parentheses.

      Supplementary File 2C. Frequencies with which Stable and Top SuSiE have matching variants for the same credible set. For each SNR/ “No. Causal Variants” scenario, the number of matching variants is reported in parentheses.

      (4) Relatedly, I wonder what fraction of non-matching variants are due to the lack of multiple causal variant modeling.

      PICS handles multiple causal variants by including more potential sets to return, owing to the important caveat that causal variants in high LD cannot be statistically distinguished. For example, if one believes there are three causal variants that are not too tightly linked, one could make PICS return three potential sets rather than just one. To answer the question using our simulation study, we subsetted our results to just scenarios where the top and stable variants do not match. This mimics the exact scenario of having modeled multiple causal variants but still not yielding matching variants, so we can investigate whether these non-matching variants are in fact enriched in the true causal variants.

      Because we expect causal variants to appear in some potential set, we specifically considered whether these non-matching causal variants might match along different potential sets across the different methods. In other words, we compared the stable variant with the top variant from another potential set for the other approach (e.g., Stable PICS Potential Set 1 variant vs Top PICS Potential Set 2 variant). First, we computed the frequency with which such pairs of variants match. A high frequency would demonstrate that, even if the corresponding potential sets do not have a variant match, there could still be a match between non-corresponding potential sets across the two approaches, which shows that multiple causal variant modeling boosts identification of matching variants between both approaches — regardless of whether the matching variant is in fact causal.

      Low frequencies were observed. For example, when restricting to simulations where Top and Stable PICS Potential Set 1 variants did not match, about 2-3% of variants matched between the Potential Set 1 variant in Stable PICS and Potential Sets 2 and 3 variants in Top PICS; or between the Potential Set 1 variant in Top PICS and Potential Sets 2 and 3 variants in Stable PICS (Supplementary File 2D). When looking at non-matching Potential Set 2 or Potential Set 3 variants, we do see an increase in matching frequencies (between 10-20%) between Potential Set 2 variants and other potential set variants between the different approaches. However, these percentages are still small compared to the matching frequencies we observed between corresponding potential sets (e.g., for simulations with one causal variant this was 70-90% between Top and Stable PICS Potential Set 1, and for simulations with two and three causal variants this was 55-78% and 57-79% respectively).

      We next checked whether these “off-diagonal” matching variants corresponded to the true causal variants simulated. Here we find that the causal variant recovery rate is mostly less than the corresponding rate for diagonally matching variants, which together with the low matching frequency suggests that the enrichment of causal variants of “off-diagonal” matching variants is much weaker than in the diagonally matching approach. In other words, the fraction of non-matching (causal) variants due to the lack of multiple causal variant modeling is low.
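      The diagonal and “off-diagonal” matching frequencies discussed above can be computed along the following lines. This is a minimal sketch: the input layout (one tuple of lead-variant IDs per simulation, ordered by potential set) and the variant IDs are hypothetical, and the function is not taken from the authors' code.

```python
def matching_frequencies(top, stable, ref_set=0):
    """Diagonal and off-diagonal matching frequencies for one potential set.

    top, stable: lists with one tuple per simulation; element k of a tuple is
    the variant reported for potential set k + 1 (hypothetical layout).
    """
    pairs = list(zip(top, stable))
    # Diagonal: the same potential set reports the same variant in both approaches.
    diagonal = sum(t[ref_set] == s[ref_set] for t, s in pairs) / len(pairs)

    # Off-diagonal frequencies are computed only where the diagonal does not match.
    nonmatch = [(t, s) for t, s in pairs if t[ref_set] != s[ref_set]]
    if not nonmatch:
        return diagonal, None, None

    def others(sets):
        # Variants reported for all potential sets except the reference one.
        return sets[:ref_set] + sets[ref_set + 1:]

    stable_in_top = sum(s[ref_set] in others(t) for t, s in nonmatch) / len(nonmatch)
    top_in_stable = sum(t[ref_set] in others(s) for t, s in nonmatch) / len(nonmatch)
    return diagonal, stable_in_top, top_in_stable

# Four made-up simulations with three potential sets each.
top = [("v1", "v2", "v3"), ("v4", "v5", "v6"), ("v7", "v8", "v9"), ("v10", "v11", "v12")]
stable = [("v1", "v2", "v3"), ("v5", "v4", "v6"), ("v70", "v80", "v9"), ("v10", "v99", "v12")]

diag, s_in_t, t_in_s = matching_frequencies(top, stable, ref_set=0)
print(diag, s_in_t, t_in_s)
```

      Checking whether an off-diagonal match is also a true causal variant would additionally require the simulated causal variant IDs for each simulation, which are omitted here.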

      We discuss these findings in Supplementary File 1 Section S2 (bottom of p.2).

      (5) I wonder if you can combine the stability-based and the residualization-based approach, i.e., using the residualized phenotypes for the stability-based approach. Would that further improve the accuracy or not?

      This is a good idea, thank you for suggesting it. We pursued this combined approach on simulated gene expression phenotypes, but did not observe significant gains in causal variant recovery (Figure 2B; Figure 2—figure supplements 2, 13 and 15). We reported this in Results “Searching for matching variants between Top PICS and Stable PICS improves causal variant Recovery.”

      “We thus explore ways to combine the residualization and stability-driven approaches, by considering (i) combining them into a single fine-mapping algorithm (we call the resulting procedure Combined PICS); and (ii) prioritizing matching variants between the two algorithms. Comparing the performance of Combined PICS against both Top and Stable PICS, however, we find no significant difference in its ability to recover causal variants (Figure 2B)...”

      However, we also confirmed in our simulations that prioritizing matching variants between the two approaches led to gains in causal variant recovery (Figure 2D; Figure 2—figure supplements 4, 19, 20 and 22). We reported this in Results “Searching for matching variants between Top PICS and Stable PICS improves causal variant Recovery.”

      “On the other hand, matching variants between Top and Stable PICS are significantly more likely to be causal. Across all simulations, a matching variant in Potential Set 1 is 2.5X as likely to be causal than either a non-matching top or stable variant (Figure 2D) — a result that was qualitatively consistent even when we stratified simulations by SNR and number of causal variants simulated (Figure 2—figure supplements 19, 20 and 22)...”

      This finding is consistent with our analysis of real GEUVADIS gene expression data, where we reported larger functional significance of matching variants relative to non-matching variants returned by either Top or Stable PICS.

      (6) The authors state that confounding in cohorts with diverse ancestries poses potential difficulties in identifying the correct causal variants. However, I don't see that they directly address whether the stability approach is mitigating this. It is hard to say whether the stability approach is helping beyond what simpler post-hoc QC (e.g., thresholding) can do.

      Thank you for raising this fair point. Here is a model we have in mind. Gene expression phenotypes (Y) can be explained by both genotypic effects (G, as in genotypic allelic dosage) and the environment (E): Y = G + E. However, both G and E depend on ancestry (A), so that Y = G|A+E|A. Suppose that the causal variants are shared across ancestries, so that (G|A=a)=G for all ancestries a. Suppose however that environments are heterogeneous by ancestry: (E|A=a) = e(a) for some function e that depends non-trivially on a. This would violate the exchangeability of exogenous E in the full sample, but by performing fine-mapping on each ancestry stratum, the exchangeability of exogenous E is preserved. This provides theoretical justification for the stability approach.
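      A toy simulation can make the model above concrete. It is a sketch only: the two strata, their allele frequencies, the effect size and the sample sizes are made up, and the example additionally assumes that allele frequency differs between strata (which the argument above does not state) so that the ancestry-dependent shift in E becomes correlated with dosage in the pooled sample.

```python
import random

def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

rng = random.Random(0)
beta, sigma, n = 0.5, 1.0, 2000  # shared causal effect, noise SD, individuals per stratum

# Hypothetical strata: (allele frequency, mean of E). The causal effect beta is
# shared, but E is shifted by 2*sigma in stratum "B" only.
strata = {"A": (0.2, 0.0), "B": (0.6, 2.0 * sigma)}

data = {}
for name, (freq, e_mean) in strata.items():
    # Allele dosage in {0, 1, 2}: two independent draws at the stratum frequency.
    g = [sum(rng.random() < freq for _ in range(2)) for _ in range(n)]
    # Y = G effect + E, with E ~ N(e_mean, sigma^2) within the stratum.
    y = [beta * gi + rng.gauss(e_mean, sigma) for gi in g]
    data[name] = (g, y)

pooled_g = data["A"][0] + data["B"][0]
pooled_y = data["A"][1] + data["B"][1]

pooled = slope(pooled_g, pooled_y)  # E is not exchangeable across strata
stratified = {name: slope(g, y) for name, (g, y) in data.items()}

print(f"pooled slope = {pooled:.3f} (true effect {beta})")
for name, s in stratified.items():
    print(f"stratum {name} slope = {s:.3f}")
```

      Within each stratum E has a constant mean, so the per-stratum slopes stay near the shared effect, whereas the pooled slope absorbs the between-stratum shift in E.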

      We next turned to simulations, where we investigated 1,440 simulated gene expression phenotypes capturing various ways in which ancestry induces heterogeneity in the exogenous E variable (simulation details in Lines 576-610 of Materials and Methods). We ran Stable PICS, as well as a version of PICS that did not residualize phenotypes or apply the stability principle. We observed that (i) causal variant recovery performance was not significantly different between the two approaches (Figure 2—figure supplements 24-32); but (ii) disagreement between the approaches can be considerable, especially when the signal-to-noise ratio is low (Supplementary File 2A). For example, in a set of simulations with three causal variants, with SNR = 0.11 and E heterogeneous by ancestry by letting E be drawn from N(2σ,σ<sup>2</sup>) for only GBR individuals (rest are N(0,σ<sup>2</sup>)), there was disagreement between Potential Set 1 and 2 variants in 25% of simulations — though recovery rates were similar (Probability of recovering at least one causal variant: 75% for Plain PICS and 80% for Stable PICS). These points suggest that confounding in cohorts can reduce power in methods not adjusting or accounting for ancestral heterogeneity, but can be remedied by approaches that do so. We report this analysis in Results “Simulations justify exploration of stability guidance”

      In the current version of our work, we have evaluated, using both simulations and empirical evidence, different ways to combine approaches to boost causal variant recovery. Our simulation study shows that prioritizing matching variants across multiple methods improves causal variant recovery. On GEUVADIS data, where we might not know which variants are causal, we already demonstrated that matching variants are enriched for functional annotations. Therefore, our analyses justify that the adverse consequence of confounding on reducing fine-mapping accuracy can be mitigated by prioritizing matching variants between algorithms including those that account for stability.

      (7) For non-matching variants, I wonder what the difference of posterior probabilities is between the stable and top variants in each method. If the difference is small, maybe it is due to noise rather than signal.

      We have reported differences in posterior probabilities returned by Stable and Top PICS for GEUVADIS data; see Figure 3—figure supplement 1. For completeness, we compute the differences in posterior probabilities and summarize these differences both as histograms and as numerical summary statistics.

      Potential Set 1

      - Number of non-matching variants = 9,921

      - Table of Summary Statistics of (Stable Posterior Probability – Top Posterior Probability)

      Author response table 1.

      - Histogram of (Stable Posterior Probability – Top Posterior Probability)

      Author response image 1.

      Potential Set 2

      - Number of non-matching variants = 14,454

      - Table of Summary Statistics of (Stable Posterior Probability – Top Posterior Probability)

      Author response table 2.

      - Histogram of (Stable Posterior Probability – Top Posterior Probability)

      Author response image 2.

      Potential Set 3

      - Number of non-matching variants = 16,814

      - Table of Summary Statistics of (Stable Posterior Probability – Top Posterior Probability)

      Author response table 3.

      - Histogram of (Stable Posterior Probability – Top Posterior Probability)

      Author response image 3.

      We also compared the difference in posterior probabilities between non-matching variants returned by Stable PICS and Top PICS for our 2,400 simulated gene expression phenotypes. Focusing on just Potential Set 1 variants, we find two equally likely scenarios, as demonstrated by two distinct clusters of points in a “posterior probability-posterior probability” plot. The first is, as pointed out, a small difference in posterior probability (points lying close to y=x). The second, however, reveals stable variants with very small posterior probability (of order 4 x 10<sup>–5</sup> to 0.05) but with a non-matching top variant taking on posterior probability well distributed along [0,1]. Moving down to Potential Sets 2 and 3, the distribution of pairs of posterior probabilities appears less clustered, indicating less tendency for posterior probability differences to be small (Figure 2—figure supplement 8).

      Here are the histograms and numerical summary statistics.

      Potential Set 1

      - Number of non-matching variants = 663 (out of 2,400)

      - Table of Summary Statistics of (Stable Posterior Probability – Top Posterior Probability)

      Author response table 4.

      - Histogram of (Stable Posterior Probability – Top Posterior Probability)

      Author response image 4.

      Potential Set 2

      - Number of non-matching variants = 1,429 (out of 2,400)

      - Table of Summary Statistics of (Stable Posterior Probability – Top Posterior Probability)

      Author response table 5.

      - Histogram of (Stable Posterior Probability – Top Posterior Probability)

      Author response image 5.

      Potential Set 3

      - Number of non-matching variants = 1,810 (out of 2,400)

      - Table of Summary Statistics of (Stable Posterior Probability – Top Posterior Probability)

      Author response table 6.

      - Histogram of (Stable Posterior Probability – Top Posterior Probability)

      Author response image 6.
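
      As an illustration of the kind of summary described above (this is not the authors' code), the following sketch computes the differences (Stable Posterior Probability – Top Posterior Probability), a table of summary statistics, and histogram bin counts. The function name `summarize_differences`, the bin count, and the five example values are all hypothetical; the real tables and images are those referenced in the response.

```python
import statistics


def summarize_differences(stable, top, n_bins=10):
    """Summarize (stable - top) posterior probability differences.

    Returns a dict of summary statistics and a list of histogram
    counts over equal-width bins covering [-1, 1].
    """
    diffs = [s - t for s, t in zip(stable, top)]
    q1, median, q3 = statistics.quantiles(diffs, n=4)
    summary = {
        "n": len(diffs),
        "min": min(diffs),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(diffs),
        "q3": q3,
        "max": max(diffs),
    }
    # Differences of two probabilities always lie in [-1, 1].
    width = 2.0 / n_bins
    counts = [0] * n_bins
    for d in diffs:
        idx = min(int((d + 1.0) / width), n_bins - 1)  # clamp d == 1 into last bin
        counts[idx] += 1
    return summary, counts


# Hypothetical posterior probabilities for five non-matching variants.
stable_pp = [0.02, 0.40, 0.10, 0.00004, 0.55]
top_pp = [0.90, 0.45, 0.12, 0.70, 0.60]
summary, counts = summarize_differences(stable_pp, top_pp)
print(summary)
print(counts)
```

      Reporting the histogram alongside the summary table is useful here because two clusters of differences (near zero and strongly negative) would be hidden by a mean or median alone.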

      (8) It's a bit surprising that you observed matching variants with (stable) posterior probability ~ 0 (SFig. 1). What are the interpretations for these variants? Do you observe functional enrichment even for low posterior probability matching variants?

      Thank you for this question. We have performed a thorough analysis of matching variants with very low stable posterior probability, which we define as having a posterior probability < 0.01 (Supplementary File 1 Section S11). Here, we briefly summarize the analysis and key findings.

      Analysis

      First, such variants occur very rarely — only 8 across all three potential sets in simulations, and 17 across all three potential sets for GEUVADIS (the latter variants are listed in Supplementary 2E). We begin interpreting these variants by looking at allele frequency heterogeneity by ancestry, support size — defined as the number of variants with positive posterior probability in the ALL slice* — and the number of slices including the stable variant (i.e., the stable variant reported positive posterior probability for the slice).

      *Note that the stable variant posterior probability need not be at least 1/(Support Size). This is because the algorithm may have picked a SNP that has a lower posterior probability in the ALL slice (i.e., not the top variant) but happens to appear in the largest number of other slices (i.e., a stable variant).
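
      The note above can be made concrete with a toy example. If the ALL-slice posterior probabilities sum to 1 over the support, the top variant must have posterior probability of at least 1/(Support Size), but a variant chosen for appearing in the most slices carries no such guarantee. The sketch below is a simplified stand-in for the selection rule, not the Stable PICS implementation: the function `top_and_stable`, the tie-breaking rule, the slice names, and all numbers are assumptions made for illustration.

```python
def top_and_stable(slices):
    """Pick the top and the stable variant from per-slice posteriors.

    `slices` maps a slice name to a dict {variant: posterior probability};
    the "ALL" slice must be present. Simplifying assumptions: candidates
    are restricted to the ALL-slice support, and ties in slice count are
    broken by ALL-slice posterior probability.
    """
    all_post = slices["ALL"]
    support = [v for v, p in all_post.items() if p > 0]

    def n_slices(variant):
        # Number of slices in which the variant has positive posterior.
        return sum(1 for post in slices.values() if post.get(variant, 0.0) > 0)

    top = max(support, key=lambda v: all_post[v])
    stable = max(support, key=lambda v: (n_slices(v), all_post[v]))
    return top, stable, len(support)


# Hypothetical posteriors: snpC is weak in ALL but present in every slice.
example_slices = {
    "ALL": {"snpA": 0.70, "snpB": 0.25, "snpC": 0.05},
    "slice1": {"snpA": 0.90, "snpC": 0.10},
    "slice2": {"snpB": 0.80, "snpC": 0.20},
    "slice3": {"snpC": 1.00},
}
top, stable, support_size = top_and_stable(example_slices)
print(top, stable, support_size)
print(example_slices["ALL"][stable], 1 / support_size)
```

      In this toy case the stable variant has an ALL-slice posterior probability of 0.05, below 1/3, while the top variant (0.70) respects the bound.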

      For variants arising from simulations, because we know the true causal variants, we check if these variants are causal. For GEUVADIS fine-mapped variants, we rely on functional annotations to compare their relative enrichment against other matching variants that did not have very low stable posterior probability.

      Findings

      While we caution against generalizing from observations reported here, which are based on very small sample sizes, we noticed the following. In simulations, matching variants with very low stable posterior probability are largely depleted in causal variants, although factors such as the number of slices including the stable variant may still be useful. In GEUVADIS, however, these variants can still be functionally enriched. We reported three examples in Supplementary File 1 Section S11 (pp. 8-9 of Supplement), where the variants were enriched in either VEP or biologically interpretable functional annotations, and were also reported in earlier studies. We partially reproduce our report below for convenience.

      “However, we occasionally found variants that stand out for having large functional annotation scores. We list one below for each potential set.

      - Potential Set 1 reported the variant rs12224894 from fine-mapping ENSG00000255284.1 (accession code AP006621.3) in Chromosome 11. This variant stood out for lying in the promoter flanking region of multiple cell types and being relatively enriched for GC content with a 75bp flanking region. This variant has been reported as a cis eQTL for AP006632 (using whole blood gene expression, rather than lymphoblastoid cell line gene expression in this study) in a clinical trial study of patients with systemic lupus erythematosus (Davenport et al., 2018). Its nearest gene is GATD1, a ubiquitously expressed gene that codes for a protein and is predicted to regulate enzymatic and catabolic activity. This variant appeared in all 6 slices, with a moderate support size of 23.

      - Potential Set 2 reported the variant rs9912201 from fine-mapping ENSG00000108592.9 (mapped to FTSJ3) in Chromosome 17. Its FIRE score is 0.976, which is close to the maximum FIRE score reported across all Potential Set 2 matching variants. This variant has been reported as a SNP in high LD to a GWAS hit SNP rs7223966 in a pan-cancer study (Gong et al., 2018). This variant appeared in all 6 slices, with a moderate support size of 32.

      - Potential Set 3 reported the variant rs625750 from fine-mapping ENSG00000254614.1 (mapped to CAPN1-AS1, an RNA gene) in Chromosome 11. Its FIRE score is 0.971 and its B statistic is 0.405 (region under selection), which lie at the extreme quantiles of the distributions of these scores for Potential Set 3 matching variants with stable posterior probability at least 0.01. Its associated mutation has been predicted to affect transcription factor binding, as computed using several position weight matrices (Kheradpour and Kellis, 2014). This variant appeared in just 3 slices, possibly owing to the considerable allele frequency difference between ancestries (maximum AF difference = 0.22). However, it has a small support size of 4 and a moderately high Top PICS posterior probability of 0.64.

      To summarize, our analysis of GEUVADIS fine-mapped variants demonstrates that matching variants with very low stable posterior probability could still be functionally important, even for lower potential sets, conditional on supportive scores in interpretable features such as the number of slices containing the stable variant and the posterior probability support size…”

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1

      Evidence, reproducibility and clarity

      __Summary

      Köver et al. examine the genetic and environmental underpinnings of multicellular-like phenotypes (MLPs) in fission yeast, studying 57 natural isolates of Schizosaccharomyces pombe. They uncover that a noteworthy subset of these isolates can develop MLPs, with the extent of these phenotypes varying according to growth media. Among these, two strains demonstrate pronounced MLP across a range of conditions. By genetically manipulating one strain with an MLP phenotype (distinct from the previously mentioned two strains), they provide evidence that genes such as MBX2 and SRB11 play a direct role in MLP formation, strengthening their genetic mapping findings. The study also reveals that while some key genes and their phenotypic effects are strikingly similar between budding and fission yeast, other aspects of MLP formation are not conserved, which is an intriguing finding.

      Overall, the manuscript is well-written, dense yet logically structured, and the figures are well presented. The combination of phenotypic, genetic, and bioinformatics analyses, particularly from wet lab experiments, is commendable. The study addresses a significant gap in our understanding, primarily explored in budding yeast, by providing comprehensive data on MLP diversity in fission yeast and the interplay of genetic and environmental factors.

      In summary, I enjoyed reading the manuscript and have only a few minor suggestions to strengthen the paper:

      Minor revisions:

      1. Although this may seem like a minor revision, it is a crucial point. Please make sure that all raw data used to generate figures, run stats, sequence data, and scripts used to run data analysis are made publicly available. Provide relevant accession numbers and links to public data repositories. It is important that others can download the various types of data that went into the major conclusions of this paper in order to replicate your analysis or expand upon the scope of this work. I am not sure if the journal has a policy regarding this, but it should be followed to allow for transparency and reproducibility of the research.__

      Reply: We very much agree with the reviewer that sharing raw data and scripts is an essential part of open science. All code and data are deposited on GitHub (https://github.com/BKover99/S.-Pombe-MLPs) and Figshare (https://figshare.com/articles/software/S_-Pombe-MLPs/25750980), which have now been updated to reflect our revisions. Additionally, the sequenced genomes have been deposited in ENA (PRJEB69522). Where external data were used, they were properly referenced and specifically included in Supplementary Table 3.

      Two out of 57 strains exhibit strong and consistent MLP across multiple environments. Providing more information on these strains (JB914 and JB953), such as their natural habitats and distinct appearances of their MLP phenotypes under varying conditions, would provide valuable insights.

      First, a brief discussion highlighting what differentiates these two strains from the rest would be helpful for readers (e.g. insight into their unique genetic and environmental background that might be linked to the MLP phenotype).

      Additionally, culture tube and microscopy images of these strains, similar to those presented for JB759 in Figure 2A, can be included in the supplementary materials. My reasoning is that these images could help illustrate variation or lack thereof in aggregative group size across different media.

      Reply: We thank the reviewer for highlighting this issue. Our further investigation into these strains has added additional interesting insights. JB914 and JB953 were isolated from molasses in Jamaica and the exudate of Eucalyptus in Australia, respectively, though it remains unclear whether these environments are related or even selective for the ability of these strains to form MLPs. We note that the environment from which a strain is isolated is an incomplete way of assessing its ecology. Indeed, recent research suggests that the primary habitat of S. pombe is honeybee honey and suggests that bees, which may be attracted to a number of sugary substances, may be a vector by which fission yeast are transported (1). Therefore, isolation from a particular nectar or food production environment might not reflect significant ecological differences. We now refer to the location of strain isolation in the manuscript text (lines 208-209).

      However, there is more to learn from the genetic backgrounds of these two strains. We found that JB914 possesses the same variant in srb11 causally related to MLPs as JB759, the MLP-forming parental strain for our QTL analysis. To understand whether the appearance of this variant in these two strains derived from a single mutation event or was a case of convergent evolution, we analysed homology between the genomes of JB759 and JB914, focusing specifically on that variant. We found an approximately 20 kb region of homology between JB759 and JB914 surrounding the srb11 truncation variant, in contrast to the majority of the genome, which does not share homology between those two strains (New Supplementary Figure 9A, B). This result suggests that, while the two strains are largely unrelated, that specific region shares a recent common ancestor and is likely a result of interbreeding across strains.

      Importantly, this analysis further emphasizes the point that the srb11 variant segregates with the MLP-forming phenotype. We conclude this because none of the other strains similar to JB759 (either across the whole genome, or specifically in the region surrounding srb11) exhibit MLPs (New Supplementary Figure 9C). This thereby further complements our QTL analysis on the significance of this variant. We have added this analysis to the manuscript text (lines 337-349).

      Furthermore, we searched other strains which exhibited MLPs in our experiments (e.g. JB953) for frame shifts, insertions or deletions in any other genes in the CKM module or in the genes that were identified in our deletion library screen as adhesive, and did not identify any severe mutations falling into coding regions (other than the srb11 truncation in JB914 and JB759). This indicates that MLPs in these other strains may be caused by differences in regulatory regions surrounding these genes, or variants in other genes that were not identified in our screen. We have added this analysis to our manuscript (lines 424-425) and Supplementary Table 13.

      We agree that microscopy and culture tube images of JB914 and JB953 may give insight into the nature of the MLPs exhibited by those strains. We have included such images of cultures grown in YES, EMM and EMM-Phosphate media in our revision (Lines 207-208, Supplementary Figures 4 and 5). These images are consistent with our adhesion assay screen and show that JB914 and JB953 are adhesive at the microscopic level in the relevant conditions (EMM or EMM-Phosphate).

      The phenotypic outcome of overexpressing MBX2 is striking, as shown in Supplementary Figure 4C. Incorporating at least one of the culture tube images depicting large flocs into the main text, perhaps adjacent to Figure 3 panel D, would improve the visual appeal and highlight this key finding (at the moment those images are only shown in the supplementary materials).

      Reply: We thank the reviewer for this suggestion. In response to Reviewer 2's suggestion to overexpress mbx2 in YES, we created new mbx2 overexpression strains that could overexpress mbx2 in YES, which was not possible in our previous strain in which mbx2 overexpression was triggered by removal of thymine from the media. We have replaced our original data from Figure 3D with data from the new mbx2 overexpression experiment, including flask images.

      I know that the authors discuss the knowledge gap in the intro and results, but the abstract does not mention this critical gap. Please stress this critical gap (i.e., MLPs understudied in fission yeast) with a brief sentence in the abstract. Similarly, please consider writing a brief concluding sentence summarizing the paper's most significant finding referring to the knowledge gap would provide a clearer takeaway message for the reader - the abstract ends abruptly without any conclusion.

      Reply: We agree and have now emphasized the critical gap in our abstract:

      "As MLP formation remains understudied in fission yeast compared to budding yeast, we aimed to narrow this gap." at lines 18-19.

      Additionally, we added the following final sentence to give the reader a clearer takeaway message:

      "Our findings provide a comprehensive genetic survey of MLP formation in fission yeast, and a functional description of a causal mutation that drives MLP formation in nature." at lines 31-32.

      5. The observation that strains with adhesive phenotypes have a lower growth rate compared to non-adhesive strains is a noteworthy point (lines 532-535). This represents yet another example of this classical trade-off. This point could be emphasized in the Discussion or alongside the relevant result, with a brief speculative explanation for this phenomenon.

      Reply: We agree that the nature of the trade-off between MLP formation and growth is an interesting discussion point that could arise from our work. Understanding this trade-off is made more complicated by the fact that growth is always condition-dependent, and measuring growth in strains exhibiting MLPs is non-trivial, as adhesion to labware and thick clumps of cells separated by regions of cell-free media can add variability. Nonetheless, there has been some previous work on this problem. In S. cerevisiae, it was shown that larger group size correlates with slower growth rate (3), and that flocculating cells grow more slowly (4). In S. cerevisiae, cAMP, a signalling molecule heavily involved in regulating growth in response to nutrient availability, also regulates filamentation (5). However, the relationship between flocculation and slow growth is not consistent in the literature. In some settings overexpressing the flocculins FLO8, FLO5, and FLO10 results in slower growth (6), while in others it does not (7). In addition, ethanol production has been shown to improve for biofilms (7).

      Furthermore, in S. cerevisiae, MLP-forming cells grow better in low sucrose concentrations (8) and under various stress conditions (4). Flocculating cells have also shown faster fermentation in media containing common industrial bioproduction inhibitors, despite slower fermentation than non-flocculating cells in non-inhibitory media (9). However, any consequence of this possible advantage on growth has not been characterised.

      In S. pombe, there is less work on this topic; however, it has been shown that deletions of rpl3201 and rpl3202, which code for ribosomal proteins, cause flocculation and slow growth (10). In that case, it is not clear if there is any causal relationship between slow growth and flocculation or if they are both parallel consequences of the ribosomal pathway disruption. We have added some of these points to the portion of the discussion that discusses this tradeoff (Lines 477-499).

      To get a better understanding of this tradeoff in our system, we took several approaches. First, we added a supporting analysis (New Supplementary Figure 12B), using published growth data based on measurements on agar plates for the S. pombe gene deletion library (11). There, the authors defined a set of deletion strains that grow more slowly on EMM than the wild-type lab strain. We found that our MLP hit strains were significantly enriched in this "EMM-slow" category. This information is now included in the manuscript (Lines 409-413, New Supplementary Figure 12B).
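
      The response does not state which statistical test underlies the enrichment claim. One common choice for asking whether a set of hits is over-represented in a category is a one-sided hypergeometric test, sketched below under that assumption. The function `hypergeom_enrichment_p` and every count in the example are hypothetical and are not taken from the manuscript.

```python
from math import comb


def hypergeom_enrichment_p(n_total, n_category, n_hits, n_overlap):
    """One-sided hypergeometric p-value, P(X >= n_overlap).

    n_total:    number of strains tested (the background)
    n_category: background strains in the category (e.g. slow-growing)
    n_hits:     number of hit strains drawn from the background
    n_overlap:  hit strains that fall in the category
    """
    upper = min(n_category, n_hits)
    denom = comb(n_total, n_hits)
    tail = sum(
        comb(n_category, k) * comb(n_total - n_category, n_hits - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denom


# Hypothetical counts, for illustration only.
p_value = hypergeom_enrichment_p(
    n_total=3000, n_category=300, n_hits=30, n_overlap=12
)
print(f"one-sided enrichment p-value: {p_value:.3g}")
```

      A small p-value under this model would indicate that more hits fall in the slow-growth category than expected if hits were drawn at random from the tested strains.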

      It is, however, possible that for the assays from that work, the appearance of slow growth on solid agar in adhesive cells could be partially artifactual. Indeed, we have observed that adhesive cells tend to stick to flasks and, when grown on agar plates, cells in the same colony can stick to one another rather than to inoculation loops or pin pads. Both of these dynamics can reduce initial inoculation densities. This is less of a concern for our adhesion assay and Figures 2E, 5B, and 5F, because our before-wash intensity was measured with a 7×7 pinned square of about 10 × 10 mm². Nonetheless, as we wanted to make a point about srb10 and srb11 mutants growing faster than other deletion mutants that exhibit MLP-formation, we also conducted growth assays in liquid media (New Figure 5F).

      We observed that srb10Δ and srb11Δ strains (which exhibit MLPs in EMM) show growth curves similar to wild-type cells in minimal (EMM) and rich media (YES). On the other hand, other strains that grow similarly to wild type cells in YES, such as tlg2Δ and rpa12Δ, grow much more slowly in EMM when they clump together. There are also some strains, mus7Δ and kgd2Δ, that grow more slowly in both YES and EMM but are only adhesive in EMM.

      The text mentions two lab strains, JB22 and JB50, displaying strong adhesion under phosphate starvation (lines 525-526), yet the data point for JB22 in Figure 2C is not labeled.

      Reply: We agree that highlighting JB22 on the figure is crucial, given that it was mentioned in the main text. JB22 is now highlighted in green on Fig 2C.

      7. Although I generally avoid commenting on formatting, I found the manuscript to be dense. As mentioned above, I truly enjoyed reading it! But I couldn't help but think of ways to make the manuscript more concise for readers. The Results section spans nine pages (excluding figure captions), and the Discussion is five pages long. The main text contains 6 figures with approximately 27 panels and 32 plots and Venn diagrams, while the supplementary material has 11 figures with 22 panels and about 59 plots. Altogether, the manuscript comprises 17 figures, 49 panels, and roughly 91 plots and Venn diagrams! While I will not request any changes, I encourage the authors to consider streamlining the text/data where possible to focus on the core theme of the study.

      Reply: We thank the reviewer for these suggestions and have reorganised some of our figures and text to appear less dense. We have also added several figures and panels in response to reviewer comments. While we endeavor to make our points clear and concise in the main figures, we believe that it is important to retain key supplementary figures so that an interested reader can evaluate the data in more detail.

      A summary of our major changes to the figures is below, and we also provide a manuscript with changes tracked for the reviewers' convenience:

      Fig 2:

      - Added Panel E in response to reviewer comments.

      Fig 3:

      - Removed axes for pfl3 and pfl7 from Fig 3C, as the point was made by the other genes displayed (mbx2, pfl8 and gsf2).
      - Replaced Fig 3D with similar data from an improved experiment in response to reviewer comments.
      - Added New Fig 3F from Original Supp Fig 5.

      Fig 5:

      - Moved Original Fig 5A to New Supp Fig 10A.
      - Added New Fig 5F in response to reviewer comments.

      Original Supp Fig 4 / New Supp Fig 6:

      - Removed mbx2 overexpression images from Original Fig 4C, to be replaced by new overexpression data and images in New Fig 3D.
      - Added flask images for srb10 and srb11 deletion mutants from Original Supp Fig 5A to New Supp Fig 6C.
      - Added microscope image for srb11 deletion mutant from Original Supp Fig 5A to New Supp Fig 6C.
      - Added adhesion assay results from Original Supp Fig 5C to New Supp Fig 6C.
      - Added New Supp Fig 6D in response to review.

      Original Supp Fig 5:

      - Removed this figure. Original Supp Fig 5A and 5B were moved to New Supp Fig 6. Original Supp Fig 5B was removed to make the manuscript more concise.

      Original Supp Figs 6, 7 and 8 were combined into New Supp Fig 8:

      - Original Supp Fig 6A and 6B are now New Supp Fig 8A and 8B.
      - Original Supp Fig 7 is now New Supp Fig 8C.
      - Original Supp Fig 8A is now New Supp Fig 8D and 8E.
      - Original Supp Fig 8B is now New Supp Fig 8F.

      Original Supp Fig 9 / New Supp Fig 10:

      - Added Original Fig 5A as New Supp Fig 10A.

      Original Supp Fig 11 / New Supp Fig 12:

      - Removed Original Fig 11B and the relevant text to make the manuscript more concise.
      - Added New Supp Fig 12B in response to reviewer comments.

      New Supplementary Figures added in response to reviewer comments:

      - New Supp Fig 4: Microscopy images of natural isolates.
      - New Supp Fig 5: Flask images of natural isolates.
      - New Supp Fig 7: Microscopy and flask images of mbx2 overexpression strains.
      - New Supp Fig 9: Genomic comparisons between JB759 and the MLP-forming wild isolate, JB914.

      Removed some less relevant points from our discussion, to reduce the length.

      Added new Supplementary Tables:

      - Supplementary Table 13: Variants in candidate genes. Added in response to reviewer comments.
      - Supplementary Table 14: List of plasmids used in the study.

      **Referees cross-commenting**

      There are many useful recommendations from all the other reviewers that will help improve the final product. Once those points are revised, I think this will be a nice paper of interest to folks interested in natural variation in MLPs and its genetic background.

      Significance

      My expertise: evolutionary genetics, evolution of multicellularity, yeast genetics, experimental evolution

      Overall, the manuscript is well-written, dense yet logically structured, and the figures are well presented. The combination of phenotypic, genetic, and bioinformatics analyses, particularly from wet lab experiments, is commendable. The study addresses a significant gap in our understanding, primarily explored in budding yeast, by providing comprehensive data on MLP diversity in fission yeast and the interplay of genetic and environmental factors.

      In summary, I enjoyed reading the manuscript and have only a few minor suggestions to strengthen the paper.

      Reviewer #2

      Evidence, reproducibility and clarity

      REVIEWER COMMENTS

      Yeast species, including fission yeast and budding yeast, could form multicellular-like phenotypes (MLP). In this work, Kӧvér and colleagues found most proteins involved in MLP formation are not functionally conserved between S. pombe and budding yeast by bioinformatic analysis. The authors analyzed 57 natural S. pombe isolates and found MLP formation to widely vary across different nutrient and drug conditions. The authors demonstrate that MLP formation correlated with expression levels of the transcription factor gene mbx2 and several flocculins. The authors also show that Cdk8 kinase module and srb11 deletions also resulted in MLP formation. The experimental design is logical, the manuscript is well-written and organized. I have a few concerns that should be addressed before the publication.

      Major points:

      1) Line 61-62, how did the authors grow yeast cells in the liquid medium? Shaking or static? If shaking, the nutrient should be evenly distributed in the medium.

      If static culture, most single yeast cells could precipitate on the bottom, how do you address the advantage of flocculation for increasing the sedimentation? In addition, under static culture, the bottom will have less air than the up medium, how to balance the air and nutrients?

      Reply: In line 61-62 we stated that "Similarly, flocculation could increase sedimentation in liquid media, thereby assisting the search for more nutrient-rich or less stressful environments (4)".

      Our intent was to speculate on the advantages of multicellular-like growth, and cited a review article which has mentioned sedimentation. After further consideration, we decided that this is a minor point and is rather speculative, and removed it altogether from the manuscript.

      In response to the Reviewer's question about how cells were grown in liquid medium, throughout the paper we used shaking cultures for our flocculation assays and for pre-cultures. We have made this more clear in the text where it was ambiguous (e.g. line 189, throughout the methods section, and in the legend of Fig. 2A).

      2) Line 555, it will be interesting to test whether overexpression of mbx2 could cause flocculation in YES medium. In Figure 3D, the authors use two control strains, but only one mbx2 OE strain, mbx2 OE should be tested in both strains. In addition, did the authors transform empty plasmid into the control strains, please indicate in the figure.

      Reply: In this experiment, mbx2 was overexpressed using a thiamine-repressible nmt1 promoter, which is a standard construct in fission yeast studies. Assaying MLP formation was not feasible in YES with this strain, because YES is a rich medium made up of yeast extract which contains thiamine. Thus, we could not remove thiamine from the media to trigger mbx2 overexpression.

      In order to test the influence of mbx2 overexpression in YES, we constructed strains in which mbx2 was integrated into the genome and expression was driven by the rpl2102 promoter, which has been shown to provide constitutive moderate expression levels (12). We observed strong flocculation in both EMM and YES (Fig 3D, New Supplementary Figure 7). We did not see strong flocculation in a control in which GFP was expressed under the rpl2102 promoter. The flocculation phenotype was so strong that our original adhesion assay protocol required modification for this experiment, including resuspension in 10 mM EDTA before repinning (Methods). We observed strong adhesion for the mbx2 overexpression strains (Fig 3D), but not for control strains in YES. We could not check adhesion in EMM for those strains because cells pinned on EMM did not survive resuspension in EDTA.

      We performed these experiments in two backgrounds, 968 h90 (JB50), which is one of the parental strains of the segregant library analysed in Figure 3 and 972 h- (JB22), which is an appropriate background for the gene deletion collection.

      We have replaced the data from the original Figure 3D with the new adhesion assay and added New Supplementary Figure 7 to the manuscript (Lines 236-244).

      This result also helped us to further refine our model for the pathway. We can now say that the repression of MLPs in rich media must act via Mbx2, as overexpression of mbx2 is sufficient to abolish it, and is likely to act transcriptionally (if it acted on the protein level, the mild overexpression would likely not have led to the phenotype) (Figure 6, Lines 554-556 in the discussion).

      3) Line 600-601, the authors may do the backcross of srb11Δ::Kan to exclude the possibility caused by other mutations.

      Reply: We thank the reviewer for noticing our concern about suppressor mutations arising in the srb11Δ strain obtained from our deletion library. This initial concern arose following the observation that while qualitatively the srb11Δ::Kan and srb11Δ(CRISPR) strains were both strongly adhesive, there was a minor quantitative difference in their adhesion.

      As we obtained this strain from an h+ deletion library strain backcrossed with a prototrophic h- strain (JB22) in order to restore auxotrophies (13), the chances for a suppressor mutation to arise are very low. We have therefore removed that language from our text. We now suspect that a more likely explanation for this small difference could be the strain background, as our CRISPR engineered strain was made in a JB50 background which has the h90 mating type, while the deletion library strains are h- without auxotrophic markers.

      We would like to emphasize, however, that despite this quantitative difference in the adhesion phenotype between the two srb11Δ strains, they both have a large increase in the adhesion phenotype relative to the respective wild-type strains. To address this point, we have removed the unnecessary statistical comparison of these two deletion strains and focused on their qualitatively high levels of adhesion in the text (lines 267-269) and in our Revised Supplementary Figure 6D.

      Minor points:

      1) Line 506, what are the growth conditions of cells in Figure 2A? Did the authors use the liquid or solid medium? Please mention in the Methods or figure legends.

      Reply: We have updated the manuscript to include the relevant details in the text (line 189), figure caption for Fig. 2A and in the methods section (lines 829-831).

      2) Line 533-535, please explain why the strains exhibiting strong adhesion have a decreased growth rate. Is there any related research? Please add some references.

      Reply: Please see reply to Reviewer 1, comment 5.

      **Referees cross-commenting**

      I agree with most of the comments from other reviewers. This publication may indeed be of interest to a minor area. But the results and the interpretations of the data are interesting and warranted, the findings are scientifically important.

      Significance

      The authors did many large-scale screens and bioinformatic analyses. The experiments in the manuscript are generally logical and sound. This study is useful for deciphering the mechanism of multicellular-like phenotype formation in the fission yeast, with some implications for some other organisms.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      Summary: Using a variety of targeted and genome wide analyses, the authors investigate the basis for "multicellular-like phenotypes" in S. pombe. Authors developed several methodologies to detect and quantify "multicellular-like phenotypes" (flocculation, aggregation...) and defined genes involved in these processes in laboratory and wild S. pombe.

      SECTION A - Evidence, reproducibility and clarity

      This is a very solid manuscript that is well-written and supported by convincing data. While one can imagine many additional experiments, the manuscript stands on its own and presents a quite exhaustive analysis of the area. I commend the authors for their rigorous work and clear presentation. There are only a few minor points that warrant comments or corrections: - Supplementary Figure 1 is a typical example of the "necessity" to have statistics and P-values everywhere. The data are convincing but what is the evidence that the Filtering assay and the Plate-reader assay values should be linearly related? Let's imagine that the Plate-reader assay value is proportional to the square of the Filtering assay value. What would be the Pearson R and P-value in this case? What is most appropriate? Why would one use a linear correlation? What is the "real" significance?

      Reply: We thank the reviewer for pointing out that the data in Supplementary Figure 1 do not appear to be linear and that, therefore, the Pearson correlation coefficient may not be the best way to represent the relationship between the two assays. The nonlinear nature of these data could indicate that (1) the filtering assay saturates before the plate reader assay and is less able to distinguish between strains that flocculate strongly, and (2) the filtering assay may be more sensitive for strains that show lower levels of flocculation. In general, we observed fewer strains with intermediate phenotypes for both assays, making it difficult to ascertain the true relationship between them; however, we believe that the key result is that the strains with the highest level of flocculation have the highest values in both assays. To capture this aspect of the data, we now report the Spearman correlation, which is non-parametric and indicates how similar the ranking of each strain is based on both assays. With the alternative hypothesis being that the correlation is > 0, we report a Spearman correlation coefficient of 0.24 and a P-value of 0.04 (lines 823-826).
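The reviewer's thought experiment (plate-reader value proportional to the square of the filtering value) can be sketched numerically. This is a minimal illustration in plain Python, not the authors' analysis: the `filtering` values are hypothetical and the helper functions are written for this example only. It shows that a perfectly monotonic but nonlinear relationship gives a Spearman coefficient of 1, while the Pearson coefficient falls below 1.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical assay values (assumption, not data from the manuscript).
filtering = [0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
# The reviewer's scenario: plate-reader value is the square of the filtering value.
plate_reader = [v ** 2 for v in filtering]

pearson_r = pearson(filtering, plate_reader)
spearman_rho = spearman(filtering, plate_reader)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

Because Spearman depends only on rank order, it is unaffected by any monotonic transformation of either assay, which is the property the reply relies on.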

      • Minor points: * There are several "personal communications" in the manuscript (page 11, page 18, page 23). It should be checked whether this is accepted in the journal that publishes this manuscript.

      Reply: We thank the reviewer for highlighting this issue. We had three instances of "personal communications" in our original submission.

      The first instance was an acknowledgement for advice on our DNA extraction protocol from Dan Jeffares. We now include this in the Acknowledgements section instead.

      The second communication with Angad Garg described that they observed flocculation while growing cells in phosphate starvation conditions, which was not reported in their publication (14). Though we appreciate their willingness to share unpublished data with us, we have removed this observation from our manuscript and instead rely only on our own observations and arguments based on their published RNA-seq data to make our point.

      The third personal communication with Olivia Hillson supplements a minor hypothesis, namely that deletion of SPNCRNA.781 might cause MLP formation by affecting the promoter of hsr1, for which we had access to unpublished ChIP-seq data, showing its binding to flocculins. Recently published work from a different group (15) also suggests this link between hsr1 and flocculation and is now discussed in our manuscript instead of the result based on unpublished data obtained from personal communication at Lines 397-398.

      * Page 4 check "a few regulators"

      Reply: For clarity, this has now been changed to "several regulatory proteins" at Line 108. The specific proteins we are referring to are highlighted in Figure 1C.

      * Page 19, line 567: "remaining 8 strains" may be confusing as Material and Methods states "remaining 10 strains".

      Reply: Two of the 10 strains were found to be redundant after sequencing as explained in the Methods (Lines 930-934). Therefore, we only added 8 new strains to the analysis. We thank the reviewer for highlighting this as a potential source of misunderstanding, and clarified this point in the text (Lines 247-250 and in the methods).

      **Referees cross-commenting**

      I concur with most comments. Overall, the reviewers agree that this is a solid piece of work that could benefit from minor modifications and should be published. I reiterate that, for me, despite its quality, this publication will only be of interest to specialists.

      Reviewer #3 (Significance (Required)):

      A limited number of studies have investigated "multicellular-like phenotypes" in S. pombe. This manuscript therefore brings new and solid information. Yet, despite an impressive amount of work, our conceptual advance in understanding this process and its phylogenetic conservation remains limited. This is probably best illustrated in Figure 6, which summarizes the study and contains 3 question marks and an additional unknown mechanism. (Most of the solid arrows in this figure correspond to interactions within the Mediator complex that were well known before this study.) In addition, while only a few studies have been published in this area, the authors' findings often only bring additional support to already published observations. Overall, while this manuscript will be of interest to a restricted group of aficionados, it will most likely not attract the attention of a wide readership.

      Reviewer #4 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, the authors explore how multicellular-like phenotypes (MLPs) arise in the fission yeast S. pombe. Although yeasts are characterized as unicellular fungi, diverse species show MLPs, including filamentous growth on agar plates and flocculation in liquid media. MLPs may provide certain advantages in nutritionally poor conditions and protection against external challenges, upon which natural selection can then act. Previous work on MLPs has mostly been carried out in the budding yeasts S. cerevisiae and C. albicans, and little was known about these behaviors in S. pombe. The authors thus set out to investigate both genetic and environmental regulators of MLP formation.

      First, their analysis of published data revealed a limited number of shared regulators of MLP between S. pombe, S. cerevisiae, and C. albicans, although the cell adhesion proteins themselves are largely not conserved. Next, the authors screened a set of non-clonal natural isolates using two high-throughput assays that they developed and found that MLPs vary in strains and depending on nutrient conditions. Focusing on a natural isolate that showed both adhesion on agar plates and flocculation in liquid medium, they then analyzed a segregant library generated from this and a laboratory strain using their assays. Using QTL analysis, they uncovered a frameshift in the srb11 gene, which encodes a subunit of the Mediator complex, as the likely causal inducer of MLP. This was confirmed by additional analyses of strains lacking srb11 or other members of Mediator. Furthermore, the authors showed that loss of srb11 function resulted in the upregulation of the Mbx2 transcription factor, which was both necessary and sufficient for MLP formation in this background. Finally, screening of two additional yeast strain collections (gene and long intergenic non-coding RNA deletion) identified both known and novel regulators representing different pathways that may be involved in MLP formation.

      Altogether, this study provides new perspectives into our understanding of the diverse inputs that regulate multicellular-like phenotypes in yeast.

      Major comments:

      • The methods for screening for adhesion and flocculation are well described, with representative figures that show plates and flasks. However, there are few microscopy images of cells, and it would be interesting and helpful for the reader to have an idea of how cells look when they exhibit MLPs. For instance, are there any differences in cell shape or size when strains present different degrees of adhesion or flocculation? In addition, the authors mention that mutants with strong adhesion generally had lower colony density and are likely to be slower growing. Although their analyses suggest otherwise (page 22), this has a potential for introducing error in their observations, and including images of the adhesion/flocculation phenotypes may provide further support for their conclusions. I suggest that the authors present microscopy images 1) similar to what is shown for JB759 in Figure 2A and 2) of cells growing on agar in the adhesion assay. This could be included for the different Mediator subunit deletions that they tested, where there appear to be varying phenotypes. It could also be informative for a subset of the 31 high-confidence candidates that they identified in their screen.

      Reply: We thank the reviewer for highlighting the need for further microscopic characterisation of MLP forming strains. We therefore now include images of JB914, JB953 (New Supplementary Figures 4, Figure 2E) in liquid media in EMM, EMM-Phosphate, and YES; an srb11 deletion strain (Figure 3F), and mbx2 overexpression strains (New Supplementary Figure 7).

      • Upon identifying a frameshift in srb11 that is responsible for the MLP, the authors assessed whether deletion of other Mediator subunits would result in the same phenotype. They found that srb10 and srb11 deletions both flocculate and show adhesion, while other mutants had milder phenotypes. However, the authors also found that a new deletion of srb11 that they generated had a stronger adhesion phenotype than the srb11 deletion from the prototrophic deletion library, which was attributed to the accumulation of suppressor mutations in the strains of the deletion collection. As the authors make clear distinctions between the phenotypes of different Mediator mutants, I suggest generating and analyzing "clean" deletions of the 6 other subunits that they tested. This would strengthen their conclusion and help to rule out accumulated suppressors as the cause of the differences in the observed phenotypes.

      Reply: We thank the reviewer for noticing our concern about suppressor mutations in the manuscript. As we describe above in response to a similar question from reviewer 2, because the prototrophic deletion library from which we extracted the Mediator deletion strains had been backcrossed during its construction (13), we no longer suspect that the small difference between the srb11Δ::Kan strain from the deletion library and the newly created srb11Δ (CRISPR) strain is due to suppressor mutations. Rather, we think it may be a result of the difference in genetic background and possibly mating type between the two strains. We also want to emphasize that this difference is small compared to the difference between the adhesion ratios of the srb11Δ strains and their respective control strains.

      Nevertheless, we made clean, independent Mediator mutants for 5 out of 6 Mediator genes tested (med10Δ, med13Δ, med19Δ, med27Δ, and srb10Δ) as well as an additional mutant that we didn't have in our library, med12Δ (Figure R9). When running the assay on these new strains we got an overall lower dynamic range, possibly due to variations in the water flow rate relative to the first assay. However, we saw a strong phenotype for both the library and our own srb10Δ strains and for the CRISPR srb11Δ strain. We did not see a significant increase in adhesion for the other Mediator deletion mutants in EMM relative to wild type, with the exception of med10Δ in both the library strain and our clean mutant, for which we did not observe a phenotype in our previous experiment. We included the experiment for the newly created mutants as New Supplementary Figure S6E and described them in lines 276-281 in our revised manuscript.

      Minor comments:

      • One point that recurs in the manuscript is the idea that mutations that give rise to strong MLPs also generally lead to slower growth, representing a potential trade-off. This idea could be reinforced with measurements of growth rate or generation time by optical density or cell number, for instance, rather than comparisons of colony density. Also, it would be interesting to mention if the slow growth phenotype is only observed in MLP-inducing conditions or also in rich medium.

      Reply: As described above in response to item 5 from Reviewer 1, we have conducted growth assays in liquid media for srb10Δ, srb11Δ, and other mutants from our adhesion screen (tlg2Δ, rpa12Δ, mus7Δ and kgd2Δ) that showed a similar phenotype to those genes in both minimal (EMM) and rich (YES) media. We observe that in rich media, srb10Δ and srb11Δ cells grow similarly to control strains, and they exhibit a smaller decrease in growth rate than the other similarly adhesive strains. Both mus7Δ and kgd2Δ cells grow more slowly, even in rich media.

      We have also added data on the tradeoff between growth and adhesion based on growth on solid media from (11) for all mutants identified in our screen (New Supp Fig 12B).

      Thus, the relationship between slow growth and clumpiness depends on the mutation, and specifically, mutations of the Mediator, including those to srb11 and srb10, seem to decrease the impact of any tradeoff between growth and adhesion.

      • The authors show that the MLPs of the srb10 and srb11 deletions occur through mbx2 upregulation. Do the varying strengths of the phenotypes of the strains lacking different Mediator subunits correlate with mbx2 levels in these backgrounds?

      Reply: There is some evidence from previous work that the relationship between the strength of the MLPs and the expression of mbx2 may not be perfectly proportional. In (16), med12Δ had a higher (though qualitatively comparable) level of mbx2 upregulation than srb10Δ (New Supp Fig 8E), even though that paper reported a milder phenotype for med12Δ than for srb10Δ cells. We did not observe a significant increase in adhesion in our med12Δ strain (New Supp Fig 6D). This suggests that in the case of these mutants, it is not simply the level of mbx2 that controls MLP formation, but that there are likely additional regulatory mechanisms. We have added some discussion on this context in the manuscript (lines 545-547).

      **Referees cross-commenting**

      I agree overall with the comments and suggestions from the other reviewers. The revision would require only minor modifications. The paper is interesting both for the combination of methodologies used and its findings, and I believe that it would benefit a growing community of researchers.

      Reviewer #4 (Significance (Required)):

      This study employed a variety of methods that allowed the authors to uncover previously unknown regulators of MLPs. Taking advantage of the diversity of natural fission yeast isolates as well as the constructed gene and non-coding RNA deletion collections, the authors identified novel genetic determinants that give rise to MLPs, opening new avenues into this exciting area of research. The overall conclusions of the work are solid and supported by the reported results and analyses. This study will be appreciated by a broad audience of readers who are interested in understanding how organisms respond to environmental challenges as well as how MLPs may result in emergent properties that play key roles in these responses. Some of the limitations of the work are described above, with recommendations for addressing these points.

      Keywords for my field of expertise: fission yeast, cell cycle, transcription, replication.

      References for Response to Reviews

      1. Brysch-Herzberg M, Jia GS, Seidel M, Assali I, Du LL. Insights into the ecology of Schizosaccharomyces species in natural and artificial habitats. Antonie Van Leeuwenhoek. 2022 May 1;115(5):661-95.
      2. Jeffares DC, Rallis C, Rieux A, Speed D, Převorovský M, Mourier T, et al. The genomic and phenotypic diversity of Schizosaccharomyces pombe. Nat Genet. 2015 Mar;47(3):235-41.
      3. Ratcliff WC, Denison RF, Borrello M, Travisano M. Experimental evolution of multicellularity. Proc Natl Acad Sci. 2012 Jan 31;109(5):1595-600.
      4. Smukalla S, Caldara M, Pochet N, Beauvais A, Guadagnini S, Yan C, et al. FLO1 is a variable green beard gene that drives biofilm-like cooperation in budding yeast. Cell. 2008 Nov 14;135(4):726-37.
      5. Lorenz MC, Heitman J. Yeast pseudohyphal growth is regulated by GPA2, a G protein alpha homolog. EMBO J. 1997 Dec 1;16(23):7008-18.
      6. Ignacia DGL, Bennis NX, Wheeler C, Tu LCL, Keijzer J, Cardoso CC, et al. Functional analysis of Saccharomyces cerevisiae FLO genes through optogenetic control. FEMS Yeast Res. 2025 Sept 24;25:foaf057.
      7. Wang Z, Xu W, Gao Y, Zha M, Zhang D, Peng X, et al. Engineering Saccharomyces cerevisiae for improved biofilm formation and ethanol production in continuous fermentation. Biotechnol Biofuels Bioprod. 2023 July 31;16(1):119.
      8. Koschwanez JH, Foster KR, Murray AW. Improved use of a public good selects for the evolution of undifferentiated multicellularity. eLife. 2013 Apr 2;2:e00367.
      9. Westman JO, Mapelli V, Taherzadeh MJ, Franzén CJ. Flocculation Causes Inhibitor Tolerance in Saccharomyces cerevisiae for Second-Generation Bioethanol Production. Appl Environ Microbiol. 2014 Nov;80(22):6908-18.
      10. Li R, Li X, Sun L, Chen F, Liu Z, Gu Y, et al. Reduction of Ribosome Level Triggers Flocculation of Fission Yeast Cells. Eukaryot Cell. 2013 Mar;12(3):450-9.
      11. Rodríguez-López M, Bordin N, Lees J, Scholes H, Hassan S, Saintain Q, et al. Broad functional profiling of fission yeast proteins using phenomics and machine learning. Marston AL, James DE, editors. eLife. 2023 Oct 3;12:RP88229.
      12. Hebra T, Smrčková H, Elkatmis B, Převorovský M, Pluskal T. POMBOX: A Fission Yeast Cloning Toolkit for Molecular and Synthetic Biology. ACS Synth Biol. 2024 Feb 16;13(2):558-67.
      13. Malecki M, Bähler J. Identifying genes required for respiratory growth of fission yeast. Wellcome Open Res. 2016 Nov 15;1:12.
      14. Garg A, Sanchez AM, Miele M, Schwer B, Shuman S. Cellular responses to long-term phosphate starvation of fission yeast: Maf1 determines fate choice between quiescence and death associated with aberrant tRNA biogenesis. Nucleic Acids Res. 2023 Feb 16;51(7):3094-115.
      15. Ohsawa S, Schwaiger M, Iesmantavicius V, Hashimoto R, Moriyama H, Matoba H, et al. Nitrogen signaling factor triggers a respiration-like gene expression program in fission yeast. EMBO J. 2024 Oct 15;43(20):4604-24.
      16. Linder T, Rasmussen NN, Samuelsen CO, Chatzidaki E, Baraznenok V, Beve J, et al. Two conserved modules of Schizosaccharomyces pombe Mediator regulate distinct cellular pathways. Nucleic Acids Res. 2008 May;36(8):2489-504.
    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewing Editor Comment:

      The reviewers felt that the study could be improved by (1) better integrating the results with the existing literature in the field

      (1) In the Introduction and Results section of the manuscript, we had made every attempt to cite the relevant literature. (Reviewer 1 stated that “The literature is appropriately cited”). We agree with the Reviewing Editor that rather than simply cite the relevant literature, we could have done a better job of integrating our findings with what has been previously discovered by others. We have attempted to do this in the revised manuscript. Also, we have included many additional citations in the Introduction and in the first section of the Results where work by others has provided a framework for interpreting our single-cell studies.

      and (2) manipulating Trib expression and analyzing the expression of 1-2 HIX genes.

      (2) We are grateful for this suggestion. As suggested by the Reviewing Editor we have attempted to increase and decrease trbl expression and assess the effect on expression of two genes, Swim and CG15784.

      We increased trbl levels in the wing pouch using rn-Gal4, tub-Gal80<sup>ts</sup> and UAS-trbl. By transferring larvae for 24 h from 18°C to 31°C, we were able to induce trbl expression in the wing pouch. When these larvae were irradiated at 4000 rad, we found reduced levels of apoptosis in the wing pouch of discs that overexpressed trbl (Figure 7-figure supplement 1). This indicated that upregulation of trbl is radioprotective. Consistent with our findings, others have previously shown that upregulation of trbl and stalling in the G2 phase of the cell cycle protects cells from JNK-induced apoptosis (Cosolo et al., 2019, PMID:30735120) or that downregulating the G2/M progression promoting factor string protects cells from X-ray radiation induced apoptosis (Ruiz-Losada et al., 2021, PMID:34824391).

      As suggested by the Reviewing Editor, we also examined the effect of trbl overexpression on the induction of two “highly induced by X-ray irradiation (HIX)” genes, Swim and CG15784. Increasing trbl expression had no effect on the induction of Swim and caused only a modest decrease in the induction of CG15784 (Figure 7-figure supplement 2). Thus, increasing trbl expression is, in itself, insufficient to promote HIX gene expression, indicating that other factors are necessary for HIX gene induction.

      We also attempted to reduce trbl expression, using three different RNAi lines. While some of these lines have been used previously by others to reduce trbl expression under unirradiated conditions (Cosolo et al., 2019, PMID:30735120), we nevertheless wanted to check if they reduced trbl induction following irradiation. For each of the three lines, we observed no obvious reduction in trbl RNA following irradiation when visualized using HCR (Author response image 1). Thus, any effects on gene expression that we observe could not be attributed to a decrease in trbl expression. We have therefore included the images showing a lack of knockdown in this Response to Reviews document but not included these experiments in the revised manuscript.

      Author response image 1.

      RNA in situ hybridizations using the hybridization chain reaction performed using probes to trbl. In A-F, the RNAi is expressed using nubbin-Gal4. In G-I the RNAi is expressed using rn-Gal4, tub-Gal80<sup>ts</sup>. white-RNAi was used as a control (A, B, G, H). Three different RNAi lines directed against trbl were tested: Vienna lines VDRC 106774 (C, D) and VDRC 22113 (E, F), and Bloomington line BL42523. In no case was a reduction in trbl RNA upregulation in the wing pouch following 4000 rad observed, except for one disc (n = 6) of VDRC 106774 crossed to nubbin-Gal4.

      Reviewer #1 (Public review):

      Summary:

      The authors analyze transcription in single cells before and after 4000 rads of ionizing radiation. They use Seurat v5 for their analyses, which allows them to show that most of the genes cluster along the proximal-distal axis. Due to the high heterogeneity in the transcripts, they use the Herfindahl-Hirschman index (HHI) from Economics, which measures market concentration. Using the HHI, they find that genes involved in several processes (like cell death, response to ROS, DNA damage response (DDR)) are relatively similar across clusters. However, ligands activating the JAK/STAT, Pvr, and JNK pathways and transcription factors Ets21C and dysf are upregulated regionally. The JAK/STAT ligands Upd1,2,3 require p53 for their upregulation after irradiation, but the normal expression of Upd1 in unirradiated discs is p53-independent. This analysis also identified a cluster of cells that expressed tribbles, encoding a factor that downregulates mitosis-promoting String and Twine, that appears to be G2/M arrested and expressed numerous genes involved in apoptosis, DDR, the aforementioned ligands, and TFs. As such, the tribbles-high cluster contains much of the heterogeneity.
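For readers unfamiliar with the HHI mentioned in this summary, the sketch below shows the standard definition: the sum of squared shares. It assumes, for illustration only, that each share is the fraction of a gene's total expression found in one cell cluster; the manuscript's exact inputs and any normalization are not given here, and the function name `hhi` and the example numbers are hypothetical.

```python
def hhi(expression_by_cluster):
    """Herfindahl-Hirschman index: sum of squared shares across clusters.

    Ranges from 1/n (expression spread evenly over n clusters)
    to 1 (all expression concentrated in a single cluster).
    """
    total = sum(expression_by_cluster)
    if total <= 0:
        raise ValueError("total expression must be positive")
    shares = [value / total for value in expression_by_cluster]
    return sum(share * share for share in shares)


# Hypothetical per-cluster expression values for two genes (not real data).
uniform_gene = hhi([10, 10, 10, 10])   # evenly expressed across 4 clusters
regional_gene = hhi([97, 1, 1, 1])     # expression concentrated in 1 cluster

print(f"uniform: {uniform_gene:.4f}, regional: {regional_gene:.4f}")
```

Under this definition a gene induced uniformly across clusters scores near the lower bound, while a regionally induced gene scores close to 1.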

      Strengths:

      (1) The authors have used robust methods for rearing Drosophila larvae, irradiating wing discs, and analyzing the data with Seurat v5 and HHI.

      (2) These data will be informative for the field.

      (3) Most of the data is well-presented.

      (4) The literature is appropriately cited.

      We thank the reviewer for these comments.

      Weaknesses:

      (1) The data in Figure 1 are single-image representations. I assume that counting the number of nuclei that are positive for these markers is difficult, but it would be good to get a sense of how representative these images are and how many discs were analyzed for each condition in B-M.

      For each condition at least 5 discs were imaged, and in some cases up to 15. We tried to choose a representative disc for each condition after looking at all of them. All discs imaged under each condition are shown below; the disc chosen for the figure is indicated with an asterisk. All scale bars are 100 µm.

      Author response image 2.

      Images for discs shown in Manuscript Figure 1, panels B, C

      Author response image 3.

      Images for discs shown in Manuscript Figure 1, panels D, E

      Author response image 4.

      Images used in Manuscript Figure 1, F, G

      Author response image 5.

      Images used in Manuscript Figure 1H, I

      Author response image 6.

      Images used in Manuscript Figure 1J, K

      Author response image 7.

      Images used in Manuscript Figure 1L, M

      (2) Some of the figures are unclear.

      It is unclear to us exactly which figures the Reviewer is referring to. Perhaps this is the same issue mentioned below in “Recommendations for the authors”. We address it below.

      Reviewer #1 (Recommendations for the authors):

      (1) Regarding Figure 1, what is stained in blue? Is it DAPI? If so, this should be added to the figure legend.

      Thank you for pointing out this omission. This has been addressed in the revised manuscript.

      It is very difficult to see blue on black, so could the authors please outline the discs?

      Alternatively, they could show DAPI in green and the markers (pH2Av, etc) in magenta.

      We used DAPI (blue) as a way of outlining the discs. While we appreciate the reviewer’s concern, after reviewing the images, we found that the blue is clearly visible when the document is viewed on the screen. It is less obvious if the document is printed on some kinds of printers. Since boosting this channel would make the signal from the other channels more difficult to see, we left the images as they were.

      (2) Figure 3, Figure Supplement 2, panel B. It is not possible to read the gene names in the panel's current form. Please break this up into 4 lines (as much as possible from the current 2).

      Thank you for this suggestion. We have done this in the revised manuscript.

      Reviewer #2 (Public review):

      This manuscript investigates the question of cellular heterogeneity using the response of Drosophila wing imaginal discs to ionizing radiation as a model system. A key advance here is the focus on quantitatively expressing various measures of heterogeneity, leveraging single-cell RNAseq approaches. To achieve this goal, the manuscript creatively uses a metric from the social sciences called the HHI to quantify the spatial heterogeneity of expression of individual genes across the identified cell clusters. Inter- and intra-regional levels of heterogeneity are revealed. Some highlights include the identification of spatial heterogeneity in the expression of ligands and transcription factors after IR. Expression of some of these genes shows dependence on p53. An intriguing finding, made possible by using an alternative clustering method focusing on cell cycle progression, was the identification of a high-trbl subset of cells characterized by concordant expression of multiple apoptosis, DNA damage repair, ROS-related genes, certain ligands, and transcription factors, collectively representing HIX genes. This high-trbl set of cells may correspond to an IR-induced G2/M arrested cell state.

      Overall, the data presented in the manuscript are of high quality but are largely descriptive. This study is therefore perceived as a resource that can serve as an inspiration for the field to carry out follow-up experiments.

      Thank you for your assessment of the work.

      Reviewer #2 (Recommendations for the authors):

      I suggest two major points for improvement:

      (1) It is important to test whether manipulation of trbl levels (i.e., overexpression, knockdown, mutation) would result in measurable biological outcomes after IR, such as altered HIX gene expression, altered cell cycle progression, or both. This may help disentangle the question of whether high trbl expression and correlated HIX gene expression are a cause or consequence of G2/M stalling.

      We have described these experiments at the beginning of this Response to Reviews document when addressing the comments made by the Reviewing Editor. Please see Figure 7, figure supplements 1 and 2. These experiments suggest that upregulation of trbl offers some protection from radiation-induced death, yet it is itself insufficient to induce expression of two HIX genes tested. As we have also described earlier, three different RNAi lines tested did not reduce trbl upregulation after irradiation.

      (2) A more extensive characterization of the high-trbl cell state would also be appropriate, particularly in terms of their relationship to the cell cycle.

      We attempted to address this issue in two ways. First, we used the expression of a trbl-gfp transgene and RNA in-situ hybridization experiments to visualize the distribution of the high-trbl cells (shown in new manuscript figure, Figure 6-figure supplement 3). When examining trbl RNA in irradiated discs, there is no obvious demarcation between cells that express high levels of trbl and other cells. This is also apparent in the UMAP shown in Figure 6A and A’. Most cells seem to express trbl; cells in the “high trbl” cluster simply express more trbl than others. We observed cells expressing trbl and PCNA as well as cells expressing only one of those two genes at detectable levels. Thus, it was not possible to distinguish the “high trbl” cells from other cells by this approach.

      We decided instead to focus on examining the expression of other cell-cycle genes in the high-trbl cluster. We have added a paragraph in the Results section that details our findings. Many transcriptional changes are indeed consistent with stalling in G2, such as high levels of trbl and low levels of string (stg). Additionally, the interpretation that the cells are likely in G2 is consistent with reduced levels of genes that are normally expressed at other stages of the cell cycle: G1 genes such as E2f1 and Dp, S-phase genes such as several Mcm genes, PCNA and RnrS, and genes that encode mitotic proteins such as polo, Incenp and claspin. There are, however, several anomalies, such as slightly increased expression of the early-G1 cyclin, CycD, and the retinoblastoma ortholog Rbf. Thus, at least as assessed by the transcriptome, this cluster may not correspond to a cell state that is found under normal physiological conditions.

      (3) Minor: p. 12, line 3. Figure 5A is mentioned, but it seems that it should be 4A instead.

      Thank you for pointing this out. We have addressed this in our revisions.

      Reviewer #3 (Public review):

      Strengths:

      Overall, the manuscript makes a compelling case for heterogeneity in gene expression changes that occur in response to uniform induction of damage by X-rays in a single-layer epithelium. This is an important finding that would be of interest to researchers in the field of DNA damage responses, regeneration, and development.

      Weaknesses:

      This work would be more useful to the field if the authors could provide a more comprehensive discussion of both the impact and the limitations of their findings, as explained below.

      Propidium iodide staining was used as a quality control step to exclude cells with a compromised cell membrane. But this would exclude dead/dying cells that result from irradiation. What fraction of the total do these cells represent? Based on the literature, including works cited by the authors, up to 85% of cells die at 4000R, but this likely happens over a longer period than 4 hours after irradiation. Even if only half of the 85% are PI-positive by 4 hr, this still removes about 40% of the cell population from analysis. The remaining cells that manage to stay alive (excluding PI) at 4 hours and included in the analysis may or may not be representative of the whole disc. More relevant time points that anticipate apoptosis at 4 hr may be 2 hr after irradiation, at which time pro-apoptotic gene expression peaks (Wichmann 2006). Can the authors rule out the possibility that there is heterogeneity in apoptosis gene expression, but cells with higher expression are dead by 4 hours, and what is left behind (and analyzed in this study) may be the ones with more uniform, lower expression? I am not asking the authors to redo the study with a shorter time point, but to incorporate the known schedule of events into their data interpretation.

      We thank the reviewer for these important comments. The generation of single-cell RNA-seq data from irradiated cells is tricky. Many cells have already died. Even those that do not incorporate propidium iodide are likely in early stages of apoptosis or are physiologically unhealthy and likely made it through our FACS filters. Indeed, in irradiated samples up to 57% of sequenced cells were not included in our analysis since their RNA content seemed to be of low quality. It is therefore likely that our data are biased towards cells that are less damaged. As advised by the reviewer, we will include a clearer discussion of these issues as well as the time course of events and how our analysis captures RNA levels only at a single time point.

      If cluster 3 is G1/S, cluster 5 is late S/G2, and cluster 4 is G2/M, what are clusters 0, 1, and 2 that collectively account for more than half of the cells in the wing disc? Are the proportions of clusters 3, 4, and 5 in agreement with prior studies that used FACS to quantify wing disc cells according to cell cycle stage?

      Work by others (Ruiz-Losada et al., 2021, PMID:34824391) has shown that almost 80% of cells have a 4C DNA content 4 h after 4,000 rad X-ray irradiation. The high-trbl cluster accounts for only 18% of cells and can therefore account for only a minority of the cells with a 4C DNA content.

      Thus clusters 0, 1 and 2 could potentially contain other populations that also have a 4C DNA content. Importantly, similar proportions of cells in these clusters are also observed in unirradiated discs.

      We expect that clusters 1 and 2 are largely comprised of cells in G2/M. Together, these clusters are marked by some genes previously found to be higher in FACS separated G2 cells compared to G1 cells (Liang et al., 2014, PMID: 24684830). These genes include Det, aurA, and ana1. Strangely, cluster 0 is not strongly marked by any of the 175 cell cycle genes used in our clustering (eff being the strongest marker) and has a lower-than-average expression of 165/175 cell cycle genes. Cluster 0 is however marked by the genes ac and sc, which are known to be expressed in proneuronal cell clusters interspersed throughout the disc that stall in G2 and form mitotically quiescent domains (Usui & Kimura 1992, Development, 116 (1992), pp. 601-610 (no PMID); Nègre et al., 2003, PMID: 12559497). Given these observations, we hypothesize that cluster 0 is largely comprised of stalled G2 cells like those found in ac/sc-expressing proneural clusters.

      The EdU data in Figure 1 is very interesting, especially the persistence in the hinge. The authors speculate that this may be due to cells staying in S phase or performing a higher level of repair-related DNA synthesis. If so, wouldn't you expect 'High PCNA' cells to overlap with the hinge clusters in Figures 6G-G'? Again, no new experiments are needed. Just a more thorough discussion of the data.

      We have found that the locations of elevated PCNA expression do not always correlate with the location of EdU incorporation either by examining scRNA-seq data or by using HCR to detect PCNA. PCNA expression is far more widespread as we now show in Figure 6-figure supplement 3.

      Trbl/G2/M cluster shows Ets21C induction, while the pattern of Ets21C induction as detected by HCR in Figures 5H-I appears in localized clusters. I thought G2/M cells are not spatially confined. Are Ets21C+ cells in Figure 5 in G2/M? Can the overlap be confirmed, for example, by co-staining for Trbl or a G2/M marker with Ets21C?

      The data show that the high-trbl cells are higher in Ets21C transcripts relative to other cell-cycle-based clusters after irradiation. This does not imply that high-trbl-cells in all regions of the disc upregulate Ets21C equally. Ets21C expression is likely heterogeneous in both ways – by location in the disc and by cell-cycle state.

      Induction of dysf in some but not all discs is interesting. What were the proportions? Any possibility of a sex-linked induction that can be addressed by separating male and female larvae?

      We can separate the cells in our dataset into male and female cells by expression of lncRNA:roX1/2. When we do this, we see X-ray induced dysf expressed similarly in both male and female cells. We think that it is therefore unlikely that this difference in expression can be attributed to cell sex. Another possibility is that dysf upregulation might be acutely sensitive to the developmental stage of the disc. This would require experiments with very precisely-staged larvae. We have not investigated this further as it is not a central issue in our paper.

      Reviewer #3 (Recommendations for the authors):

      Please check the color-coding in Figure 1A. The region marked as pouch appears to include hinge folds that express Zfh2 (a hinge marker) in Figure 2A (even after accounting for low Zfh2 expression in part of the pouch).

      We have corrected this and have marked the pouch region based on the analysis of expression of different hinge and pouch markers by Ayala-Camargo et al. 2013 (PMID 2398534).

      The statement 'Furthermore, within tissues, stem cells are most sensitive while differentiated cells are relatively radioresistant' needs to be qualified, as there are differences in radiosensitivity of adult versus embryonic stem cells (e.g., PMID: 30588339)

      We thank the reviewer for bringing this point to our attention and for pointing us to an article that addresses this issue in detail. We appreciate that our statement was rather simplistic – we have modified it and added two additional references.

    1. Metadata is information about some data. So we often think about a dataset as consisting of the main pieces of data (whatever those are in a specific situation), and whatever other information we have about that data (metadata).

      What surprised me is how much information is classified as metadata rather than data. While the tweet text and images feel like the main content, metadata such as time, user identity, and engagement numbers can be even more powerful when analyzing behavior at scale. This raises ethical concerns because users may not realize how much information about them is being collected and interpreted beyond what they intentionally post.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Lahtinen et al. evaluated the association between polygenic scores and mortality. This question has been intensely studied (Sakaue 2020 Nature Medicine, Jukarainen 2022 Nature Medicine, Argentieri 2025 Nature Medicine), where most studies use PRS as an instrument to attribute death to different causes. The presented study focuses on polygenic scores of non-fatal outcomes and separates the cause of death into "external" and "internal". The majority of the results are descriptive, and the data doesn't have the power to distinguish effect sizes of the interesting comparisons: (1) differences between external vs. internal (2) differences between PGI effect and measured phenotype. I have two main comments:

      (1) The authors should clarify whether the p-values reported in the text will remain significant after multiple testing adjustment. Some of the large effects might be significant; for example, Figure 2C.

      We have now added Benjamini-Hochberg multiple-testing adjusted p-values in the text each time we present nominal p-values. Additionally, supplementary tables S5 and S6 provide multiple-adjusted p-values for all analysed PGIs.
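      For readers unfamiliar with the adjustment mentioned above, the following is a minimal sketch of the Benjamini-Hochberg procedure in plain Python. It is an illustration only, not the authors' code; the function name and the list of nominal p-values are hypothetical and are not taken from the manuscript.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical nominal p-values (illustration only, not from the study).
nominal = [0.001, 0.008, 0.039, 0.041, 0.27]
adjusted = benjamini_hochberg(nominal)
```

      Each adjusted value is the nominal p-value scaled by the number of tests divided by its rank, capped so that a smaller nominal p-value never receives a larger adjusted value than a larger one.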

      Although this was not always the case, many comparisons remained significant after multiple testing adjustments, especially in Figure 2C that the reviewer commented on. In the revised version, we have placed more emphasis on describing these HRs that have low p-values after multiple-test adjustment. The revised text for Figure 2C in the Results section now reads:

      Panel C analyses mortality in three age-specific follow-up periods. The PGIs were more predictive of death in younger age groups, although the difference between the 25–64 and 65–79 age groups was small, except for the PGI of ADHD (HR=1.14, 95% CI 1.08; 1.21 for 25–64-year-olds; HR=1.04, 95% CI 1.00; 1.08 for 65–79-year-olds; p=0.008 for difference, p=0.27 after multiple-testing adjustment). PGIs predicted death only negligibly among those aged 80+, and the largest differences between the age groups 25–64 and 80+ were for PGIs of self-rated health (HR 0.87, 95% CI 0.82; 0.93 for 25–64-year-olds, HR 1.00, 95% CI 0.94; 1.04 for 80+ year-olds, p=2×10⁻⁴ for difference, p=0.006 after multiple-testing adjustment), ADHD (HR 1.14, 95% CI 1.08; 1.21 for 25–64-year-olds, HR 0.99, 95% CI 0.95; 1.03 for 80+ year-olds, p=7×10⁻⁴ for difference, p=0.012 after multiple-testing adjustment) and depressive symptoms (HR 1.12, 95% CI 1.06; 1.18 for 25–64-year-olds, HR 1.00, 95% CI 0.96; 1.04 for 80+ year-olds, p=0.002 for difference, p=0.032 after multiple-testing adjustment). Additionally, the difference in HRs between these age groups achieved significance after multiple testing adjustment at the conventional 5% level for PGIs of cigarettes per day, educational attainment, and ever smoking.

      We have also included the recent study by Argentieri et al. (2025) in the literature review, which was missing from our previous version. We appreciate the reference. Other references mentioned were already included in the previous version of the manuscript.

      (note that the small prediction accuracy of PGI in older age groups has been extensively studied, see Jiang, Holmes, and McVean, 2021, PLoS Genetics).

      We would like to thank the reviewer for suggesting the relevant reference by Jiang et al. We have now expanded on the discussion of age-specific differences in the discussion section and included this reference.

      (2) The authors might check if PGI+Phenotype has improved performance over Phenotype only. This is similar to Model 2 in Table 1, but slightly different.

      The reviewer raises an interesting angle from which to approach the analysis. We have now added an analysis assessing the information criteria and the significance of improvement between nested models in Supplementary table S8. All the tested PGI+phenotype models show improvement over the phenotype-only model that is statistically significant at all conventional levels when tested by likelihood-ratio tests between nested models. Additionally, improvement was found when using Akaike and Bayesian (Schwarz) information criteria (albeit sometimes modest in size). We have added a passage in the results section briefly summarising this analysis:

      Supplementary table S8 presents information criteria and significance tests on corresponding models. Models with PGI+phenotype (Models 2a–f) showed improvement over models with the phenotype only (Models 1a, 1c, 1e, 1g, 1i, 1k, with a p=0.0006 or lower) in terms of both Akaike information criterion (AIC) as well as Bayesian (Schwarz) information criterion (BIC) with a p=0.0006 or lower in all comparisons. The full Model 4 again showed improvement over the model with all PGIs jointly (Model 3b, with a p=0.0002 or p=0.00002, depending on continuous/categorical phenotype measurement), which had a lower AIC but not BIC.
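      The kind of nested-model comparison described above can be sketched as follows. This is a hedged illustration, not the authors' analysis: the log-likelihood values, parameter counts and number of events are hypothetical, the test assumes the larger model adds exactly one parameter (one PGI), and using the number of events as the sample size in BIC is one common convention for survival models rather than something stated in the manuscript.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_events):
    """Bayesian (Schwarz) information criterion (lower is better)."""
    return n_params * math.log(n_events) - 2 * log_lik


def lr_test_1df(log_lik_restricted, log_lik_full):
    """Likelihood-ratio test for nested models differing by one parameter."""
    stat = 2 * (log_lik_full - log_lik_restricted)
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2))
    return stat, p_value


# Hypothetical partial log-likelihoods (illustration only).
ll_phenotype_only = -10000.0   # phenotype-only model, 1 parameter
ll_pgi_plus_pheno = -9993.0    # PGI + phenotype model, 2 parameters
n_events = 5000                # assumed number of deaths

stat, p_value = lr_test_1df(ll_phenotype_only, ll_pgi_plus_pheno)
aic_gain = aic(ll_phenotype_only, 1) - aic(ll_pgi_plus_pheno, 2)
bic_gain = bic(ll_phenotype_only, 1, n_events) - bic(ll_pgi_plus_pheno, 2, n_events)
```

      A positive `aic_gain` or `bic_gain` favours the larger model. Because BIC penalises each extra parameter by the logarithm of the sample size, a model can improve on AIC without improving on BIC, which is the pattern the quoted passage reports for Model 3b.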

      Reviewer #2 (Public review): 

      Summary:

      This study provides a comprehensive evaluation of the association between polygenic indices (PGIs) for 35 lifestyle and behavioral traits and all-cause mortality, using data from Finnish population- and family-based cohorts. The analysis was stratified by sex, cause of death (natural vs. external), age at death, and participants' educational attainment. Additional analyses focused on the six most predictive PGIs, examining their independent associations after mutual adjustment and adjustment for corresponding directly measured baseline risk factors.

      Strengths:

      Large sample size with long-term follow-up.

      Use of both population- and family-based analytical approaches to evaluate associations.

      Weaknesses:

      It is unclear whether the PGIs used for each trait represent the most current or optimal versions based on the latest GWAS data.

      To our reading, this comment is closely related to the “recommendations for the author” number 3 by reviewer 2, and we thus address them together. 

      If the Finnish data used in this study also contributed to the development of some of the PGIs, there is a risk of overestimating their associations with mortality due to overfitting or "double-dipping." Similar inflation of effect sizes has been observed in studies using the UK Biobank, which is widely used for PGI construction.

      To our reading, this comment is closely related to the “recommendations for the author” 4 by reviewer 2, and we thus address them together.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Specific comments:

      (1) Cited reference 1 also investigated the PRS association with life span; cited reference 8 explains PRS association with healthy lifespan. Can authors be clearer about what is new in the context of these references? Specifically, what are the PGIs studied here that were not analyzed in the cited analyses?

      Although some previous studies on the topic do exist, our analysis arguably has novelty in touching upon several unstudied or scarcely studied themes. These include:

      A set of PGIs focusing on social, psychological, and behavioural phenotypes or PGIs for typically non-fatal health conditions.

      An assessment of direct genetic effects/ confounding with a within-sibship design.

      An assessment of potential heterogeneous effects by several socio-demographic characteristics.

      An analysis of external causes of deaths (which can be hypothesised to be particularly relevant here, given the choice of our PGIs not focusing directly on typical causes of death).

      A detailed assessment of the interplay of the most predictive PGIs with their corresponding phenotypes.

      We have substantially revised the Introduction section focusing on making these novel contributions more explicit.

      (2) In the Methods section, it is not very clear why the authors specifically study the "within-sibship" samples. Is this for avoiding nurturing effects from parental genotypes or for controlling assortative mating? The authors should clarify the rationale behind the design.

      The substance-related rationale behind this approach was briefly discussed in the Introduction section while in the Methods section, we focused more on the technical description of our analyses. However, it is certainly worthwhile to clarify to the reader why within-sibship methods have been used. The revised passage in the methods section now states:

      “In addition to this population sample, we used a within-sibship analysis sample to assess the extent of direct and indirect genetic associations captured by the PGIs, as discussed in the introduction.”

      (3) "Residual correlations of PGIs were no more than 0.050..." As a minor comment, since a PGI is a noisy variable, the correlation would be low; however, I don't think there are better ways to evaluate Cox assumptions, and in many cases, this assumption is not correct for strong predictors.

      Yes, these points are true. Overall, it is often implausible that empirical distributions exactly match distributional assumptions in statistical models. For example, it may not be realistic to expect that the mortality hazards across categories of independent variables stay exactly proportional during long mortality-follow-ups; some deviations from constant proportions are almost inevitable. However, there are reasonable grounds to argue that in case of moderate violations of the proportional hazards assumption, the estimates still remain interpretable for practical uses. They can be read as approximating average relative hazards over the study period (for discussion, see pages 42–47 in Allison P. 2014. Event history and survival analysis: Regression for longitudinal event data (second edition). Thousand Oaks: SAGE).

      (4) "PGI of ADHD (HR=1.08 95%CI 1.04;1.11 among men; HR=1.01 95%CI 0.97;1.05 among women; p=0.012 for difference)." Is this difference significant after multiple testing correction?

      We have presented multiple-testing adjusted p-values together with nominal ones in this and in all other instances where they are mentioned in the text. Additionally, Supplementary tables S5–S6 present multiple-testing adjusted p-values for each PGI studied.

      (5) "Panel D displays that most PGIs had stronger associations with external (accidents, violent, suicide, and alcohol related deaths) than natural causes of death." Similar to the comment above, are there any results that are significantly different between internal and external?

      We have added the p-values of those variables that had larger differences in the revised text. Quoting from the revised article: “The HR differences between external and natural causes of death were nominally significant at the conventional 5% level for cannabis use (p=0.016), drinks per week (p=0.028), left out of social activity (p=0.029), ADHD (p=0.031), BMI (p=0.035) and height (p=0.049), but none of these differences remained significant after adjusting for 35 multiple tests.”

      (6) Table 1: The effect of the phenotype is stronger than the PGI; this is expected as PGI is a weak predictor and can be considered as a "noised" measurement of true genetic value (Becker 2021 Nature Human Behaviour). Is there a way to adjust for the impact of noise in PGI at tagging genetic value and compare if the PGI effect is different from the phenotype effect?

      PGIs are certainly imperfect measures that contain a lot of noise. However, extracting new information from what is unknown is an extremely demanding exercise, and it is further complicated, for example, by the fact that we do not know the exact benchmark of the total genetic effect we should be aiming at. Different methods of heritability estimation, for instance, often give dramatically differing results, for reasons that are still under scrutiny.

      We are thus not familiar with a method that could achieve a satisfactory answer for this challenging task.

      Reviewer #2 (Recommendations for the authors):

      (3) Justification and Selection of PGIs:

      For several traits, such as BMI, multiple polygenic indices (PGIs) are currently available. The criteria used to select specific PGIs for this study are not clearly described. A more systematic and reproducible approach (for example, leveraging metadata from the PGS Catalog) could strengthen the justification for PGI selection and enhance the study's generalizability.

      There are numerous PGIs developed in the extensive GWAS literature, but a finite set of PGIs always needs to be chosen for any analysis. The rationale behind our decision to include every PGI from the repository of Becker et al. 2021 (full reference in the manuscript, see also https://www.thessgac.org/pgi-repository) that was available for the Finnish data (including the possibility to exclude overlapping samples, see our response to the next comment for more discussion) was to provide rigorous analysis by limiting the researchers' degrees of freedom in arbitrarily choosing PGIs. Although it would have been tempting to not use some PGIs that were not expected to substantially correlate with mortality, we believe that our conservative strategy increases the credibility of the reported p-values; in particular, the multiple-testing adjustment should now work as intended.

      We now also mention this rationale when discussing the chosen PGIs in the methods section: “As the independent variables of main interest, we used 35 different PGIs in the Polygenic Index repository by Becker et al., which were mainly based on GWASes using UK Biobank and 23andMe, Inc. data samples, but also other data collections. They were tailored for the Finnish data, i.e., excluding overlapping individuals between the original GWAS and our analysis and performing linkage-disequilibrium adjustment. We used every single-trait PGI defined in the repository (except for subjective well-being, for which we were unable to obtain a meta-analysis version that excluded the overlapping samples). By limiting the researchers’ freedom in selecting the measures, this conservative strategy should increase the validity of our estimates, particularly with regards to multiple-testing adjusted p-values.”

      (4) Overlap Between PGI Training Data and Study Sample:

      The authors should describe any overlap between the data used to develop the PGIs and the current study sample. If such overlap exists, it may lead to overestimation of effect sizes due to "double-dipping." A discussion of this issue and its potential implications is warranted, as similar concerns have been raised in studies using UK Biobank data.

      This is, fortunately, not a concern of our analysis. Overlapping samples were excluded in creating the PGIs that we used. We have now described this more clearly in the revised methods section.

      (1) Clarify the Methodology for Family-Based Cox Analysis:

      It is unclear what specific method was used to perform Cox regression in the family-based analysis. Please provide additional methodological details.

      We have described the method further and added an additional reference in the revision. The text now stands:

      “We compared these models to the corresponding within-sibship models, using the sibship identifier as the strata variable. This method employs a sibship-specific (instead of a whole-sample-wide baseline hazard in the population models) baseline hazard, and corresponds to a fixed-effects model in some other regression frameworks (e.g., linear model with sibship-specific intercepts)”
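      To make the idea of a sibship-specific baseline hazard concrete, here is a minimal sketch of a stratified Cox partial log-likelihood for a single covariate. It is an illustration under stated assumptions and not the authors' implementation: the function name, the record layout `(stratum, time, event, x)` and the toy data are all hypothetical, and ties are handled in the simple Breslow manner.

```python
import math
from collections import defaultdict


def stratified_cox_loglik(beta, records):
    """Partial log-likelihood with one baseline hazard per stratum.

    records: iterable of (stratum, time, event, x), where event is 1 for a
    death and 0 for censoring, and x is a single covariate (e.g. a PGI).
    """
    by_stratum = defaultdict(list)
    for stratum, time, event, x in records:
        by_stratum[stratum].append((time, event, x))

    loglik = 0.0
    for members in by_stratum.values():
        for time_i, event_i, x_i in members:
            if not event_i:
                continue
            # Risk set: members of the SAME stratum still under follow-up.
            risk = sum(
                math.exp(beta * x_j)
                for time_j, _, x_j in members
                if time_j >= time_i
            )
            loglik += beta * x_i - math.log(risk)
    return loglik


# Hypothetical toy data: two sibships (A, B) and one singleton (C).
siblings = [
    ("A", 5.0, 1, 1.0), ("A", 8.0, 0, 0.0),
    ("B", 3.0, 1, 0.0), ("B", 7.0, 1, 1.0),
]
singleton = [("C", 4.0, 1, 2.0)]

ll_siblings = stratified_cox_loglik(0.5, siblings)
ll_with_singleton = stratified_cox_loglik(0.5, siblings + singleton)
```

      Because each risk set is restricted to one sibship, only differences between siblings inform the estimate, and a stratum with a single member contributes nothing to the likelihood. This mirrors the behaviour of sibship-specific intercepts in a linear fixed-effects model.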

      (2) Clarify Timing of Measured Risk Factors Relative to Follow-Up:

      The main text should provide more detailed information regarding the timing of data collection for directly measured risk factors. Specifically, it should be clarified whether the measurements used correspond to the first available data for each individual after the start of follow-up, or if a different criterion was applied.

      BMI, self-rated health, alcohol consumption and smoking status were measured at the baseline survey of each dataset. Education was registered as the highest completed degree up to the end of 2019. Depression was a composite of survey self-report (at the time of the baseline survey), as well as depression-related medicine purchases and hospitalizations over a two-year period before the start of the individual’s follow-up.

      We have added more comprehensive information on the measurement of the phenotypes of interest in Supplementary table 2, including the timing of the measurement.

    1. Author response:

      Point-by-point description of the revisions:

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary

      In this article, the authors used synthetic TALE DNA-binding proteins, tagged with YFP, which were designed to target five specific repeat elements in the Trypanosoma brucei genome, including centromere- and telomere-associated repeats and those of a transposon element. The aim was to detect and identify, using YFP pulldown, specific proteins that bind to these repetitive sequences in T. brucei chromatin. Validation of the approach was done using a TALE protein designed to target the telomere repeat (TelR-TALE), which detected many of the proteins previously implicated in telomeric functions. A TALE protein designed to target the 70 bp repeats that reside adjacent to the VSG genes (70R-TALE) detected proteins that function in DNA repair, and the protein designed to target the 177 bp repeat arrays (177R-TALE) identified kinetochore proteins associated with T. brucei megabase chromosomes, as well as with intermediate and mini-chromosomes, which implies that kinetochore assembly and segregation mechanisms are similar in all T. brucei chromosomes.

      Major comments:

      Are the key conclusions convincing?

      The authors reported that they have successfully used TALE-based affinity selection of proteins associated with repetitive sequences in the T. brucei genome. They claimed that this study has provided new information regarding the relevance of the repetitive regions in the genome to chromosome integrity, telomere biology, chromosomal segregation and immune evasion strategies. These conclusions are based on high-quality research, and the work basically merits publication, provided that some major concerns, raised below, are addressed before acceptance for publication.

      (1) The authors used the TALE-YFP approach to examine the proteome associated with five different repetitive regions of the T. brucei genome and confirmed the binding of TALE-YFP with ChIP-seq analyses. Ultimately, they obtained a list of proteins that bound to the synthetic proteins, by affinity purification and LC-MS analysis, and concluded that these proteins bind to different repetitive regions of the genome. Two control proteins, TRF-YFP and KKT2-YFP, were used to confirm the interactions. However, there is no experiment confirming that the analysis gives some insight into the role of any putative or new protein in telomere biology, VSG gene regulation or chromosomal segregation. Proteins that have already been reported by other studies are mentioned. Although the authors discovered many proteins in these repetitive regions, their roles are as yet unknown. It is recommended that the authors take one or more of the new putative proteins from the repetitive elements and (1) show whether or not they bind directly to the specific repetitive sequence (e.g., by EMSA); and (2) knock down one or a small sample of the newly discovered proteins, which may shed light on their function at the repetitive region, as a proof of concept.

      The main request from Referee 1 is for individual evaluation of protein-DNA interaction for a few candidates identified in our TALE-YFP affinity purifications, particularly using EMSA to identify binding to the DNA repeats used for the TALE selection. In our opinion, such an approach would not actually provide the validation anticipated by the reviewer. The power of TALE-YFP affinity selection is that it enriches for protein complexes that associate with the chromatin that coats the target DNA repetitive elements rather than only identifying individual proteins or components of a complex that directly bind to DNA assembled in chromatin.

      The referee suggests we express recombinant proteins and perform EMSA for selected candidates, but many of the identified proteins are unlikely to directly bind to DNA – they are more likely to associate with a combination of features present in DNA and/or chromatin (e.g. specific histone variants or histone post-translational modifications). Of course, a positive result would provide some validation but only IF the tested protein can bind DNA in isolation – thus, a negative result would be uninformative.

      In fact, our finding that KKT proteins are enriched using the 177R-TALE (minichromosome repeat sequence) identifies components of the trypanosome kinetochore known (KKT2) or predicted (KKT3) to directly bind DNA (Marciano et al., 2021; PMID: 34081090), and likewise the TelR-TALE identifies the TRF component that is known to directly associate with telomeric (TTAGGG)n repeats (Reis et al 2018; PMID: 29385523). This provides reassurance on the specificity of the selection, as does the lack of cross selectivity between different TALEs used (see later point 3 below). The enrichment of the respective DNA repeats quantitated in Figure 2B (originally Figure S1) also provides strong evidence for TALE selectivity.

      It is very likely that most of the components enriched on the repetitive elements targeted by our TALE-YFP proteins do not bind repetitive DNA directly. The TRF telomere binding protein is an exception – but it is the only obvious DNA binding protein amongst the many proteins identified as being enriched in our TelR-TALE-YFP and TRF-YFP affinity selections.

      The referee also suggests that follow up experiments using knockdown of the identified proteins found to be enriched on repetitive DNA elements would be informative. In our opinion, this manuscript presents the development of a new methodology previously not applied to trypanosomes, and referee 2 highlights the value of this methodological development which will be relevant for a large community of kinetoplastid researchers. In-depth follow-up analyses would be beyond the scope of this current study but of course will be pursued in future. To be meaningful such knockdown analyses would need to be comprehensive in terms of their phenotypic characterisation (e.g. quantitative effects on chromosome biology and cell cycle progression, rates and mechanism of recombination underlying antigenic variation, etc) – simple RNAi knockdowns would provide information on fitness but little more. This information is already publicly available from genome-wide RNAi screens (www.tritrypDB.org), with further information on protein location available from the genome-wide protein localisation resource (Tryptag.org). Hence basic information is available on all targets selected by the TALEs after RNAi knock down but in-depth follow-up functional analysis of several proteins would require specific targeted assays beyond the scope of this study.

      (2) NonR-TALE-YFP does not have a binding site in the genome, but YFP protein should still be expressed by T. brucei clones with NLS. The authors have to explain why there is no signal detected in the nucleus, while a prominent signal was detected near kDNA (see Fig.2). Why is the expression of YFP in NonR-TALE almost not shown compared to other TALE clones?

      The NonR-TALE-YFP immunolocalisation signal indeed is apparently located close to the kDNA and away from the nucleus. We are not sure why this is so, but the construct is sequence validated and correct. However, we note that artefactual localisation of proteins fused to a globular eGFP tag, compared to a short linear epitope V5 tag, near to the kinetoplast has been previously reported (Pyrih et al, 2023; PMID: 37669165).

      The expression of NonR-TALE-YFP is shown in Supplementary Fig. S2 in comparison to other TALE proteins. Although it is evident that NonR-TALE-YFP is expressed at lower levels than other TALEs (the different TALEs have different expression levels), it is likely that in each case the TALE proteins would be in relative excess.

      It is possible that the absence of a target sequence for the NonR-TALE-YFP in the nucleus affects its stability and cellular location. Understanding these differences is tangential to the aim of this study.

      However, importantly, NonR-TALE-YFP is not the only control used for specificity in our affinity purifications. Instead, the lack of cross-selection of the same proteins by different TALEs (e.g. TelR-TALE-YFP, 177R-TALE-YFP) and the lack of enrichment of any proteins of interest by the well expressed ingiR-TALE-YFP or 147R-TALE-YFP proteins each provide strong evidence for the specificity of the selection using TALEs, as does the enrichment of similar protein sets following affinity purification of the TelR-TALE-YFP and TRF-YFP proteins which both bind telomeric (TTAGGG)n repeats. Moreover, control affinity purifications to assess background were performed using cells that completely lack an expressed YFP protein, which further supports specificity (Figure 6).

      We have added text to highlight these important points in the revised manuscript:

      Page 8:

      “However, the expression level of NonR-TALE-YFP was lower than other TALE-YFP proteins; this may relate to the lack of DNA binding sites for NonR-TALE-YFP in the nucleus.”

      Page 8:

      “NonR-TALE-YFP displayed a diffuse nuclear and cytoplasmic signal; unexpectedly the cytoplasmic signal appeared to be in the vicinity of the kDNA of the kinetoplast (mitochondrion). We note that artefactual localisation of some proteins fused to an eGFP tag has previously been observed in T. brucei (Pyrih et al, 2023).”

      Page 10:

      Moreover, a similar set of enriched proteins was identified in TelR-TALE-YFP affinity purifications whether compared with cells expressing no YFP fusion protein (No-YFP), the NonR-TALE-YFP or the ingiR-TALE-YFP as controls (Fig. S7B, S8A; Tables S3, S4). Thus, the most enriched proteins are specific to TelR-TALE-YFP-associated chromatin rather than to the TALE-YFP synthetic protein module or other chromatin.

      (3) As a proof of concept, the author showed that the TALE method determined the same interacting partners enrichment in TelR-TALE as compared to TRF-YFP. And they show the same interacting partners for other TALE proteins, whether compared with WT cells or with the NonR-TALE parasites. It may be because NonR-TALE parasites have almost no (or very little) YFP expression (see Fig. S3) as compared to other TALE clones and the TRF-YFP clone. To address this concern, there should be a control included, with proper YFP expression.

      See response to point 2, but we reiterate that the ingiR-TALE-YFP and 147R-TALE-YFP proteins are well expressed (western original Fig. S3 now Fig. S2) but few proteins are detected as being enriched or correspond to those enriched in TelR-TALE-YFP or TRF-YFP affinity purifications (see Fig. S9). Therefore, the ingiR-TALE-YFP and 147R-TALE-YFP proteins provide good additional negative controls for specificity as requested. To further reassure the referee we have also included additional volcano plots which compare TelR-TALE-YFP, 70R-TALE-YFP or 177R-TALE-YFP to the ingiR-TALE-YFP affinity selection (new Figure S8). As with No-YFP or NonR-TALE-YFP controls, the use of ingiR-TALE-YFP as a negative control demonstrates that known telomere associated proteins are enriched in TelR-TALE-YFP affinity purification, RPA subunits are enriched with 70R-TALE-YFP and kinetochore KKT proteins are enriched with 177R-TALE-YFP. These analyses demonstrate specificity in the proteins enriched following affinity purification of our different TALE-YFPs and provide support to strengthen our original findings.

      We now refer to use of No-YFP, NonR-TALE-YFP, and ingiR-TALE-YFP as controls for comparison to TelR-TALE-YFP, 70R-TALE-YFP or 177R-TALE-YFP in several places:

      Page 10:

      “Moreover, a similar set of enriched proteins was identified in TelR-TALE-YFP affinity purifications whether compared with cells expressing no YFP fusion protein (No-YFP), the NonR-TALE-YFP or the ingiR-TALE-YFP as controls (Fig. S7B, S8A; Tables S3, S4).”

      Page 11:

      “Thus, the nuclear ingiR-TALE-YFP provides an additional chromatin-associated negative control for affinity purifications with the TelR-TALE-YFP, 70R-TALE-YFP and 177R-TALE-YFP proteins (Fig. S8).”

      “Proteins identified as being enriched with 70R-TALE-YFP (Figure 6D) were similar in comparisons with either the No-YFP, NonR-TALE-YFP or ingiR-TALE-YFP as negative controls.”

      Top Page 12:

      “The same kinetochore proteins were enriched regardless of whether the 177R-TALE proteomics data was compared with No-YFP, NonR-TALE or ingiR-TALE-YFP controls.”

      Discussion Page 13:

      “Regardless, the 147R-TALE and ingiR-TALE proteins were well expressed in T. brucei cells, but their affinity selection did not significantly enrich for any relevant proteins. Thus, 147R-TALE and ingiR-TALE provide reassurance for the overall specificity for proteins enriched in TelR-TALE, 70R-TALE and 177R-TALE affinity purifications.”

      (4) After the artificial expression of repetitive sequence binding five-TALE proteins, the question is if there is any competition for the TALE proteins with the corresponding endogenous proteins? Is there any effect on parasite survival or health, compared to the control after the expression of these five TALEs YFP protein? It is recommended to add parasite growth curves, for all the TALE proteins expressing cultures.

      Growth curves for cells expressing TelR-TALE-YFP, 177R-TALE-YFP and ingiR-TALE-YFP are now included (New Fig S3A). No deficit in growth was evident while passaging 70R-TALE-YFP, 147R-TALE-YFP, NonR-TALE-YFP cell lines (indeed they grew slightly better than controls).

      The following text has been added page 8:

      “Cell lines expressing representative TALE-YFP proteins displayed no fitness deficit (Fig. S3A).”

      (5) Since the experiments were performed using whole-cell extracts without prior nuclear fractionation, the authors should consider the possibility that some identified proteins may have originated from compartments other than the nucleus. Specifically, the detection of certain binding proteins might reflect sequence homology (or partial homology) between mitochondrial DNA (maxicircles and minicircles) and repetitive regions in the nuclear genome. Additionally, the lack of subcellular separation raises the concern that cytoplasmic proteins could have been co-purified due to whole cell lysis, making it challenging to discern whether the observed proteome truly represents the nuclear interactome.

      In our experimental design, we confirmed bioinformatically that the repeat sequences targeted were not represented elsewhere in the nuclear or mitochondrial genome (kDNA). The absence of subcellular fractionation could result in some cytoplasmic protein selection, but this is unlikely since each TALE targets a specific DNA sequence but is otherwise identical such that cross-selection of the same contaminating protein set would be anticipated if there was significant non-specific binding. We have previously successfully affinity selected 15 chromatin modifiers and identified associated proteins without major issues concerning cytoplasmic protein contamination (Staneva et al 2021 and 2022; PMID: 34407985 and 36169304). Of course, the possibility that some proteins are contaminants will need to be borne in mind in any future follow-up analysis of proteins of interest that we identified as being enriched on specific types of repetitive element in T. brucei. Proteins that are also detected in negative controls, or in negative affinity selections such as No-YFP, NonR-TALE-YFP, ingiR-TALE or 147R-TALE, must be disregarded.

      (6) Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?

      As mentioned earlier, the authors claimed that this study has provided new information concerning telomere biology, chromosomal segregation mechanisms, and immune evasion strategies. But there are no experiments that provide a role for any unknown or known protein in these processes. Thus, it is suggested to select one or two proteins of choice from the list and validate their direct binding to repetitive region(s), and their role in that region of interaction.

      As highlighted in response to point 1 the suggested validation and follow up experiments may well not be informative and are beyond the scope of the methodological development presented in this manuscript. Referee 2 describes the study in its current form as “a significant conceptual and technical advancement” and “This approach enhances our understanding of chromatin organization in these regions and provides a foundation for investigating the functional roles of associated proteins in parasite biology.”

      The Referee’s phrase ‘validate their direct binding to repetitive region(s)’ here may also mean to test if any of the additional proteins that we identified as being enriched with a specific TALE protein actually display enrichment over the repeat regions when examined by an orthogonal method. A key unexpected finding was that kinetochore proteins including KKT2 are enriched in our affinity purifications of the 177R-TALE-YFP that targets 177bp repeats (Figure 6F). By conducting ChIP-seq for the kinetochore specific protein KKT2 using YFP-KKT2 we confirmed that KKT2 is indeed enriched on 177bp repeat DNA but not flanking DNA (Figure 7). Moreover, several known telomere-associated proteins are detected in our affinity selections of TelR-TALE-YFP (Figure 6B, Fig. S6; see also Reis et al, 2018 Nuc. Acids Res. PMID: 29385523; Weisert et al, 2024 Sci. Reports PMID: 39681615).

      Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      The answer to this question depends on what the authors want to present as the achievements of the present study. If the achievement of the paper is the creation of a new tool for discovering new proteins associated with the repeat regions, I recommend that they add proof of direct interactions between a sample of the newly discovered proteins and the relevant repeats, as a proof of concept discussed above. However, if the authors would like to claim that the study achieved new functional insights for these interactions, they will have to expand the study, as mentioned above, to support the proof of concept.

      See our response to point 1 and the point we labelled ‘6’ above.

      Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments.

      I think that they are realistic. If the authors decide to check the capacity of a small sample of proteins (not previously known as repetitive region binding proteins) to interact directly with the repeated sequence, it will add substantially to the study (e.g., by EMSA; estimated time: 1 month). If the authors also decide to check the function of at least one such newly detected protein (e.g., by KD), I estimate this will take 3-6 months.

      As highlighted previously the proposed EMSA experiment may well be uninformative for protein complex components identified in our study or for isolated proteins that directly bind DNA in the context of a complex and chromatin. RNAi knockdown data and cell location data (as well as developmental expression and orthology data) are already available through tritrypDB.org and tryptag.org.

      Are the data and the methods presented in such a way that they can be reproduced? Yes

      Are the experiments adequately replicated, and statistical analysis adequate?

      The authors did not mention replicates. There is no statistical analysis mentioned.

      The figure legends indicate that all volcano plots of TALE affinity selections were derived from three biological replicates. Cutoffs used for significance: P < 0.05 (Student's t-test).
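
      To make the stated cutoff concrete, the following is a minimal sketch of a per-protein Student's t-test on three replicates per condition, using log2 intensities. The intensity values, the function names and the fixed critical value (2.776, the two-sided 5% threshold for 4 degrees of freedom) are assumptions for illustration only; the actual proteomics workflow is not described in this response.

```python
import math
from statistics import mean, variance

# Two-sided critical t value for alpha = 0.05 with 4 degrees of freedom
# (3 + 3 replicates, equal-variance Student's t-test).
T_CRIT_DF4 = 2.776


def student_t(a, b):
    """Return the pooled-variance (Student's) t statistic for two samples."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def is_enriched(ip_log2, ctrl_log2):
    """Return (log2 fold change, flag) for one protein.

    The flag is True when the fold change is positive and |t| exceeds the
    critical value, i.e. P < 0.05 for exactly three replicates per group.
    """
    if len(ip_log2) != 3 or len(ctrl_log2) != 3:
        raise ValueError("critical value is only valid for 3 + 3 replicates")
    log2_fc = mean(ip_log2) - mean(ctrl_log2)
    t = student_t(ip_log2, ctrl_log2)
    return log2_fc, log2_fc > 0 and abs(t) > T_CRIT_DF4


# Hypothetical log2 intensities from three biological replicates each.
log2_fc, enriched = is_enriched([24.1, 24.5, 23.9], [20.2, 20.8, 20.4])
```

      In a volcano plot, the fold change returned here would be the x-axis value, and the significance flag corresponds to points above the P < 0.05 line.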

      For ChIP-seq two biological replicates were analysed for each cell line expressing the specific YFP tagged protein of interest (TALE or KKT2). This is now stated in the relevant figure legends – apologies for this oversight. The resulting data are available for scrutiny at GEO: GSE295698.

      Minor comments:

      Specific experimental issues that are easily addressable.

      The following suggestions can be incorporated:

      (1) Page 18, in the Materials and Methods section, the authors mention four drugs: Blasticidine, Phleomycin, G418 and hygromycin. It is recommended to mention the purpose of using these selective drugs for the parasite. If clonal selection has been done, then it should also be mentioned.

      We erroneously added information on several drugs used for selection in our laboratory. In fact all TALE-YFP constructs carry the Bleomycin resistance gene, which we select for using Phleomycin. Also, clones were derived by limiting dilution immediately after transfection. We have amended the text accordingly:

      Page 17/18:

      “Cell cultures were maintained below 3 x 10⁶ cells/ml. Phleomycin 2.5 µg/ml was used to select transformants containing the TALE construct BleoR gene.”

      “Electroporated bloodstream cells were added to 30 ml HMI-9 medium and two 10-fold serial dilutions were performed in order to isolate clonal Phleomycin resistant populations from the transfection. 1 ml of transfected cells were plated per well on 24-well plates (1 plate per serial dilution) and incubated at 37°C and 5% CO2 for a minimum of 6 h before adding 1 ml media containing 2X concentration Phleomycin (5 µg/ml) per well.”

      (2) In the method section the authors mentioned that there is only one site for binding of NonR-TALE in the parasite genome. But in Fig. 1C, the authors showed zero binding site. So, there is one binding site for NonR-TALE-YFP in the genome or zero?

      We thank the reviewer for pointing out this discrepancy. We have checked the latest Tb427v12 genome assembly for predicted NonR-TALE binding sites and there are no exact matches. We have corrected the text accordingly.

      Page 7:

      “A control NonR-TALE protein was also designed which was predicted to have no target sequence in the T. brucei genome.”

      Page 17:

      “A control NonR-TALE predicted to have no recognised target in the T. brucei genome was designed as follows: BLAST searches were used to identify exact matches in the TREU927 reference genome. Candidate sequences with one or more match were discarded.”
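
      The filtering rule in this passage (discard any candidate with one or more exact match) can be sketched as below. This is a simplified stand-in that scans a genome string on both strands rather than running BLAST; the function names, the toy genome and the candidate sequences are hypothetical, and sequences are assumed to be upper-case.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def count_exact_matches(candidate, genome):
    """Count overlapping exact matches of candidate on both strands of genome."""

    def count(pattern):
        hits, start = 0, genome.find(pattern)
        while start != -1:
            hits += 1
            start = genome.find(pattern, start + 1)
        return hits

    rc = reverse_complement(candidate)
    # A palindromic candidate equals its reverse complement; count it once.
    return count(candidate) if rc == candidate else count(candidate) + count(rc)


def filter_candidates(candidates, genome):
    """Keep only candidates with no exact match anywhere in the genome."""
    return [c for c in candidates if count_exact_matches(c, genome) == 0]


# Toy genome and hypothetical candidate target sequences.
genome = "TTAGGGTTAGGGACGTACGTCCCTAA"
kept = filter_candidates(["TTAGGG", "GATTACA", "CCCTAA"], genome)
```

      Searching the reverse complement matters because a TALE target present on either strand of double-stranded DNA would still count as a binding site.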

      (3) The authors used two different anti-GFP antibodies, one from Roche and the other from Thermo Fisher. Why were two different antibodies used for the same protein?

      We have found that only some anti-GFP antibodies are effective for affinity selection of associated proteins, whereas others are better suited for immunolocalisation. The respective suppliers’ antibodies were optimised for each application.

      (4) Page 6: in the introduction, the authors give the number of total VSG genes as 2,634. Is it known how many of them are pseudogenes?

      This value corresponds to the number reported by Cosentino et al. 2021 (PMID: 34541528) for subtelomeric VSGs, which is similar to the value reported by Muller et al 2018 (PMID: 30333624) (2486), both in the same strain of trypanosomes as used by us. Based on the earlier analysis by Cross et al (PMID: 24992042), 80% of the identified VSGs in their study (2584) are pseudogenes. This approximates to the estimation by Cosentino of 346/2634 (13%) being fully functional VSG genes at subtelomeres, or 17% when considering VSGs at all genomic locations (433/2872).

      (5) I found several typos throughout the manuscript.

      Thank you for raising this. We have read through the manuscript several times and hopefully corrected all outstanding typos.

      (6) Fig. 1C: Table: below TOTAL 2nd line: the number should be 1838 (rather than 1828)

      Corrected, thank you.

      - Are prior studies referenced appropriately? Yes

      - Are the text and figures clear and accurate? Yes

      - Do you have suggestions that would help the authors improve the presentation of their data and conclusions? Suggested above

      Reviewer #1 (Significance):

      Describe the nature and significance of the advance (e.g., conceptual, technical, clinical) for the field:

      This study represents a significant conceptual and technical advancement by employing a synthetic TALE DNA-binding protein tagged with YFP to selectively identify proteins associated with five distinct repetitive regions of T. brucei chromatin. To the best of my knowledge, it is the first report to utilize TALE-YFP for affinity-based isolation of protein complexes bound to repetitive genomic sequences in T. brucei. This approach enhances our understanding of chromatin organization in these regions and provides a foundation for investigating the functional roles of associated proteins in parasite biology. Importantly, any essential or unique interacting partners identified could serve as potential targets for therapeutic intervention.

      - Place the work in the context of the existing literature (provide references, where appropriate). I agree with the information that has already been described in the submitted manuscript regarding the potential contribution of the resulting data and the established technology to the study of VSG expression, kinetochore mechanism and telomere biology.

      - State what audience might be interested in and influenced by the reported findings. These findings will be of particular interest to researchers studying the molecular biology of kinetoplastid parasites and other unicellular organisms, as well as scientists investigating chromatin structure and the functional roles of repetitive genomic elements in higher eukaryotes.

      - (1) Define your field of expertise with a few keywords to help the authors contextualize your point of view. Protein-DNA interactions/ chromatin/ DNA replication/ Trypanosomes

      - (2) Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate. None

      Reviewer #2 (Evidence, reproducibility and clarity):

      Summary

      Carloni et al. comprehensively analyze which proteins bind repetitive genomic elements in Trypanosoma brucei. For this, they perform mass spectrometry on custom-designed, tagged programmable DNA-binding proteins. After extensively verifying their programmable DNA-binding proteins (using bioinformatic analysis to infer target sites, microscopy to measure localization, ChIP-seq to identify binding sites), they present, among others, two major findings: 1) 14 of the 25 known T. brucei kinetochore proteins are enriched at 177bp repeats. As T. brucei's 177bp repeat-containing intermediate-sized and mini-chromosomes lack centromere repeats but are stable over mitosis, Carloni et al. use their data to hypothesize that a 'rudimentary' kinetochore assembles at the 177bp repeats of these chromosomes to segregate them. 2) 70bp repeats are enriched with the Replication Protein A complex, which, notably, is required for homologous recombination. Homologous recombination is the pathway used for recombination-based antigenic variation of the 70bp-repeat-adjacent variant surface glycoproteins.

      Major Comments

      None. The experiments are well-controlled, claims well-supported, and methods clearly described. Conclusions are convincing.

      Thank you for these positive comments.

      Minor Comments

      (1) Fig. 2 - I couldn't find an uncropped version showing multiple cells. If it exists, it should be linked in the legend or main text; Otherwise, this should be added to the supplement.

      The images presented represent reproducible analyses and were independently verified by two of the authors. Although wider field of view images do not provide the resolution to be informative on cell location, as requested we have provided uncropped images in new Fig. S4 for all the cell lines shown in Figure 2A.

      In addition, we have included as supplementary images (Fig. S3B) additional images of TelR-TALE-YFP, 177R-TALE-YFP and ingiR-TALE-YFP localisation to provide additional support for their observed locations presented in Figure 1. The sets of cells and images presented in Figure 2A and in Fig. S3B were prepared and obtained by different authors, independently and reproducibly validating the location of the tagged protein.

      (2) I think Suppl. Fig. 1 is very valuable, as it is a quantification and summary of the ChIP-seq data. I think the authors could consider making this a panel of a main figure. For the main figure, I think the plot could be trimmed down to only show the background and the relevant repeat for each TALE protein, leaving out the non-target repeats. (This relates to minor comment 6.) Also, I believe, it was not explained how background enrichment was calculated.

      We are grateful for the reviewer’s positive view of original Fig. S1 and appreciate the suggestion. We have now moved these analyses to part B of main Figure 2 in the revised manuscript – now Figure 2B. We have also provided additional details in the Methods section on the approaches used to assess background enrichment.

      Page 19:

      “Background enrichment calculation

      The genome was divided into 50 bp sliding windows, and each window was annotated based on overlapping genomic features, including CIR147, 177 bp repeats, 70 bp repeats, and telomeric (TTAGGG)n repeats. Windows that did not overlap with any of these annotated repeat elements were defined as "background" regions and used to establish the baseline ChIP-seq signal. Enrichment for each window was calculated using bamCompare, as log₂(IP/Input). To adjust for background signal amongst all samples, enrichment values for each sample were further normalized against the corresponding No-YFP ChIP-seq dataset.”
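
      The calculation described in this Methods passage can be sketched as follows. This is not the authors' script: the pseudocount, the window-by-window subtraction of the No-YFP log₂ ratio as the normalisation step, the function names and all read counts are assumptions chosen for illustration.

```python
import math
from statistics import mean

PSEUDOCOUNT = 1.0  # assumed; avoids division by zero in empty windows


def log2_ratio(ip, inp):
    """log2(IP/Input) for one window, with a pseudocount added to both."""
    return math.log2((ip + PSEUDOCOUNT) / (inp + PSEUDOCOUNT))


def summarise(windows):
    """Mean No-YFP-corrected enrichment per annotation class.

    Each window is (annotation, sample_ip, sample_input, noyfp_ip, noyfp_input);
    annotation is a repeat name, or "background" when no repeat overlaps.
    """
    by_class = {}
    for annotation, ip, inp, ctrl_ip, ctrl_inp in windows:
        # Assumed normalisation: subtract the No-YFP log2 ratio per window.
        corrected = log2_ratio(ip, inp) - log2_ratio(ctrl_ip, ctrl_inp)
        by_class.setdefault(annotation, []).append(corrected)
    return {name: mean(values) for name, values in by_class.items()}


# Hypothetical read counts for four 50 bp windows.
windows = [
    ("177bp", 63, 7, 7, 7),
    ("177bp", 31, 7, 7, 7),
    ("background", 7, 7, 7, 7),
    ("background", 3, 7, 7, 7),
]
result = summarise(windows)
```

      With these toy counts the 177 bp class averages 2.5 and the background class averages -0.5 on the log₂ scale, which is the kind of repeat-versus-background contrast the figure summarises.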

      Note: While revising the manuscript we also noticed that the script had a normalization error. We have therefore included a corrected version of these analyses as Figure 2B (old Fig. S1).

      (3) Generally, I would plot enrichment on a log2 axis. This concerns several figures with ChIP-seq data.

      Our ChIP-seq enrichment is calculated by bamCompare. The resulting enrichment values are indeed log2 (IP/Input). We have made this clear in the updated figures/legends.

      (4) Fig. 4C - The violin plots are very hard to interpret, as the plots are very narrow compared to the line thickness, making it hard to judge the actual volume. For example, in Centromere 5, YFP-KKT2 is less enriched than 147R-TALE over most of the centromere with some peaks of much higher enrichment (as visible in panel B), however, in panel C, it is very hard to see this same information. I'm sure there is some way to present this better, either using a different type of plot or by improving the spacing of the existing plot.

      We thank the reviewer for this suggestion; we have elected to provide a Split-Violin plot instead. This improves the presentation of the data for each centromere. The original violin plot in Figure 4C has been replaced with this Split-Violin plot (still Figure 4C).

      (5) Fig. 6 - The panels are missing an x-axis label (although it is obvious from the plot what is displayed).

      Maybe the "WT NO-YFP vs" part that is repeated in all the plot titles could be removed from the title and only be part of the x-axis label?

      In fact, to save space the X axis was labelled inside each volcano plot but we neglected to indicate that values are a log2 scale indicating enrichment. This has been rectified – see Figure 6, and Fig. S7, S8 and S9.

      (6) Fig. 7 - I would like to have a quantification for the examples shown here. In fact, such a quantification already exists in Suppl. Figure 1. I think the relevant plots of that quantification (YFP-KKT2 over 177bp-repeats and centromere-repeats) with some control could be included in Fig. 7 as panel C. This opportunity could be used to show enrichment separated out for intermediate-sized, mini-, and megabase-chromosomes. (relates to minor comment 2 & 8)

      The CIR147 sequence is found exclusively on megabase-sized chromosomes, while the 177 bp repeats are located on intermediate- and mini-sized chromosomes. Due to limitations in the current genome assembly, it is not possible to reliably classify all chromosomes into intermediate- or mini- sized categories based on their length. Therefore, original Supplementary Fig. S1 presented the YFP-KKT2 enrichment over CIR147 and 177 bp repeats as a representative comparison between megabase chromosomes and the remaining chromosomes (corrected version now presented as main Figure 2B). Additionally, to allow direct comparison of YFP-KKT2 enrichment on CIR147 and 177 bp repeats we have included a new plot in Figure 7C which shows the relative enrichment of YFP-KKT2 on these two repeat types.

      We have added the following text, page 12:

      “Taking into account the relative number of CIR147 and 177 bp repeats in the current T. brucei genome (Cosentino et al., 2021; Rabuffo et al., 2024), comparative analyses demonstrated that YFP-KKT2 is enriched on both CIR147 and 177 bp repeats (Figure 7C).”

      (7) Suppl. Fig. 8 A - I believe there is a mistake here: KKT5 occurs twice in the plot, the one in the overlap region should be KKT1-4 instead, correct?

      Thanks for spotting this. It has been corrected.

      (8) The way that the authors mapped ChIP-seq data is potentially problematic when analyzing the same repeat type in different regions of the genome. The authors assigned reads that had multiple equally good mapping positions to one of these mapping positions, randomly.

      This is perfectly fine when analysing repeats by their type, independent of their position on the genome, which is what the authors did for the main conclusions of the work.

      However, several figures show the same type of repeat at different positions in the genome. Here, the authors risk that enrichment in one region of the genome 'spills' over to all other regions with the same sequence. Particularly, where they show YFP-KKT2 enrichment over intermediate- and mini-chromosomes (Fig. 7) due to the spillover, one cannot be sure to have found KKT2 in both regions.

      Instead, the authors could analyze only uniquely mapping reads / read-pairs where at least one mate is uniquely mapping. I realize that with this strict filtering, data will be much more sparse. Hence, I would suggest keeping the original plots and adding one more quantification where the enrichment over the whole region (e.g., all 177bp repeats on intermediate-/mini-chromosomes) is plotted using the unique reads (this could even be supplementary). This also applies to Fig. 4 B & C.

      We thank the reviewer for their thoughtful comments. Repetitive sequences are indeed challenging to analyze accurately, particularly in the context of short read ChIP-seq data. In our study, we aimed to address YFP-KKT2 enrichment not only over CIR147 repeats but also on 177 bp repeats, using both ChIP-seq and proteomics using synthetic TALE proteins targeted to the different repeat types. We appreciate the referee's suggestion to consider uniquely mapped reads; however, in the updated genome assembly, the 177 bp repeats are frequently immediately followed by long stretches of 70 bp repeats which can span several kilobases. The size and repetitive nature of these regions exceed the resolution limits of ChIP-seq. It is therefore difficult to precisely quantify enrichment across all chromosomes.

      Additionally, the repeat sequences are highly similar, and relying solely on uniquely mapped reads would result in the exclusion of most reads originating from these regions, significantly underestimating the relative signals. To address this, we used Bowtie2 with settings that allow multi-mapping, assigning reads randomly among equivalent mapping positions, but ensuring each read is counted only once. This approach is designed to evenly distribute signal across all repetitive regions and preserve a meaningful average.
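
      The effect of random assignment among equivalent positions can be illustrated with a small simulation. This sketch does not call Bowtie2 or reproduce its internals; the read names, locus names, seed and function name are hypothetical, and it only models the rule stated above: each read is counted once, at one of its equally good positions chosen at random.

```python
import random
from collections import Counter


def assign_reads(alignments, seed=0):
    """Assign each read to exactly one of its equally good positions.

    alignments maps a read name to the list of loci where it aligns equally well.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {read: rng.choice(loci) for read, loci in alignments.items()}


# Hypothetical reads: 1000 map equally well to two repeat copies, 10 are unique.
alignments = {f"multi{i}": ["repeat_copy_A", "repeat_copy_B"] for i in range(1000)}
alignments.update({f"uniq{i}": ["unique_locus"] for i in range(10)})

counts = Counter(assign_reads(alignments).values())
```

      The two repeat copies end up with roughly equal counts regardless of where the reads truly originated. This is why the approach preserves a meaningful average per repeat type, and also why it cannot distinguish occupancy at individual copies, the concern raised by the reviewer.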

      Single molecule methods such as DiMeLo (Altemose et al. 2022; PMID: 35396487) will need to be developed for T. brucei to allow more accurate and chromosome specific mapping of kinetochore or telomere protein occupancy at repeat-unique sequence boundaries on individual chromosomes.

      Reviewer #2 (Significance):

      This work is of high significance for chromosome/centromere biology, parasitology, and the study of antigenic variation. For chromosome/centromere biology, the conceptual advancement of different types of kinetochores for different chromosomes is a novelty, as far as I know. It would certainly be interesting to apply this study as a technical blueprint for other organisms with minichromosomes or chromosomes without known centromeric repeats. I can imagine a broad range of labs studying other organisms with comparable chromosomes to take note of and build on this study. For parasitology and the study of antigenic variation, it is crucial to know how intermediate- and mini-chromosomes are stable through cell division, as these chromosomes harbor a large portion of the antigenic repertoire. Moreover, this study also found a novel link between the homologous repair pathway and variant surface glycoproteins, via the 70bp repeats. How and at which stages during the process, 70bp repeats are involved in antigenic variation is an unresolved, and very actively studied, question in the field. Of course, apart from the basic biological research audience, insights into antigenic variation always have the potential for clinical implications, as T. brucei causes sleeping sickness in humans and nagana in cattle. Due to antigenic variation, T. brucei infections can be chronic.

      Thank you for supporting the novelty and broad interest of our manuscript.

      My field of expertise / Point of view:

      I'm a computer scientist by training and am now a postdoctoral bioinformatician in a molecular parasitology laboratory. The laboratory is working on antigenic variation in T. brucei. The focus of my work is on analyzing sequencing data (such as ChIP-seq data) and algorithmically improving bioinformatic tools.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public review):

      Summary: 

      The authors provide a resource to the systems neuroscience community, by offering their Python-based CLoPy platform for closed-loop feedback training. In addition to using neural feedback, as is common in these experiments, they include a capability to use real-time movement extracted from DeepLabCut as the control signal. The methods and repository are detailed for those who wish to use this resource. Furthermore, they demonstrate the efficacy of their system through a series of mesoscale calcium imaging experiments. These experiments use a large number of cortical regions for the control signal in the neural feedback setup, while the movement feedback experiments are analyzed more extensively.

      Strengths:

      The primary strength of the paper is the availability of their CLoPy platform. Currently, most closed-loop operant conditioning experiments are custom built by each lab and carry a relatively large startup cost to get running. This platform lowers the barrier to entry for closed-loop operant conditioning experiments, in addition to making the experiments more accessible to those with less technical expertise.

      Another strength of the paper is the use of many different cortical regions as control signals for the neurofeedback experiments. Rodent operant conditioning experiments typically record from the motor cortex and maybe one other region. Here, the authors demonstrate that mice can volitionally control many different cortical regions not limited to those previously studied, recording across many regions in the same experiment. This demonstrates the relative flexibility of modulating neural dynamics, including in non-motor regions.

      Finally, adapting the closed-loop platform to use real-time movement as a control signal is a nice addition. Incorporating movement kinematics into operant conditioning experiments has been a challenge due to the increased technical difficulties of extracting real-time kinematic data from video data at a latency where it can be used as a control signal for operant conditioning. In this paper they demonstrate that the mice can learn the task using their forelimb position, at a rate that is quicker than the neurofeedback experiments.

      Weaknesses:

      There are several weaknesses in the paper that diminish the impact of its strengths. First, the value of the CLoPy platform is not clearly articulated to the systems neuroscience community. Similarly, the resource could be better positioned within the context of the broader open-source neuroscience community. For an example of how to better frame this resource in these contexts, I recommend consulting the pyControl paper. Improving this framing will likely increase the accessibility and interest of this paper to a less technical neuroscience audience, for instance by highlighting the types of experimental questions CLoPy can enable.

      We appreciate the editor’s feedback regarding the clarity of the CLoPy platform's value and its positioning within the broader neuroscience community. We agree and understand the importance of effectively communicating the utility of CLoPy to both the systems neuroscience field and the wider open-source neuroscience community.

      To address this, we have revised the introduction and discussion sections of the manuscript to more clearly articulate the unique contributions of the CLoPy platform. Specifically:

      (1) We have emphasized how CLoPy can address experimental questions in systems neuroscience by highlighting its ability to enable real-time closed-loop experiments, such as investigating neural dynamics during behavior or studying adaptive cortical reorganization after injury. These examples are aimed at demonstrating its practical utility to the neuroscience audience.

      (2) We have positioned CLoPy within the broader open-source neuroscience ecosystem, drawing comparisons to similar resources like pyControl. We describe how CLoPy complements existing tools by focusing on real-time optical feedback and integration with genetically encoded indicators, which are becoming increasingly popular in systems neuroscience. We also emphasize its modularity and ease of adoption in experimental settings with limited resources.

      (3) To make the manuscript more accessible to a less technically inclined audience, we have restructured certain sections to focus on the types of experiments CLoPy enables, rather than the technical details of the implementation.

      We have consulted the pyControl paper, as suggested, and have used it as a reference point to improve the framing of our resource. We believe these changes will increase the accessibility and appeal of the paper to a broader neuroscience audience.

      While the dataset contains an impressive number of animals and cortical regions for the neurofeedback experiment, and an analysis of the movement-feedback experiments, my excitement for these experiments is tempered by the relative incompleteness of the dataset, as well as its description and analysis in the text. For instance, in the neurofeedback experiment, many of these regions only have data from a single mouse, limiting the conclusions that can be drawn. Additionally, quantitative results are not reported in the text of the document; these are needed to better understand the magnitude of the effects. Finally, the writing of the results section could use some work, as it currently reads more like a methods section.

      Thank you for your thoughtful and constructive feedback on our manuscript. We appreciate the time and effort you took to review our work and provide detailed suggestions for improvement. Below, we address the key points raised in your review:

      (1) Dataset Completeness: We acknowledge that the neurofeedback experiments include data from only a single mouse for some cortical regions, while for other cortical regions there are several animals. This was due to practical constraints during the study, and we understand the limitations this poses for drawing broad conclusions. We felt it was still important to include these data sets with smaller sample sizes, as they might be useful for others pursuing this direction in the future. To address this, we have revised the text to explicitly acknowledge these limitations and clarify that the results for some regions are exploratory in nature. We believe our flexible tool will provide a means for our lab and others to include more animals representing additional cortical regions in future studies. Importantly, we have included all raw and processed data as well as code for future analysis.

      (2) Quantitative Results: We recognize the importance of reporting quantitative results in the text for better clarity and interpretation. In response, we have added a more detailed description of the quantitative findings from both the neurofeedback and movement-feedback experiments. This includes effect sizes, statistical measures, and key numerical results to provide a clearer understanding of the degree and significance of the observed effects.

      (3) Results Section Writing: We appreciate your observation that parts of the results section read more like a methods section. To improve clarity and focus, we have restructured the results section to present the findings in a more concise and interpretative manner, while moving overly detailed descriptions of experimental procedures to the methods section.

      Suggestions for improved or additional experiments, data or analyses:

      Not necessary for this paper, but it would be interesting to see if the CLNF group could learn without auditory feedback.

      This is a great suggestion and certainly something that could be done in the future.

      There are no quantitative results in the results section. I would add important results to help the reader better interpret the data. For example, in: "Our results indicated that both training paradigms were able to lead mice to obtain a significantly larger number of rewards over time," You could show a number, with an appropriate comparison or statistical test, to demonstrate that learning was observed.

      Thank you for pointing this out. We now report quantitative values in the Results, in addition to the figure legends, and quote the relevant sentences here. “A ΔF/F0 threshold value was calculated from a baseline session on day 0 that would have allowed 25% performance. Starting from this basal performance of around 25% on day 1, mice (CLNF No-rule-change, N=23, n=60 and CLNF Rule-change, N=17, n=60) were able to discover the task rule and perform above 80% over ten days of training (Figure 4A, RM ANOVA p=2.83e-5), and Rule-change mice even learned a change in ROIs or rule reversal (Figure 4A, RM ANOVA p=8.3e-10, Table 5 for different rule changes). There were no significant differences between male and female mice (Supplementary Figure 3A).”
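      One way to read the threshold-setting step in the quoted passage is as a rank cut over per-trial peak ΔF/F0 values from the baseline session: pick the value that only about a quarter of baseline trials would have reached. The sketch below illustrates that idea only; it is not taken from the CLoPy code. The function names `baseline_threshold` and `success_rate`, the use of per-trial peaks as input, and the example numbers are all assumptions.

```python
def baseline_threshold(trial_peaks, target_success=0.25):
    """Return a threshold that roughly `target_success` of baseline trials reach.

    trial_peaks: peak dF/F0 per baseline trial (hypothetical input format).
    """
    if not trial_peaks:
        raise ValueError("need at least one baseline trial")
    if not 0 < target_success <= 1:
        raise ValueError("target_success must be in (0, 1]")
    ranked = sorted(trial_peaks, reverse=True)  # highest peaks first
    n_success = max(1, round(target_success * len(ranked)))
    # The n_success-th highest peak: trials at or above it count as successes.
    return ranked[n_success - 1]


def success_rate(trial_peaks, threshold):
    """Fraction of trials whose peak reaches the threshold."""
    return sum(p >= threshold for p in trial_peaks) / len(trial_peaks)


# Made-up baseline session with 8 trials, for illustration only.
peaks = [0.8, 1.9, 0.4, 2.6, 1.1, 0.7, 1.5, 0.9]
threshold = baseline_threshold(peaks)
print(threshold, success_rate(peaks, threshold))
```

      With tied peak values the achieved rate can exceed the target, so a real pipeline would probably need a tie-breaking rule.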

      For: "Performing this analysis indicated that the Raspberry Pi system could provide reliable graded feedback within ~63 ± 15 ms for CLNF experiments." The LED test shows the sending of the signal, but the actual delay for the audio generation might be longer. This is also longer than the 50 ms mentioned in the abstract.

      We appreciate the reviewer’s insightful comment. The latency reported (~63ms) was measured using the LED test, which captures the time from signal detection to output triggering on the Raspberry Pi GPIO. We agree that the total delay for auditory feedback generation could include an additional latency component related to the digital-to-analog conversion and speaker response. In our setup, we employ a fast Audiostream library written in C to generate the audio signal and expect the delay contribution to be negligible compared to the GPIO latency. Though we did not do this, it can be confirmed by an oscilloscope-based pilot measurement (for additional delay calculation). We have updated the manuscript to clarify that the 63 ± 15 ms value reflects the GPIO-triggered output latency, and we have revised the abstract to accurately state the delay as “~63 ms” rather than 50 ms. This ensures consistency and avoids underestimation of the latency. We have corrected the LED latency for CLNF and CLMF experiments in the abstract as well.

      It could be helpful to visualize an individual trial for each experiment type, for instance how the audio frequency changes as movement speed / calcium activity changes.

      We have added Supplementary Figure 8 that contains this data where you can see the target cortical activity trace, target paw speed, rewards, along with the audio frequency generated.

      The sample sizes are small (n=1) for a few groups. I am excited by the variety of regions recorded, so it could be beneficial for the authors to collect a few more animals to beef up the sample sizes.

      We've acknowledged that some of the sample sizes are small. Importantly, we have included raw and processed data as well as code for future analysis. We felt it was still important to include these data sets with smaller sample sizes, as they might be useful for others pursuing this direction in the future.

      I am curious as to why 60-trial sessions were used. Was it mostly for the convenience of a 30 min session, or were the animals getting satiated? If the former, would learning have occurred more rapidly with longer sessions?

      This is a great observation, and the answer is that it was mostly for logistical reasons. We tried not to keep animals head-fixed for more than 45 minutes in each session, as they become less engaged during long head-fixed sessions. After head-fixing them, it takes about 15 minutes to get the experiment going, and therefore recorded sessions of 30-40 minutes seemed appropriate before the animals stop being engaged or become satiated in the task. We provided supplemental water after the sessions and observed that the animals consumed it, so they were not fully satiated during the sessions even when they performed well in the task and obtained the maximum number of rewards. We also had inter-trial rest periods of 10 s that lengthened the session duration. We think it would be interesting to explore the relationship between session duration (number of trials) and task learning progression over days in a separate study.

      Figure 4E is interesting: it seems like the changes in the distribution of deltaF were in both positive and negative directions, instead of just positive. I'd be curious as to the authors' thoughts on why this is the case. Relatedly, I don't see Figure 4E, and a few other subplots, mentioned in the text. As a general comment, I would address each subplot in the text.

      We have split Figure 4 into two to keep the figures more readable. Previous Figure 4E-H are now Figure 5A-D in the revised manuscript. The online real-time CLNF sessions used a moving-window average to calculate ΔF/F0, and the figures were generated by averaging over the whole recorded sessions. We have added text in Methods under the “Online ΔF/F0 calculation” and “Offline ΔF/F0 calculation” sections to make clear how we perform ΔF/F0 normalization based on average fluorescence over the entire session. This method of normalization does raise the baseline, so that some peaks appear to be below zero. Additionally, it is unclear what strategy animals employ to achieve the rule-specific target activity. The task did not constrain them to a specific strategy for cortical activation; they were rewarded as long as they crossed the threshold in the target ROI(s). For example, in 2-ROI experiments, to increase ROI1-ROI2 target activity, they could increase the activity of ROI1 relative to ROI2 or decrease the activity of ROI2 relative to ROI1; both would have led to a reward as long as the result crossed the threshold.

      We have now addressed and added reference to the figures in the text in Results under “Mice can explore and learn an arbitrary task, rule, and target conditions” and “Mice can rapidly adapt to changes in the task rule” sections - thanks for pointing this out.

      For: "In general, all ROIs assessed that encompassed sensory, pre-motor, and motor areas were capable of supporting increased reward rates over time," I would provide a visual summary showing the learning curves for the different types of regions.

      We have rewritten this section to emphasize that these conclusions were based on pooled data from multiple regions of interest. The sample sizes for each type of region are different and some are missing. We believe it would be incomplete and not comparable to present this as a regular analysis since the sample sizes were not balanced. We would be happy to dive deeper into this and point to the raw and processed dataset if anyone would like to explore this further by GitHub or other queries.

      Relatedly, I would further explain the fast vs slow learners, and if they mapped onto certain regions.

      Mice were categorized into fast or slow learners based on the slope of learning over days (reward progression over the days) as shown in Supplementary Figure 3C,D. Our initial aim was not to probe cortical regions that led to fast vs slow learning but this was a grouping we did afterwards. Based on the analysis we did, the fast learners included the sensory (V1), somatosensory (BC, HL), and motor (M1, M2) areas, while the slow learners included the motor (M1, M2), and higher order (TR, RL) cortical areas. Testing all dorsal cortical areas would be prudent to establish their role in fast or slow learning and it is an interesting future direction.
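      As an illustration of grouping animals by the slope of learning over days, the sketch below fits a least-squares line to each mouse's daily success rate and splits the cohort at the median slope. The median split, the function names, and the example values are assumptions made for this sketch; the response does not state the criterion that was actually used.

```python
import statistics


def learning_slope(daily_success):
    """Least-squares slope of success rate (percent) against training day."""
    n = len(daily_success)
    if n < 2:
        raise ValueError("need at least two days of data")
    days = range(1, n + 1)
    mean_x = sum(days) / n
    mean_y = sum(daily_success) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, daily_success))
    sxx = sum((x - mean_x) ** 2 for x in days)
    return sxy / sxx  # percentage points gained per day


def split_learners(daily_success_by_mouse):
    """Label each mouse 'fast' or 'slow' by a median split on slope (assumption)."""
    slopes = {m: learning_slope(v) for m, v in daily_success_by_mouse.items()}
    cutoff = statistics.median(slopes.values())
    return {m: ("fast" if s > cutoff else "slow") for m, s in slopes.items()}


# Made-up daily success rates (percent) for four hypothetical mice.
cohort = {
    "m1": [25, 40, 55, 70, 85],
    "m2": [25, 30, 35, 40, 45],
    "m3": [25, 45, 65, 80, 90],
    "m4": [25, 28, 30, 35, 38],
}
print(split_learners(cohort))
```

      A median split always yields two roughly equal groups, so it cannot by itself show that distinct learner types exist; a fixed slope cutoff or a clustering step would be alternative choices.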

      Also I would make the labels for these plots (e.g. Supp Fig3) more intuitive, versus the acronyms currently used.

      We have made the labels more expressive and explained the acronyms below Supplementary Figure 3.

      The CLMF animals showed a decrease in latency across learning, what about the CLNF animals? There is currently no mention in the text or figures.

      We have now incorporated the CLNF task latency data into both the Results text and Figure 4C. Briefly, task latency decreased as performance improved, increased following a rule change, and then decreased again as the animals relearned the task. The previous Figure 4C has been updated to Figure 4D, and the former Figure 4D has been moved to Supplementary Figure 4E.

      Reviewer #2 (Public review):

      Summary:

      In this work, Gupta & Murphy present several parallel efforts. On one side, they present the hardware and software they use to build a head-fixed mouse experimental setup that they use to track in "real-time" the calcium activity in one or two spots at the surface of the cortex. On the other side, they present another setup that they use to take advantage of the "real-time" version of DeepLabCut with their mice. The hardware and software that they used/developed is described at length, both in the article and in a companion GitHub repository. Next, they present experimental work that they have done with these two setups, training mice to max out a virtual cursor to obtain a reward, by taking advantage of auditory tone feedback that is provided to the mice as they modulate either (1) their local cortical calcium activity, or (2) their limb position.

      Strengths:

      This work illustrates the fact that thanks to readily available experimental building blocks, body movement and calcium imaging can be carried out using readily available components, including imaging the brain using an incredibly cheap consumer electronics RGB camera (RGB Raspberry Pi Camera). It is a useful source of information for researchers that may be interested in building a similar setup, given the highly detailed overview of the system. Finally, it further confirms previous findings regarding the operant conditioning of the calcium dynamics at the surface of the cortex (Clancy et al. 2020) and suggests an alternative based on DeepLabCut to the motor tasks that aim to image the brain at the mesoscale during forelimb movements (Quarta et al. 2022).

      Weaknesses:

      This work covers 3 separate research endeavors: (1) The development of two separate setups and their corresponding software. (2) A study that is highly inspired by the Clancy et al. 2020 paper on the modulation of the local cortical activity measured through a mesoscale calcium imaging setup. (3) A study of the mesoscale dynamics of the cortex during forelimb movement learning. Sadly, the analyses of the physiological data appear incomplete, and more generally the paper tends to overstate several points:

      In contrast to the introductory statements of the article, closed-loop physiology in rodents is a well-established research topic. Beyond auditory feedback, this includes optogenetic feedback (O'Connor et al. 2013, Abbasi et al. 2018, 2023), electrical feedback in hippocampus (Girardeau et al. 2009), and much more.

      We have included and referenced these papers in our introduction section (quoted below) and rephrased the part where our previous text indicated there are fewer studies involving closed-loop physiology.

      “Some related studies have demonstrated the feasibility of closed-loop feedback in rodents, including hippocampal electrical feedback to disrupt memory consolidation (Girardeau et al. 2009), optogenetic perturbations of somatosensory circuits during behavior (O'Connor et al. 2013), and more recent advances employing targeted optogenetic interventions to guide behavior (Abbasi et al. 2023).”

      The behavioral setups that are presented are representative of the state of the art in the field of mesoscale imaging/head fixed behavior community, rather than a highly innovative design. In particular, the closed-loop latency that they achieve (>60 ms) may be perceived by the mice. This is in contrast with other available closed-loop setups.

      We thank the reviewer for this thoughtful comment and fully agree that our closed-loop latency is larger than that achieved in some other contemporary setups. Our primary aim in presenting this work, however, is not to compete with the lowest possible latencies, but to provide an open-source, accessible, and flexible platform that can be readily adopted by a broad range of laboratories. By building on widely available and lower-cost components, our design lowers the barrier of entry for groups that wish to implement closed-loop imaging and behavioral experiments, while still achieving latencies well within the range that can support many biologically meaningful applications.

      For example, our latency (~60 ms) remains compatible with experimental paradigms such as:

      Motor learning and skill acquisition, where sensorimotor feedback on the scale of tens to hundreds of milliseconds is sufficient to modulate performance.

      Operant conditioning and reward-based learning, in which reinforcement timing windows are typically broader and not critically dependent on sub-20 ms latencies.

      Cortical state dependent modulation, where feedback linked to slower fluctuations in brain activity (hundreds of milliseconds to seconds) can provide valuable insight.

      Studies of perception and decision-making, in which stimulus response associations often unfold on behavioral timescales longer than tens of milliseconds.

      We believe that emphasizing openness, affordability, and flexibility will encourage widespread adoption and adaptation of our setup across laboratories with different research foci. In this way, our contribution complements rather than competes with ultra-low-latency closed-loop systems, providing a practical option for diverse experimental needs.

      Through the paper, there are several statements that point out how important it is to carry out this work in a closed-loop setting with an auditory feedback, but sadly there is no "no feedback" control in cortical conditioning experiments, while there is a no-feedback condition in the forelimb movement study, which shows that learning of the task can be achieved in the absence of feedback.

      We fully agree that such a control would provide valuable insight into the contribution of feedback to learning in the CLNF paradigm. In designing our initial experiments, we envisioned multiple potential control conditions, including No-feedback and Random-feedback. However, our first and primary objective was to establish whether mice could indeed learn to modulate cortical ROI activation through auditory feedback, and to further investigate this across multiple cortical regions. For this reason, we focused on implementing the CLNF paradigm directly, without the inclusion of these additional control groups. To broaden the applicability of the system, we subsequently adapted the platform to the CLMF experiments, where we did incorporate a No-feedback group. These results, as the reviewer notes, strengthen the evidence for the role of feedback in shaping task performance. We agree that the inclusion of a No-feedback control group in the CLNF paradigm will be crucial in future studies to further dissect the specific contribution of feedback to cortical conditioning.

      The analysis of the closed-loop neuronal data behavior lacks controls. Increased performance can be achieved by actively modulating only one of the two ROIs, but this is not clearly analyzed (for instance, by looking at the timing of the calcium signal modulation across the two ROIs). It seems that overall ROIs 1 and 2 covary, in contrast to Clancy et al. 2020. How can this be explained?

      We agree that the possibility of increased performance being driven by modulation of a single ROI is an important consideration. Our study indeed began with 1-ROI closed-loop experiments. In those early experiments, while we did observe animals improving performance across days, we realized that daily variability in ongoing cortical GCaMP activity could lead to fluctuations in threshold-crossing events. The 2-ROI design was subsequently introduced to reduce this variability, as the target activity was defined as the relative activity between the two ROIs (e.g., ROI1 – ROI2). This approach offered a more stable signal by normalizing ongoing fluctuations. In our analysis of the early 2-ROI experiments, we observed that animals adopted diverging strategies to achieve threshold crossings. Specifically, some animals increased activity in ROI1 relative to ROI2, while others decreased activity in ROI2 to accomplish the same effect. Once discovered, each animal consistently adhered to its chosen strategy throughout subsequent training sessions. This was an early and intriguing observation, but as the experiments were not originally designed to systematically test this effect, we limited our presentation to the analysis of a small number of animals (shown in Figure 11). We have added details about this observation in our Results section as well, quoted below-

      “In the 2-ROI experiment where the task rule required “ROI1 - ROI2” activity to cross a threshold for reward delivery, mice displayed divergent strategies. Some animals predominantly increased ROI1 activity, whereas others reduced ROI2 activity, both approaches leading to successful threshold crossing (Figure 11)”.

      We hope this clarifies how the use of two ROIs helps explain the apparent covariation of the signals, and why some divergence from the observations of Clancy et al. (2020) may be expected.
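      The two strategies described in this response can be made concrete with a small sketch of the "ROI1 - ROI2" rule. It is an illustration under stated assumptions, not the authors' analysis code: `rewarded` and `dominant_strategy` are hypothetical helpers, the strategy labels are ad hoc, and the ΔF/F0 values are made up.

```python
def rewarded(roi1, roi2, threshold):
    """Apply the 'ROI1 - ROI2' rule: reward when the difference reaches threshold."""
    return (roi1 - roi2) >= threshold


def dominant_strategy(base, sample):
    """Label which ROI changed more between a baseline and a later sample.

    base, sample: (roi1, roi2) dF/F0 pairs. Labels are illustrative only.
    """
    d1 = sample[0] - base[0]  # change in ROI1
    d2 = sample[1] - base[1]  # change in ROI2
    if d1 > 0 and abs(d1) > abs(d2):
        return "increase ROI1"
    if d2 < 0 and abs(d2) > abs(d1):
        return "decrease ROI2"
    return "mixed or other"


baseline = (1.0, 1.0)   # both ROIs at the same level: difference is 0
threshold = 1.5         # hypothetical reward threshold on ROI1 - ROI2

animal_a = (2.6, 1.0)   # raises ROI1, leaves ROI2 unchanged
animal_b = (1.0, -0.6)  # leaves ROI1 unchanged, lowers ROI2

for sample in (animal_a, animal_b):
    print(rewarded(*sample, threshold), dominant_strategy(baseline, sample))
```

      Both samples produce the same difference and therefore the same reward, which is why the difference signal alone cannot distinguish the strategies; the per-ROI traces are needed for that.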

      Reviewer #3 (Public review):

      Summary:

      The study demonstrates the effectiveness of a cost-effective closed-loop feedback system for modulating brain activity and behavior in head-fixed mice. The authors have tested a real-time closed-loop feedback system in head-fixed mice with two types of graded feedback: 1) Closed-loop neurofeedback (CLNF), where feedback is derived from neuronal activity (calcium imaging), and 2) Closed-loop movement feedback (CLMF), where feedback is based on observed body movement. It is a Python-based open-source system, which the authors call CLoPy. The authors also claim to provide all software, hardware schematics, and protocols to adapt it to various experimental scenarios. The system is capable and can be adapted to a wide range of use cases.

      The authors have shown that their system can control both positive (water drop) and negative (buzzer-vibrator) reinforcement. The study also shows that, using the closed-loop system, mice performed better, learned an arbitrary task, and adapted to a change in the rule. By integrating real-time feedback based on cortical GCaMP imaging and behavior tracking, the authors have provided strong evidence that such closed-loop systems can be instrumental in exploring the dynamic interplay between brain activity and behavior.

      Strengths:

      Simplicity of feedback systems designed. Simplicity of implementation and potential adoption.

      Weaknesses:

      Long latencies, due to slow Ca2+ dynamics and slow imaging (15 FPS), may limit the application of the system.

      We appreciate the reviewer’s comment and agree that latency is an important factor in our setup. The latency arises partly from the inherent slow kinetics of calcium signaling and GCaMP6s, and partly from the imaging rate of 15 FPS (every 66 ms). These limitations can be addressed in several ways: for example, using faster calcium indicators such as GCaMP8f, or adapting the system to electrophysiological signals, which would require additional processing capacity. In our implementation, image acquisition was fixed at 15 FPS to enable real-time frame processing (256 × 256 resolution) on Raspberry Pi 4B devices. With newer hardware, such as the Raspberry Pi 5, substantially higher acquisition and processing rates are feasible (although we have not yet benchmarked this extensively). More powerful platforms such as Nvidia Jetson or conventional PCs would further support much faster data acquisition and processing.

      Major comments:

      (1) Page 5 paragraph 1: "We tested our CLNF system on Raspberry Pi for its compactness, general-purpose input/output (GPIO) programmability, and wide community support, while the CLMF system was tested on an Nvidia Jetson GPU device." Can these programs and hardware be integrated with a Windows-based system and a microcontroller (Arduino/Teensy)? For broad adaptability, that is what a lot of labs would already have (please comment/discuss).

      While we tested our CLNF system on a Raspberry Pi (chosen for its compactness, GPIO programmability, and large user community) and our CLMF system on an Nvidia Jetson GPU device (to leverage real-time GPU-based inference), the underlying software is fully written in Python. This design choice makes the system broadly adaptable: it can be run on any device capable of executing Python scripts, including Windows-based PCs, Linux machines, and macOS systems. For hardware integration, we have confirmed that the framework works seamlessly with microcontrollers such as Arduino or Teensy, requiring only minor modifications to the main script to enable sending and receiving of GPIO signals through those boards. In fact, we are already using the same system in an in-house project on a Linux-based PC where an Arduino is connected to the computer to provide GPIO functionality. Furthermore, the system is not limited to Raspberry Pi or Arduino boards; it can be interfaced with any GPIO-capable devices, including those from Adafruit and other microcontroller platforms, depending on what is readily available in individual labs. Since many neuroscience and engineering laboratories already possess such hardware, we believe this design ensures broad accessibility and ease of integration across diverse experimental setups.

      (2) Hardware Constraints: The reliance on Raspberry Pi and Nvidia Jetson (which is expensive) for real-time processing could introduce latency issues (~63 ms for CLNF and ~67 ms for CLMF). This latency might limit precision for faster or more complex behaviors, which the authors should address in the discussion section.

      In our system, we measured latencies of approximately 63 ms for CLNF and 67 ms for CLMF. While such latencies indeed limit applications requiring millisecond precision, such as fast whisker movements, saccades, or fine-reaching kinematics, we emphasize that many relevant behaviors, including postural adjustments, limb movements, locomotion, and sustained cortical state changes, occur on timescales that are well within the capture range of our system. Thus, our platform is appropriate for a range of mesoscale behavioral studies, a point that probably needs to be discussed more. It is also important to note that these latencies are not solely dictated by hardware constraints. A significant component arises from the inherent biological dynamics of the calcium indicator (GCaMP6s) and calcium signaling itself, which introduce slower temporal kinetics independent of processing delays. Newer variants, such as GCaMP8f, offer faster response times and could further reduce effective biological latency in future implementations.

      With respect to hardware, we acknowledge that Raspberry Pi provides a low-cost solution but contributes to modest computational delays, while Nvidia Jetson offers faster inference at higher cost. Our choice reflects a balance between accessibility, cost-effectiveness, and performance, making the system deployable in many laboratories. Importantly, the modular and open-source design means the pipeline can readily be adapted to higher-performance GPUs or integrated with electrophysiological recordings, which provide higher temporal resolution. Finally, we agree with the reviewer that the issue of latency highlights deeper and interesting questions regarding the temporal requirements of behavior classification. Specifically, how much data (in time) is required to reliably identify a behavior, and what is the minimum feedback delay necessary to alter neural or behavioral trajectories? These are critical questions for the design of future closed-loop systems and ones that our work helps frame.

      We have added a slightly modified version of our response above in the discussion section under “Experimental applications and implications”.

      (3) Neurofeedback Specificity: The task focuses on mesoscale imaging and ignores finer spatiotemporal details. Sub-second events might be significant in more nuanced behaviors. Can this be discussed in the discussion section?

      This is a great point, and we have added the following to the discussion section. “In the case of CLNF we have focused on regional cortical GCaMP signals that are relatively slow in kinetics. While such changes are well suited for transcranial mesoscale imaging assessment, it is possible that cellular 2-photon imaging (Yu et al. 2021) or preparations that employ cleared crystal skulls (Kim et al. 2016) could resolve more localized and higher frequency kinetic signatures.”

      (4) The activity over 6s is being averaged to determine if the threshold is being crossed before the reward is delivered. This is a rather long duration of time during which the mice may be exhibiting stereotyped behaviors that may result in the changes in DFF that are being observed. It would be interesting for the authors to compare (if data is available) the behavior of the mice in trials where they successfully crossed the threshold for reward delivery and in those trials where the threshold was not breached. How is this different from spontaneous behavior and behaviors exhibited when they are performing the test with CLNF? 

      We would like to emphasize that we are not directly averaging activity over 6 s to compare against the reward threshold. Instead, the preceding 6 s of activity is used solely to compute a dynamic baseline for ΔF/F<sub>0</sub> (ΔF/F<sub>0</sub> = (F – F<sub>0</sub>)/F<sub>0</sub>). Here, F<sub>0</sub> is calculated as the mean fluorescence intensity over the prior 6 s window and is updated continuously throughout the session. This baseline is then subtracted from the instantaneous fluorescence signal to detect relative changes in activity. The reward threshold is therefore evaluated against these baseline-corrected ΔF/F<sub>0</sub> values at the current time point, not against an average over 6 s. This moving-window baseline correction is a standard approach in calcium imaging analyses, as it helps control for slow drifts in signal intensity, bleaching effects, or ongoing fluctuations unrelated to the behavior of interest. Thus, the 6-s window is not introducing a temporal lag in reward assignment but is instead providing a reference to detect rapid increases in cortical activity. We have added the term dynamic baseline to the Methods to clarify.
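
      The dynamic baseline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `dynamic_dff`, the sampling rate `fs`, and the choice to leave the first window undefined are assumptions made for illustration. Only the 6 s window and the formula ΔF/F<sub>0</sub> = (F – F<sub>0</sub>)/F<sub>0</sub> come from the response.

```python
import numpy as np

def dynamic_dff(f, fs, window_s=6.0):
    """Causal dF/F0 for a 1-D trace: F0 is the mean of the preceding window.

    f        : fluorescence samples for one ROI (hypothetical input)
    fs       : sampling rate in Hz (assumed; not stated in the response)
    window_s : baseline window length in seconds (6 s in the response)
    """
    f = np.asarray(f, dtype=float)
    w = int(round(window_s * fs))                 # samples per baseline window
    csum = np.concatenate(([0.0], np.cumsum(f)))  # csum[i] == f[:i].sum()
    dff = np.full(f.shape, np.nan)                # undefined until a full window exists
    idx = np.arange(w, f.size)
    f0 = (csum[idx] - csum[idx - w]) / w          # mean of f[t-w:t], excluding sample t
    dff[w:] = (f[w:] - f0) / f0
    return dff

# Example: flat trace with a 10% step on the last sample.
trace = np.full(30, 100.0)
trace[-1] = 110.0
dff = dynamic_dff(trace, fs=10, window_s=1.0)
```

      In this sketch the value compared against a reward threshold would be the single most recent entry of `dff`, which matches the point that the window serves as a reference rather than as an averaging period for the decision.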

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      Additional suggestions for improved or additional experiments, data or analyses.

      For: "Looking closely at their reward rate on day 5 (day of rule change), they had a higher reward rate in the second half of the session as compared to the first half, indicating they were adapting to the rule change within one session." It would be helpful to see this data, and it would be good to see within-session learning on the rule change day.

      Thank you for pointing this out. We had missed referencing the figure in the text, and have now added a citation to Supplementary Figure 4A, which shows the cumulative rewards for each day of training. As seen in the plot for day 5, the cumulative rewards are comparable to those on day 1, with most rewards occurring during the second half of the session.

      For: "These results suggest that motor learning led to less cortical activation across multiple regions, which may reflect more efficient processing of movement-related activity," it could also be the case that the behaviour became more stereotyped over learning, which would lead to more concentrated, correlated activity. To test this, it would be good to look at the limb variability across sessions. Similarly, if it is movement-related, there should be good decoding of limb kinematics.

      Indeed, we observed that behavior became more stereotyped over the course of learning, as shown in Supplementary Figure 4C, 4D. One plausible explanation for the reduction in cortical activation across multiple regions is that behavior itself became more stereotyped, a possibility we have explored in the manuscript. Specifically, forelimb movements during the trial became increasingly correlated as mice improved on the task, particularly in the groups that received auditory feedback (Rule-change and No-rule-change groups; Figure 8). As movements became more correlated, overall body movements during trials decreased and aligned more closely with the task rule (Figure 9D). This suggests that reduced cortical activity may in part reflect changes in behavior. Importantly, however, in the Rule-change group, we observed that on the day of the rule switch (day 5), when the target shifted from the left to the right forelimb, cortical activity increased bilaterally (Figure 9A–C). This finding highlights our central point: groups that received feedback (Rule-change and No-rule-change) were able to identify the task rule more effectively, and both their behavior and cortical activity became more specifically aligned with the rule compared to the No-feedback group. We agree with the reviewers that additional analyses along these lines would be valuable future directions. To facilitate this, we have included the movement data for readers who may wish to pursue further analyses; details can be found under “Data and code availability” in the Methods section. However, given the limited sample sizes in our dataset and the need to keep the manuscript focused on the central message, we felt that including these additional analyses here would risk obscuring the main findings.

      For: "We believe the decrease in ΔF/F0peak is unlikely to be driven by changes in movement, as movement amplitudes did not decrease significantly during these periods (Figure 7D CLMF Rule-change)." I would formally compare the two conditions. This is an important control. Also, another way to see if the change in deltaF is related to movement would be to see if you can predict movement from the deltaF.

      Figure 7D in the previous version is Figure 9D in the current revision of the manuscript. We have assessed this for the examples shown by graphing the movement data. Unfortunately, there is not enough of that data to do a group analysis of movement magnitude. We suggest that this would be an excellent future direction that would take advantage of the flexible, open-source nature of our tool.

      Recommendations for improving the writing and presentation.

      In the abstract there is no mention of the rationale for the project, or the resulting significance. I would modify this to increase readership by the behavioral neuroscience community. Similarly, the introduction also doesn't highlight the value of this resource for the field. Again, I think the pyControl paper does a good job of this. For readability, I would add more subheadings earlier in the results, to separate the different technical aspects of the system.

      We have revised the introduction to include the rationale for the project, its potential implications, and its relevance for translational research. We have also framed the work within the broader context of the behavioral and systems neuroscience community. We greatly appreciate this suggestion, as we believe it enhances the clarity and accessibility of the manuscript for the community.

      For: "While brain activity can be controlled through feedback, other variables such as movements have been less studied, in part because their analysis in real time is more challenging." I would highlight research that has studied the control of behavior through feedback, such as the Mathis paper where mice learn to pull a joystick to a virtual box, and adapt this motion to a force perturbation.

      We have added a citation to the Mathis paper and describe this as an additional form of feedback. The text is quoted below:

      “Opportunities also exist in extending real time pose classification (Forys et al. 2020; Kane et al. 2020) and movement perturbation (Mathis et al. 2017) to shape aspects of an animal’s motor repertoire.”

      Some of the results content would be better suited for the methods, one example: "A previous version of the CLNF system was found to have non-linear audio generation above 10 kHz, partly due to problems in the audio generation library and partly due to the consumer-grade speaker hardware we were employing. This was fixed by switching to the Audiostream (https://github.com/kivy/audiostream) library for audio generation and testing the speakers to make sure they could output the commanded frequencies"

      This is now moved to the Methods section.

      For: "There are reports of cortical plasticity during motor learning tasks, both at cellular and mesoscopic scales (17-19), supporting the idea that neural efficiency could improve with learning," not sure I agree with this, the studies on cortical plasticity are usually to show a neural basis for the learning observed, efficiency is separate from this.

      We have modified this statement to remove the concept of efficiency: “There are reports of cortical plasticity during motor learning tasks, both at cellular and mesoscopic scales (17-19).”

      The paragraph that opens "Distinct task- and reward-related cortical dynamics" that describes the experiment should appear in the previous section, as the data is introduced there.

      We have moved the mentioned paragraphs to the previous section, where we presented the data and other experiment details. This makes the text more readable and contextual.

      I would present the different ROI rules with better descriptors and visualization to improve the readability.

      We have added Supplementary Figure 7, which provides visualizations of the ROIs across all task rules used in the CLNF experiments.

      Minor corrections to the text and figures.

      Figure 1 is a little crowded, combining the CLNF and CLMF experiments, I would turn this into a 2 panel figure, one for each, similar to how you did figure 2.

      We have revised Figure 1 to include two panels, one for CLNF and one for CLMF. The colored components indicate elements specific to each setup, while the uncolored components represent elements shared between CLNF and CLMF. Relevant text in the manuscript is updated to refer to these figures.

      For Figure 2, the organization of the CLMF section is not intuitive for the reader. I would reorder it so it has a similar flow as the CLNF experiment.

      We have revised the figure by updating the layout of panel B (CLMF) to align with panel A (CLNF), thereby creating a more intuitive and consistent flow between the panels. We appreciate this helpful suggestion, which we believe has substantially improved the clarity of the figure. The corresponding text in the manuscript has also been updated to reflect these changes.

      For Figure 3, highlight that C and E are examples. They also seem a little out of place, so they could even be removed.

      We have now explicitly labeled Figures 3C and 3E as representative examples (figure legend and on figure itself). We believe including these panels provides helpful context for readers: Figure 3C illustrates how the ROIs align on the dorsal cortical brain map with segmented cortical regions, while Figure 3E shows example paw trajectories in three dimensions, allowing visualization of the movement patterns observed during the trials.

      In the plots, I would add sample sizes, for instance, in CLNF learning curve in Figure 4A, how many animals are in each group? 

      We have labeled Figure 4 with the number of animals used in CLNF (No-rule-change, N=23; Rule-change, N=17) and CLMF (Rule-change, N=8; No-rule-change, N=4; No-feedback, N=4).

      Also, Figure 7 for example, which figures are single-sessions, versus across animals? For Figure 7c, what time bin is the data taken from?

      We have clarified this now and mentioned it in all the figures. Figure 7 in the previous version is Figure 9 in the current updated manuscript. Figure 9A is from individual sessions on different days from the same mouse. Figure 9B is the group average reward-centered ΔF/F<sub>0</sub> activity in different cortical regions (Rule-change, N=8; No-rule-change, N=4; No-feedback, N=4). Figure 9C shows average ΔF/F<sub>0</sub> peak values obtained within -1 s to +1 s centered around the reward point (N=8).

      It says "punish" in Figure 3, but there is no punishment?

      Yes, the task did not involve punishment. Each trial resulted in either a success, which is followed by a reward, or a failure, which is followed by a buzzer sound. To better reflect these outcomes, we have updated Figure 3 and replaced the labels “Reward” with “Success” and “Punish” with “Failure.”

      The regression on 5c doesn't look quite right, also this panel is not mentioned in the text.

      The figure referred to by the reviewer as Figure 5 is now presented as Figure 6 in the revised manuscript. Regarding the reviewer’s observation about the regression line in the left panel of Figure 5C, the apparent misalignment arises because the majority of the data points are densely clustered at the center of the scatter plot, where they overlap substantially. The regression line accurately reflects this concentration of overlapping data. To improve clarity, we have updated the figure and ensured that it is now appropriately referenced in the Results section.

      Reviewer #2 (Recommendations for the authors):

      (1) There would be many interesting observations and links between the peripheral and cortical studies if there was a body video available during the cortical study. Is there any such data available?

      We agree that a detailed analysis of behavior during the CLNF task would be necessary to explore any behavioral correlates of success in the task. Unfortunately, we do not have sufficient video of the whole body to perform such an analysis.

      (2) The text (p. 24) states: [intracortical GCAMP transients measured over days became more stereotyped in kinetics and were more correlated (to each other) as the task performance increased over the sessions (Figure 7E).] But I cannot find this quantification in the figures or text?

      Figure 7 in the previous version of the manuscript now appears as Figure 9. In this figure, we present cortical activity across selected regions during trials, and in Figure 9E we highlight that this activity becomes more correlated. Since we did not formally quantify variability, we have removed the previous claim that the activity became stereotyped and revised the text in the updated manuscript accordingly.

      Typos:

      10-serest c (page 13)

      Inverted color codes in figure 4E vs F

      Reviewer #3 (Recommendations for the authors):

      We have mostly attempted to limit the feedback to suggestions and posed a few questions that might be interesting to explore given the dataset the authors have collected.

      Comments:

      In closed-loop systems, latency is a primary concern, and the authors have successfully tested the latency of the system (Delay): the time from detection of an event to the reaction was less than 67 ms.

      We have commented on the issues and limitations caused by latency, and potential future directions to overcome these challenges in responses to some of the previous comments.

      Additional major comments:

      "In general, all ROIs assessed that encompassed sensory, pre-motor, and motor areas were capable of supporting increased reward rates over time (Figure 4A, Animation 1)." Fig 4A is merely showing change in task performance over time and does not have information regarding the changes observed specific to CLNF for each ROI.

      We acknowledge that the sample size for individual ROI rules was not sufficient for meaningful comparisons. To address this limitation, we pooled the data across all the rules tested. The manuscript includes a detailed list of the rules along with their corresponding sample sizes for transparency.

      A ΔF/F<sub>0</sub> threshold value was calculated from a baseline session on day 0 that would have allowed 25% performance. Starting from this basal performance of around 25% on day 1, mice (CLNF No-rule-change, n=28 and CLNF Rule-change, n=13). It is unclear what the replicates here are. Trials or mice? The corresponding Figure legend has a much smaller n value.

      Thank you for pointing this out. We realized that we had not indicated the sample replicates in the figure, and the use of n instead of N for the number of animals may have been misleading. We have now corrected the notation and clarified this information in the figure to resolve the discrepancy.
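
      As an illustration of the day-0 thresholding mentioned in the comment above, one way to obtain a threshold that would have allowed roughly 25% performance is to take a percentile of per-trial peak ΔF/F<sub>0</sub> values from the baseline session. This is a sketch under stated assumptions, not the procedure used in the manuscript: the use of per-trial peaks, the quantile rule, and the names `threshold_for_target_rate` and `success_rate` are all hypothetical.

```python
import numpy as np

def threshold_for_target_rate(baseline_peaks, target_rate=0.25):
    """Threshold exceeded by about `target_rate` of baseline trials.

    baseline_peaks : per-trial peak dF/F0 values from a baseline session
                     (assumed summary statistic; hypothetical input)
    target_rate    : desired fraction of rewarded trials (0.25 in the text)
    """
    peaks = np.asarray(baseline_peaks, dtype=float)
    # The (1 - target_rate) quantile leaves target_rate of trials above it.
    return float(np.quantile(peaks, 1.0 - target_rate))

def success_rate(peaks, threshold):
    """Fraction of trials whose peak strictly exceeds the threshold."""
    return float(np.mean(np.asarray(peaks, dtype=float) > threshold))

# Synthetic baseline with 100 trials whose peaks are 1, 2, ..., 100.
baseline = np.arange(1, 101)
thr = threshold_for_target_rate(baseline, target_rate=0.25)
rate = success_rate(baseline, thr)
```

      Fixing the threshold from a separate baseline session, as in this sketch, means that later changes in success rate reflect changes in the animal's activity rather than changes in the criterion.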

      What were the replicates for each ROI pairs evaluated?

      Each ROI rule and number of mice and trials are listed in Table 5 and Table 6.

      Our analysis revealed that certain ROI rules (see description in methods) lead to a greater increase in success rate over time than others (Supplementary Figure 3D). The Supplementary figures 3C and 3D are blurry and could use higher resolution images. 

      We have increased the font size of the text that was previously difficult to read and re-exported the figure at a higher resolution (300 DPI). We believe these changes will resolve the issue.

      Also, it will help the reader if a visual representation of the ROI pairs is provided, instead of the text view. One interesting question is whether there are anatomical biases to fast vs slow learning pairs (directionality: anterior/posterior, distance between the selected ROIs, etc.). This could be interesting to tease apart.

      We have added Supplementary Figure 7, which provides visualizations of the ROIs across all task rules used in the CLNF experiments. While a detailed investigation of the anatomical basis of fast versus slow learning cortical ROIs is beyond the scope of the present study, we agree that this represents an exciting future direction for further research.

      How distant should the ROIs be to achieve increased task performance?

      We appreciate this insightful question. We did not specifically test this scenario. In our study, we selected 0.3 × 0.3 mm ROIs centered on the standard AIBS mouse brain atlas (CCF). At this resolution, ROIs do not overlap, regardless of their placement in a two-ROI experiment. Furthermore, because our threshold calculations are based on baseline recordings, we expect the system would function for any combination of ROI placements. Nonetheless, exploring this systematically would be an interesting avenue for future experiments.

      Figures:

      I would leave some of the methodological details, such as the protocol for water restriction (Fig. 3), out of the legend. This will help with readability.

      We have removed some of the methodological details, including those mentioned above, from the legend of Figure 3 in the updated manuscript.

      Fig 1 and Fig 2: In my opinion, it would be easier for the reader if the current Fig. 2, which provides a high-level description of CLNF and CLBF, were presented as Fig. 1. The current Fig. 1 goes into a lot of methodological implementation details and also includes a lot of programming jargon that is hard to digest so early in the paper's narrative.

      Thank you for the suggestion. In the new manuscript, Figure 1 and Figure 2 have been swapped.

      Higher-resolution images/ plots are needed in many instances. Unsure if this is the pdf compression done by the manuscript portal that is causing this.

      All figures were prepared in vector graphics format using the open-source software Inkscape. For this manuscript, we exported the images at 300 DPI, which is generally sufficient for publication-quality documents. The submission portal may apply additional processing, which could have resulted in a reduction in image quality. We will carefully review the final submission files and ensure that all figures are clear and of high quality.

      The authors repeatedly show ROI specific analysis M1_L, F1_R etc. It will be helpful to provide a key, even if redundant in all figures to help the reader.

      We have now included keys to all such abbreviations in all the figures.

      There are also instances of editorialization and interpretation e.g., "Surprisingly, the "Rule-change" mice were able to discover the change in rule and started performing above 70% within a day of the rule change, on day 6" that would be more appropriate in the main body of the paper.

      Thank you for pointing this out. We have removed this statement from the figure legend, since we already discuss it in the Results.

      Minor comments

      (1) The description of Figure 1 is hard to follow and can be described better based on how the information is processed and executed in the system from source to processing and back. Using separate colors (instead of shades of grey) for the neurofeedback and movement feedback would help as well. Common components could have a different color. The specification, like the description of the config file, should come later.

      Figure 1 in the previous version is Figure 2 in the updated version. We have taken suggestions from other reviewers and made the figure easier to understand by splitting it into two panels with color coding: green for CLNF-specific parts and pink for CLMF-specific parts, while shared parts are left without any color.

      (2) Page 20 last paragraph:

      The authors are neglecting that the rule change is done one day prior, and that the results seen in the second half of the 6th day are not just due to the first half of the 6th day but rather to the combined training on the 5th day (rule change) and the first half of the 6th day. Rephrasing this observation is essential.

      We have revised the text for clarity to indicate that the performance increase observed on day 6 is not necessarily attributable to training on that day. In fact, we noted and mentioned that mice began to perform the task better during the second half of the session on day 5 itself.

      (3) The Methods section description of the CLMF setup (page 39, first paragraph) is more detailed; a diagram of this setup would make it easier to follow and a better read.

      We have made changes to the CLMF setup (Figure 1B) and CLMF schematic (Figure 2B) to make it easier to understand parts of the setup and flow of control.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Bansal et al. present a study on the fundamental blood and nectar feeding behaviors of the critical disease vector, Anopheles stephensi. The study encompasses not just the fundamental changes in blood feeding behaviors of the crucially understudied vector, but then uses a transcriptomic approach to identify candidate neuromodulation pathways which influence blood feeding behavior in this mosquito species. The authors then provide evidence through RNAi knockdown of candidate pathways that the neuromodulators sNPF and Rya modulate feeding either via their physiological activity in the brain alone or through joint physiological activity along the brain-gut axis (but critically not the gut alone). Overall, I found this study to be built on tractable, well-designed behavioral experiments.

      Their study begins with a well-structured experiment to assess how the feeding behaviors of A. stephensi change over the course of its life history and in response to its age, mating, and oviposition status. The authors are careful and validate their experimental paradigm in the more well-studied Ae. aegypti, and are able to recapitulate the results of prior studies, which show that mating is a prerequisite for blood feeding behaviors in Ae. aegypti. Here they find A. stephensi, like other Anopheline mosquitoes, has a more nuanced regulation of its blood and nectar feeding behaviors.

      The authors then go on to show in a Y-maze olfactometer that, to some degree, changes in blood feeding status depend on behavioral modulation to host cues, and this is not likely to be a simple change to the biting behaviors alone. I was especially struck by the swap in valence of the host cues for the blood-fed and mated individuals, which had not yet oviposited. This indicates that there is a change in behavior that is not simply desensitization to host cues while navigating in flight, but something much more exciting is happening.

      The authors then use a transcriptomic approach to identify candidate genes in the blood-feeding stages of the mosquito's life cycle to identify a list of 9 candidates that have a role in regulating the host-seeking status of A. stephensi. Then, through investigations of gene knockdown of candidates, they identify the dual action of RYa and sNPF as candidate neuromodulators of host-seeking in this species. Overall, I found the experiments to be well-designed. I found the molecular approach to be sound. While I do not think the molecular approach is necessarily an all-encompassing mechanism identification (owing mostly to the fact that genetic resources are not yet available in A. stephensi as they are in other dipteran models), I think it sets up a rich line of research questions for the neurobiology of mosquito behavioral plasticity and comparative evolution of neuromodulator action.

      We appreciate the reviewer’s detailed summary of our work. We thank them for their positive comments and agree with them on the shortcomings of our approach.

      Strengths:

      I am especially impressed by the authors' attention to small details in the course of this article. As I read and evaluated this article, I continued to think about how many crucial details could potentially have been missed if this had not been the approach. The attention to detail paid off in spades and allowed the authors to carefully tease apart molecular candidates of blood-seeking stages. The authors' top-down approach to identifying RYamide and sNPF starting from first principles behavioral experiments is especially comprehensive. The results from both the behavioral and molecular target studies will have broad implications for the vectorial capacity of this species and comparative evolution of neural circuit modulation.

      We really appreciate that the reviewer has recognised the attention to detail we have tried to put into this work. Thank you!

      Weaknesses:

      There are a few elements of data visualizations and methodological reporting that I found confusing on a first few read-throughs. Figure 1F, for example, was initially confusing as it made it seem as though there were multiple 2-choice assays for each of the conditions. I would recommend removing the "X" marker from the x-axis to indicate the mosquitoes did not feed from either nectar, blood, or neither in order to make it clear that there was one assay in which mosquitoes had access to both food sources, and the data quantify if they took both meals, one meal, or no meals.

      We thank the reviewer for flagging the schematic in figure 1F. As suggested, we have removed the “X” markers from the x-axis and revised the axis label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose in the assay. For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data, as it does not capture the variability in the data.

      I would also like to know more about how the authors achieved tissue-specific knockdown for RNAi experiments. I think this is an intriguing methodology, but I could not figure out from the methods why injections either had whole-body or abdomen-specific knockdown.

      The tissue-specific knockdown (abdomen only or abdomen+head) emerged from initial standardisations where we were unable to achieve knockdown in the head unless we used higher concentrations of dsRNA and did the injections in older females. We realised that this gave us the opportunity to isolate the neuronal contribution of these neuropeptides in the phenotype produced. Further optimisations revealed that injecting dsRNA into 0-10 h old females produced abdomen-specific knockdowns without affecting head expression, whereas injections into 4-day-old females resulted in knockdowns in both tissues. Moreover, head knockdowns in older females required higher dsRNA concentrations, with knockdown efficiency correlating with the amount injected. In contrast, abdominal knockdowns in younger females could be achieved even with lower dsRNA amounts.

      We have mentioned the knockdown conditions (time of injection and the amount of dsRNA injected) for tissue-specific knockdowns in the methods but realise now that this does not explain them well enough. We have now edited the text to state our methodology more clearly (see lines 932-948).

      I also found some interpretations of the transcriptomic data to be overly broad for what transcriptomes can actually tell us about the organism's state. For example, the authors mention, "Interestingly, we found that after a blood meal, glucose is neither spent nor stored, and that the female brain goes into a state of metabolic 'sugar rest', while actively processing proteins (Figure S2B, S3)".

      This would require a physiological measurement to actually know. It certainly suggests that there are changes in carbohydrate metabolism, but there are too many alternative interpretations to make this broad claim from transcriptomic data alone.

      We thank the reviewer for pointing this out and agree with them. We have now edited our statement to read:

      “Instead, our data suggests altered carbohydrate metabolism after a blood meal, with the female brain potentially entering a state of metabolic 'sugar rest' while actively processing proteins (Figure S2B, S3). However, physiological measurements of carbohydrate and protein metabolism will be required to confirm whether glucose is indeed neither spent nor stored during this period.” See lines 271-277.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Bansal et al examine and characterize feeding behaviour in Anopheles stephensi mosquitoes. While sharing some similarities to the well-studied Aedes aegypti mosquito, the authors demonstrate that mated females, but not unmated (virgin) females, exhibit suppression in their bloodfeeding behaviour. Using brain transcriptomic analysis comparing sugar-fed, blood-fed, and starved mosquitoes, several candidate genes potentially responsible for influencing blood-feeding behaviour were identified, including two neuropeptides (short NPF and RYamide) that are known to modulate feeding behaviour in other mosquito species. Using molecular tools, including in situ hybridization, the authors map the distribution of cells producing these neuropeptides in the nervous system and in the gut. Further, by implementing systemic RNA interference (RNAi), the study suggests that both neuropeptides appear to promote blood-feeding (but do not impact sugar feeding), although the impact was observed only after both neuropeptide genes underwent knockdown.

      Strengths and/or weaknesses:

      Overall, the manuscript was well-written; however, the authors should review carefully, as some sections would benefit from restructuring to improve clarity. Some statements need to be rectified as they are factually inaccurate.

      Below are specific concerns and clarifications needed in the opinion of this reviewer:

      (1) What does "central brains" refer to in abstract and in other sections of the manuscript (including methods and results)? This term is ambiguous, and the authors should more clearly define what specific components of the central nervous system was/were used in their study.

      Central brain, or mid brain, is a commonly used term to refer to brain structures/neuropils without the optic lobes (For example: https://www.nature.com/articles/s41586-024-07686-5). In this study we have focused our analysis on the central brain circuits involved in modulating blood-feeding behaviour and have therefore excluded the optic lobes. As optic lobes account for nearly half of all the neurons in the mosquito brain (https://pmc.ncbi.nlm.nih.gov/articles/PMC8121336/), including them would have disproportionately skewed our transcriptomic data toward visual processing pathways. 

      We have indicated this in figure 3A and in the methods (see lines 800-801, 812). We have now also clarified it in the results section for neurotranscriptomics to avoid confusion (see lines 236-237).

      (2) The abstract states that two neuropeptides, sNPF and RYamide are working together, but no evidence is summarized for the latter in this section.

      We thank the reviewer for pointing this out. We have now added the statement “This occurs in the context of the action of RYa in the brain” to the end of the abstract, for a complete summary of our proposed model.

      (3) Figure 1

      Panel A: This should include mating events in the reproductive cycle to demonstrate differences in the feeding behavior of Ae. aegypti.

      Our data suggest that mating can occur at any time between eclosion and oviposition in An. stephensi and between eclosion and blood feeding in Ae. aegypti. Adding these events to the (already busy) Fig. 1A would cloud the purpose of the schematic, which is to indicate the time points used in the behavioural assays and transcriptomics.

      Panel F: In treatments where insects were not provided either blood or sugar, how is it that some females and males had fed? Also, it is unclear why the y-axis label is % fed when the caption indicates this is a choice assay. Also, it is interesting that sugar-starved females did not increase sugar intake. Is there any explanation for this (was it expected)?

      We apologise for the confusion. The experiment is indeed a choice assay in which sugar-starved or sugar-sated females, co-housed with males, were provided simultaneous access to both blood and sugar, and were assessed for the choice made (indicated on the x-axis): both blood and sugar, blood only, sugar only, or neither. The x-axis indicates the choice made by the mosquitoes, not the choice provided in the assay, and the y-axis indicates the percentage of males or females that made each particular choice. We have now removed the “X” markers from the x-axis and revised the axis label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose to take.

      In this assay, we scored females only for the presence or absence of each meal type (blood or sugar) and are therefore unable to comment on whether sugar-starved females consumed more sugar than sugar-sated females. However, when sugar-starved, a higher proportion of females consumed both blood and sugar, while fewer fed on blood alone.

      For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data as it does not capture the variability in the data.

      (4) Figure 3

      In the neurotranscriptome analysis of the (central) brain involving the two types of comparisons, can the authors clarify what "excluded in males" refers to? Does this imply that only genes not expressed in males were considered in the analysis? If so, what about co-expressed genes that have a specific function in female feeding behaviour?

      This is indeed correct. We reasoned that since blood feeding is exclusive to females, we should focus our analysis on genes that were specifically upregulated in them. As the reviewer points out, it is very likely that genes commonly upregulated in males and females may also promote blood feeding and we will miss out on any such candidates based on our selection criteria. 

      (5) Figure 4

      The authors state that there is more efficient knockdown in the head of unfed females; however, this is not accurate since they only get knockdown in unfed animals, and no evidence of any knockdown in fed animals (panel D). This point should be revised in the results text as well.

      Perhaps we do not understand the reviewer’s point or there has been a misunderstanding. In figure 4D, we show that while there is more robust gene knockdown in unfed females, blood-fed females also showed modest but measurable knockdowns ranging from 5-40% for RYamide and 2-21% for sNPF. 

      Relatedly, blood-feeding is decreased when both neuropeptide transcripts are targeted compared to uninjected (panel C) but not compared to dsGFP injected (panel E). Why is this the case if authors showed earlier in this figure (panel B) that dsGFP does not impact blood feeding?

      We realise this concern stems from our representation of the data. Since we had earlier determined that dsGFP-injected females fed similarly to uninjected females (fig 4B), we used these controls interchangeably in subsequent experiments. To avoid confusion, we have now only used the label ‘control’ in figure 4 (and supplementary figure S9) and specified which control was used for each experiment in the legend.

      In addition to this, we wanted to clarify that fig 4C and 4E are independent experiments. 4C is the behaviour corresponding to when the neuropeptides were knocked down in both heads and abdomens. 4E is the behaviour corresponding to when the neuropeptides were knocked down in only the abdomens. We have now added a schematic in the plots to make this clearer.

      In addition, do the uninjected and dsGFP-injected relative mRNA expression data reflect combined RYa and sNPF levels? Why is there no variation in these data,…

      In these qPCRs, we calculated relative mRNA expression using the delta-delta Ct method (see line 975). For each neuropeptide its respective control was used. For simplicity, we combined the RYa and sNPF control data into a single representation. The value of this control is invariant because this method sets the control baseline to a value of 1.
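      The invariance of the control can be illustrated with a minimal sketch of the standard 2^(-ΔΔCt) calculation. This is an illustration only, not the authors' analysis script: the function names, the use of a single reference (housekeeping) gene, and the Ct values are all hypothetical and are not taken from the study.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative mRNA level by the 2^(-ddCt) method.

    All Ct values are hypothetical inputs; the reference gene is an
    assumed housekeeping gene, not one named in the source text.
    """
    # Normalise the target gene to the reference gene within each group.
    d_ct_sample = ct_target_sample - ct_ref_sample
    d_ct_control = ct_target_control - ct_ref_control
    # Compare the treated sample against the control group.
    dd_ct = d_ct_sample - d_ct_control
    return 2.0 ** (-dd_ct)


def percent_knockdown(rel_expr):
    """Convert a relative expression value into percent knockdown."""
    return (1.0 - rel_expr) * 100.0


# The control compared against itself always has ddCt = 0, so its
# relative expression is 1 by construction (no variation to plot).
control = relative_expression(24.0, 18.0, 24.0, 18.0)

# Illustrative knockdown sample: the target crosses threshold one
# cycle later than in the control, with an unchanged reference gene.
knockdown = relative_expression(25.0, 18.0, 24.0, 18.0)

print(control, knockdown, percent_knockdown(knockdown))
```

      In this sketch the control evaluates to exactly 1 whatever its raw Ct values are, which is why a control plotted this way shows no spread, while each knockdown sample is expressed as a fraction of that baseline.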

      …and how do transcript levels of RYa and sNPF compare in the brain versus the abdomen (the presentation of data doesn't make this relationship clear).

      The reviewer is correct in pointing out that we have not clarified this relationship in our current presentation. While we have not performed absolute mRNA quantifications, we extracted relative mRNA levels from qPCR data of 96h old unmanipulated control females. We observed that both sNPF and RYa transcripts are expressed at much lower levels in the abdomens, as compared to those in the heads, as shown in Author response Image 1 below. 

      Author response image 1.

      (6) As an overall comment, the figure captions are far too long and include redundant text presented in the methods and results sections.

      We thank the reviewer for flagging this and have now edited the legends to remove redundancy.  

      (7) Criteria used for identifying neuropeptides promoting blood-feeding: statement that reads "all neuropeptides, since these are known to regulate feeding behaviours". This is not accurate since not all neuropeptides govern feeding behaviors, while certainly a subset do play a role.

      We agree with the reviewer that not all neuropeptides regulate feeding behaviours. Our statement refers to the screening approach we used: in our shortlist of candidates, we chose to validate all neuropeptides.

      (8) In the section beginning with "Two neuropeptides - sNPF and RYa - showed about 25% and 40% reduced mRNA levels...", the authors state that there was no change in blood-feeding and later state the opposite. The wording should be clarified as it is unclear.

      Thank you for pointing this out. We were referring to an unchanged proportion of the blood fed females. We have now edited the text to the following: 

      “Two neuropeptides - sNPF and RYa - showed about 25% and 40% reduced mRNA levels in the heads but the proportion of females that took blood meals remained unchanged”. See lines 338-340.

      (9) Just before the conclusions section, the statement that "neuropeptide receptors are often ligand-promiscuous" is unjustified. Indeed, many studies have shown in heterologous systems that high concentrations of structurally related peptides, which are not physiologically relevant, might cross-react and activate a receptor belonging to a different peptide family; however, the natural ligand is often many times more potent (in most cases, orders of magnitude) than structurally related peptides. This is certainly the case for various RYamide and sNPF receptors characterized in various insect species.

      We agree with the reviewer and apologise for the mistake. We have now removed the statement.

      (10) Methods

      In the dsRNA-mediated gene knockdown section, the authors could more clearly describe how much dsRNA was injected per target. At the moment, the reader must carry out calculations based on the concentrations provided and the injected volume range provided later in this section.

      We have now edited the section to reflect the amount of dsRNA injected per target. Please see lines 921-931.

      It is also unclear how tissue-specific knockdown was achieved by performing injection on different days/times. The authors need to explain/support, and justify how temporal differences in injection lead to changes in tissue-specific expression. Does the blood-brain barrier limit knockdown in the brain instead, while leaving expression in the peripheral organs susceptible?

      To achieve tissue-specific knockdowns of sNPF and RYa, we optimised both the time of injection as well as the dsRNA concentration to be injected. Injecting dsRNA into 0-10h females produced abdomen-specific knockdowns without affecting head expression, whereas injections into 96h old females resulted in knockdowns in both tissues. Head knockdowns in older females required higher dsRNA concentrations, with knockdown efficiency correlating with the amount injected. In contrast, abdominal knockdowns in younger females could be achieved even with lower dsRNA amounts, reflecting the lower baseline expression of sNPF in abdomens compared to heads and the age-dependent increase in head expression (as confirmed by qPCR). It is possible that the blood-brain barrier also limits the dsRNA entering the brain, thereby requiring higher amounts to be injected for head knockdowns. 

      We have now edited this section to state our methodology more clearly (see lines 932-948).

      For example, in Figure 4, the data support that knockdown in the head/brain is only effective in unfed animals compared to uninjected animals, while there is no evidence of knockdown in the brain relative to dsGFP-injected animals. Comparatively, the data appear to show stronger abdominal knockdown, most markedly for the RYa transcript (>90%) but still significant for the sNPF transcript (>60%).

      As we explained earlier, this concern likely stems from our representation of the data. Since we had earlier determined that dsGFP-injected females fed similarly to uninjected females (fig 4B), we used these controls interchangeably in subsequent experiments. To avoid confusion, we have now only used the label ‘control’ in figure 4 (and supplementary figure S9) and specified which control was used for each experiment in the legend.

      In addition to this, we wanted to clarify that fig 4C and 4E are independent experiments. 4C is the behaviour corresponding to when the neuropeptides were knocked down in both heads and abdomens.  4E is the behaviour corresponding to when the neuropeptides were knocked down in only the abdomen. We have now added a schematic in the plots to make this clearer.

      Reviewer #3 (Public review):

      Summary:

      This manuscript investigates the regulation of host-seeking behavior in Anopheles stephensi females across different life stages and mating states. Through transcriptomic profiling, the authors identify differential gene expression between "blood-hungry" and "blood-sated" states. Two neuropeptides, sNPF and RYamide, are highlighted as potential mediators of host-seeking behavior. RNAi knockdown of these peptides alters host-seeking activity, and their expression is anatomically mapped in the mosquito brain (sNPF and RYamide) and midgut (sNPF only).

      Strengths:

      (1) The study addresses an important question in mosquito biology, with relevance to vector control and disease transmission.

      (2) Transcriptomic profiling is used to uncover gene expression changes linked to behavioral states.

      (3) The identification of sNPF and RYamide as candidate regulators provides a clear focus for downstream mechanistic work.

      (4) RNAi experiments demonstrate that these neuropeptides are necessary for normal host-seeking behavior.

      (5) Anatomical localization of neuropeptide expression adds depth to the functional findings.

      Weaknesses:

      (1) The title implies that the neuropeptides promote host-seeking, but sufficiency is not demonstrated (for example, with peptide injection or overexpression experiments).

      Demonstrating sufficiency would require injecting sNPF peptide or its agonist. To date, no small-molecule agonists (or antagonists) that selectively mimic sNPF or RYa neuropeptides have been identified in insects. An NPY analogue, TM30335, has been reported to activate the Aedes aegypti NPY-like receptor 7 (NPYLR7; Duvall et al., 2019), which is also activated by sNPF peptides at higher doses (Liesch et al., 2013). Unfortunately, the compound is no longer available because its manufacturer, 7TM Pharma, has ceased operations. Synthesising the peptides is a possibility that we will explore in the future.

      (2) The proposed model regarding central versus peripheral (gut) peptide action is inconsistently presented and lacks strong experimental support.

      The best way to address this would be to conduct tissue-specific manipulations, the tools for which are not available in this species. Our approach to achieve head+abdomen and abdomen only knockdown was the closest we could get to achieving tissue specificity and allowed us to confirm that knockdown in the head was necessary for the phenotype. However, as the reviewer points out, this did not allow us to rule out any involvement of the abdomen. This point has been addressed in lines 364-371.

      (3) Some conclusions appear premature based on the current data and would benefit from additional functional validation.

      The most definitive way of demonstrating the necessity of sNPF and RYa in blood feeding would be to generate mutant lines. While we are pursuing this line of experiments, they lie beyond the scope of a revision. In the absence of mutants, we relied on knockdown of the genes using dsRNA. We would like to posit that, despite only partial knockdown, mosquitoes do display defects in blood-feeding behaviour while sugar feeding remains unaffected. We think this reflects the importance of sNPF in promoting blood feeding.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Overall, I found this manuscript to be well-prepared, visually the figures are great and clearly were carefully thought out and curated, and the research is impactful. It was a wonderful read from start to finish. I have the following recommendations:

      Thank you very much, we are very pleased to hear that you enjoyed reading our manuscript!

      (1) For future manuscripts, it would make things significantly easier on the reviewer side to submit a format that uses line numbers.

      We sincerely apologise for the oversight. We have now incorporated line numbers in the revised manuscript.

      (2) There are a few statements in the text that I think may need clarification or might be outside the bounds of what was actually studied here. For example, in the introduction "However, mating is dispensable in Anophelines even under conditions of nutritional satiety". I am uncertain what is meant by this statement - please clarify.

      We apologise for the lack of clarity in the statement and have now deleted it since we felt it was not necessary.

      (3) Typo/Grammatical minutiae:

      (a) A small idiosyncrasy of using hyphens in compound words should also be fixed throughout. Typically, you don't hyphenate if the words are being used as a noun, as in the case: e.g. "Age affects blood feeding.". However, you would hyphenate if the two words are used as a compound adjective "Age affects blood-feeding behavior". This may not be an all-inclusive list, but here are some examples where hyphens need to either be removed or added. Some examples:

      "Nutritional state also influences other internal state outputs on blood-feeding": blood-feeding -> blood feeding

      "... the modulation of blood-feeding": blood-feeding -> blood feeding

      "For example, whether virgin females take blood-meals...": blood-meals -> blood meals

      ".... how internal and external cues shape meal-choice"-> meal choice

      "blood-meal" is often used throughout the text, but is correctly "blood meal" in the figures.

      There are many more examples throughout.

      We apologise for these errors and appreciate the reviewer’s keen eye. We have now fixed them throughout the manuscript.  

      (b) Figure 1 Caption has a typo: "co-housed males were accessed for sugar-feeding" should be "co-housed males were assessed for sugar feeding"

      We apologise for the typo and thank the reviewer for spotting it. We have now corrected this.  

      (c) It would be helpful in some other figure captions to more clearly label which statement is relevant to which part of the text. For example, in Figure 4's caption.

      "C,D. Blood-feeding and sugar-feeding behaviour of females when both RYa and sNPF are knocked down in the head (C). Relative mRNA expressions of RYa and sNPF in the heads of dsRYa+dssNPF - injected blood-fed and unfed females, as compared to that in uninjected females, analysed via qPCR (D)."

      I found re-referencing C and D at the end of their statements makes it look as though C precedes the "Relative mRNA expression" and on a first read through, I thought the figure captions were backwards. I'd recommend reformatting here and throughout consistently to only have the figure letter precede its relevant caption information, e.g.:

      "C. Blood-feeding and sugar-feeding behaviour of females when both RYa and sNPF are knocked down in the head. D. Relative mRNA expressions of RYa and sNPF in the heads of dsRYa+dssNPF - injected blood-fed and unfed females, as compared to that in uninjected females, analysed via qPCR."

      We have now edited the legends as suggested.

      Reviewer #2 (Recommendations for the authors):

      Separately from the clarifications and limitations listed above, the authors could strengthen their study and the conclusions drawn if they could rescue the behavioural phenotype observed following knockdown of sNPF and RYamide. This could be achieved by injection of either sNPF or RYa peptide independently or combined following knockdown to validate the role of these peptides in promoting blood-feeding in An. stephensi. Additionally, the apparent (but unclear) regionalized (or tissue-specific) knockdown of sNPF and RYamide transcripts could be visualized and verified by implementing HCR in situ hyb in knockdown animals (or immunohistochemistry using antibodies specific for these two neuropeptides). 

      In a follow up of this work, we are generating mutants and peptides for these candidates and are planning to conduct exactly the experiments the reviewer suggests.

      Reviewer #3 (Recommendations for the authors):

      The loss-of-function data suggest necessity but not sufficiency. Synthetic peptide injection in non-host-seeking (blood-fed mated or juvenile) mosquitoes would provide direct evidence for peptide-induced behavioral activation. The lack of these experiments weakens the central claim of the paper that these neuropeptides directly promote blood feeding.

      As noted above, we plan to synthesise the peptide to test rescue in a mutant background and sufficiency.  

      Some of the claims about knockdown efficiency and interpretation are conflicting; the authors dismiss Hairy and Prp as candidates due to 30-35% knockdown, yet base major conclusions on sNPF and RYamide knockdowns with comparable efficiencies (25-40%). This inconsistency should be addressed, or the justification for different thresholds should be clearly stated.

      We have not defined any specific knockdown efficiency thresholds in the manuscript, as these can vary considerably between genes, and in some cases even modest reductions can be sufficient to produce detectable phenotypes. For example, knockdown efficiencies as low as about 25-40% gave us observable phenotypes for sNPF and RYa RNAi (Figure S9B-G).

      No such phenotypes were observed for Hairy (30%) or Prp (35%) knockdowns. Either these genes are not involved in blood feeding, or the knockdown was not sufficient for these specific genes to induce phenotypes. We cannot distinguish between these scenarios. 

      The observation that knockdown animals take smaller blood meals is interesting and could reflect a downstream effect of altered host-seeking or an independent physiological change. The relationship between meal size and host-seeking behavior should be clarified.

      We agree with the reviewer that the reduced meal size observed in sNPF and RYa knockdown animals could result either from their inability to seek a host or from an independent effect on blood meal intake. Unfortunately, we did not measure host-seeking in these animals. We plan to distinguish between these possibilities using mutants in future work.

      Several figures are difficult to interpret due to cluttered labeling and poorly distinguishable color schemes. Simplifying these and improving contrast (especially for co-housed vs. virgin conditions) would enhance readability. 

      We regret that the reviewer found the figures difficult to follow. We have now revised our annotations throughout the manuscript for enhanced readability. For example, “D1<sup>B</sup>” is now “D1<sup>PBM</sup>” (post-bloodmeal) and “D1<sup>O</sup>” is now “D1<sup>PO</sup>” (post-oviposition). Wherever mated females were used, we have now appended “(m)” to the annotations and consistently depicted these females with striped abdomens in all the schematics. We believe these changes will improve clarity and readability.

      The manuscript does not clearly justify the use of whole-brain RNA sequencing to identify peptides involved in metabolic or peripheral processes. Given that anticipatory feeding signals are often peripheral, the logic for brain transcriptomics should be explained.

      The reviewer is correct in pointing out that feeding signals could also emerge from peripheral tissues. Signals from these tissues, in response to both changing nutritional and reproductive states, are then integrated by the central brain to modulate feeding choices. For example, in Drosophila, increased protein intake is mediated by central brain circuitry including those in the SEZ and central complex (Munch et al., 2022; Liu et al., 2017; Goldschmidt et al., 202ti). In the context of mating, male-derived sex peptide further increases protein feeding by acting on a dedicated central brain circuitry (Walker et al., 2015). We therefore focused on the central brain for our studies.

      The proposed model suggests brain-derived peptides initiate feeding, while gut peptides provide feedback. However, gut-specific knockdowns had no effect, undermining this hypothesis. Conversely, the authors also suggest abdominal involvement based on RNAi results. These contradictions need to be resolved into a consistent model.

      We thank the reviewer for raising this point and recognise their concern. Our reasons for invoking an involvement of the gut were two-fold:

      (1) We find increased sNPF transcript expression in the entero-endocrine cells of the midgut in blood-hungry females, which returns to baseline after a blood-meal (Fig. 4L, M).

      (2) While the abdomen-only knockdowns did not affect blood feeding, every effective head knockdown that affected blood feeding also abolished abdominal transcript levels (Fig. S9C, F). (Achieving a head-only reduction proved impossible because (i) systemic dsRNA delivery inevitably reaches the abdomen and (ii) abdominal expression of both peptides is low, leaving little dynamic range for selective manipulation.) Consequently, we can only conclude the following: 1) that brain expression is required for the behaviour, 2) that we cannot exclude a contributory role for gut-derived sNPF. We have discussed this in lines 364-371.

      The identification of candidate receptors is promising, but the manuscript would be significantly strengthened by testing whether receptor knockdowns phenocopy peptide knockdowns. Without this, it is difficult to conclude that the identified receptors mediate the behavioral effects.

      We agree that functional validation of the receptors would strengthen the evidence for sNPF and RYa-mediated control of blood feeding in An. stephensi. We selected these receptors based on sequence homology. A possibility remains that sNPF neuropeptides activate more than one receptor, each modulating a distinct circuit, as shown in the case of Drosophila Tachykinin (https://pmc.ncbi.nlm.nih.gov/articles/PMC10184743/). Confirming their roles would therefore require systematic characterisation and knockdown of each receptor. We are planning these experiments in the future.

      The authors compared the percentage changes in sugar-fed and blood-fed animals under sugar-sated or sugar-starved conditions. Figure 1F should reflect what was discussed in the results.

      Perhaps this concern stems from our representation of the data in figure 1F? We have now edited the x-axis and revised its label from “choice of food” to “choice made” to better reflect what food the mosquitoes chose to take.

      For clarity, we have now also plotted the same data as stacked graphs at the bottom of Fig. 1F, which clearly shows the proportion of mosquitoes fed on each particular choice. We avoid the stacked graph as the sole representation of this data because it does not capture the variability in the data.

      Minor issues:

      (1) The authors used mosquitoes with belly stripes to indicate mated females. To be consistent, the post-oviposition females should also have belly stripes.

      We thank the reviewer for pointing this out. We have now edited all the figures as suggested.

      (2) In the first paragraph on the right column of the second page, the authors state, "Since females took blood-meals regardless of their prior sugar-feeding status and only sugar-feeding was selectively suppressed by prior sugar access." Just because the well-fed animals ate less than the starved animals does not mean their feeding behavior was suppressed.

      Perhaps there has been a misunderstanding in the experimental setup of figure 1F, probably stemming from our data representation. The experiment is a choice assay in which sugar-starved or sugar-sated females, co-housed with males, were provided simultaneous access to both blood and sugar, and were assessed for the choice made (indicated on the x-axis): both blood and sugar, blood only, sugar only, or neither. We scored females only for the presence or absence of each meal type (blood or sugar) and did not quantify the amount consumed.

      (3) The figure legend for Figure 1A and the naming convention for different experimental groups are difficult to follow. A simplified or consistently abbreviated scheme would help readers navigate the figures and text.

      We regret that the reviewer found the figure difficult to follow. We have now revised our annotations throughout the manuscript for enhanced readability. For example, “D1<sup>B</sup>” is now “D1<sup>PBM</sup>” (post-bloodmeal) and “D1<sup>O</sup>” is now “D1<sup>PO</sup>” (post-oviposition).

      (4) In the last paragraph of the Y-maze olfactory assay for host-seeking behaviour in An. stephensi in Methods, the authors state, "When testing blood-fed females, aged-matched sugar-fed females (blood-hungry) were included as positive controls where ever possible, with satisfactory results." The authors should explicitly describe what the criteria are for "satisfactory results".

      We apologise for the lack of clarity. We have now edited the statement to read:

      “When testing blood-fed females, age-matched sugar-fed females (blood-hungry) were included wherever possible as positive controls. These females consistently showed attraction to host cues, as expected.” See lines 786-790.

      (5) In the first paragraph of the dsRNA-mediated gene knockdown section in Methods, dsRNA against GFP is used as a negative control for the injection itself, but not for the potential off-target effect.

      We agree with the reviewer that dsGFP injections act as controls only for injection-related behavioural changes, and not for off-target effects of RNAi. We have now corrected the statement. See lines 919-920.

      To control for off-target effects, we could have designed multiple dsRNAs targeting different parts of a given gene. We regret not including these controls for potential off-target effects of dsRNAs injected. 

      (6) Reference numbers 48, 89, and 90 are not complete citations.

      We thank the reviewer for spotting these. We have now corrected these citations.

    1. Author response:

      The following is the authors’ response to the original reviews.

      First, we thank the reviewers for the valuable and constructive reviews. Thanks to these, we believe the article has been considerably improved. We have organized our response to address points that are relevant to both reviewers first, after which we address the unique concerns of each individual reviewer separately. We briefly paraphrase each concern and provide comments for clarification, outlining the precise changes that we have made to the text.

      Common Concerns (R1 & R2):

      Can you clarify how NREM and REM sleep relate to the oneirogen hypothesis?

      Within the submission draft we tried to stay agnostic as to whether mechanistically similar replay events occur during NREM or REM sleep; however, upon a more thorough literature review, we think that there is moderately greater evidence in favor of Wake-Sleep-type replay occurring during REM sleep which is related to classical psychedelic drug mechanisms of action.

      First, we should clarify that replay has been observed during both REM and NREM sleep, and dreams have been documented during both sleep stages, though the characteristics of dreams differ across stages, with NREM dreams being more closely tied to recent episodic experience and REM dreams being more bizarre/hallucinatory (see Stickgold et al., 2001 for a review). Replay during sleep has been studied most thoroughly during NREM sharp-wave ripple events, in which significant cortical-hippocampal coupling has been observed (Ji & Wilson, 2007). However, it is critical to note that the quantification methods used to identify replay events in the hippocampal literature usually focus on identifying what we term ‘episodic replay,’ which involves a near-identical recapitulation of neural trajectories that were recently experienced during waking experimental recordings (Tingley & Peyrache, 2020). In contrast, our model focuses on ‘generative replay,’ where one expects only a statistically similar reproduction of neural activity, without any particular bias towards recent or experimentally controlled experience. This latter form of replay may look closer to the ‘reactivation’ observed in cortex by many studies (e.g. Nguyen et al., 2024), where correlation structures of neural activity similar to those observed during stimulus-driven experience are recapitulated. Under experimental conditions in which an animal is experiencing highly stereotyped activity repeatedly, over extended periods of time, these two forms of replay may be difficult to dissociate.

      Interestingly, though NREM replay has been shown to couple hippocampal and cortical activity, a similar study in waking animals administered psychedelics found hippocampal replay without any obvious coupling to cortical activity (Domenico et al., 2021). This could be because the coupling was not strong enough to produce full trajectories in the cortex (psychedelic administration did not increase ‘alpha’ enough), in which case a causal manipulation of apical/basal influence in the cortex may be necessary to observe the increased coupling. Alternatively, as Reviewer 1 noted, it may be that psychedelics induce a form of hippocampus-decoupled replay, as one would expect from the REM stage of a recently proposed complementary learning systems model (Singh et al., 2022).

      Evidence in favor of a similarity between the mechanism of action of classical psychedelics and the mechanism of action of memory consolidation/learning during REM sleep is actually quite strong. In particular, studies have shown that REM sleep increases the activity of soma-targeting parvalbumin (PV) interneurons and decreases the activity of apical dendrite-targeting somatostatin (SOM) interneurons (Niethard et al., 2021), that this shift in balance is controlled by higher-order thalamic nuclei, and that this shift in balance is critical for synaptic consolidation both of monocular deprivation effects in early visual cortex (Zhou et al., 2020) and of auditory fear conditioning in the dorsal prefrontal cortex (Aime et al., 2022). These last studies were not discussed in our previous text; we have added them, in addition to a more nuanced description of the evidence connecting our model to NREM and REM replay.

      Relevant modifications: Page 4, 1st paragraph; Page 11, 1st paragraph.

      Can you explain how synaptic plasticity induced by psychedelics within your model relates to learning at a behavioral level?

      While the Wake-Sleep algorithm is a useful model for unsupervised statistical learning, it is not a model of reward or fear-based conditioning, which likely occur via different mechanisms in the brain (e.g. dopamine-dependent reinforcement learning or serotonin-dependent emotional learning). The Wake-Sleep algorithm is a ‘normative plasticity algorithm’ that connects synaptic plasticity to the formation of structured neural representations, but it is not the case that all synaptic plasticity induced by psychedelic administration within our model should induce beneficial learning effects. According to the Wake-Sleep algorithm, plasticity at apical synapses is enhanced during the Wake phase, and plasticity at basal synapses is enhanced during the Sleep phase; under the oneirogen hypothesis, hallucinatory conditions (increased ‘alpha’) cause an increase in plasticity at both apical and basal sites. Because neural activity is in a fundamentally aberrant state when ‘alpha’ is increased, there are no theoretical guarantees that plasticity will improve performance on any objective: psychedelic-induced plasticity within our model could perhaps better be thought of as ‘noise’ that may have a positive or negative effect depending on the context.

      In particular, such ‘noise’ may be beneficial for individuals or networks whose synapses have become locked in a suboptimal local minimum. The addition of large amounts of random plasticity could allow a system to extricate itself from such local minima over subsequent learning (or with careful selection of stimuli during psychedelic experience), similar to simulated annealing optimization approaches. If our model were fully validated, this view of psychedelic-induced plasticity as ‘noise’ could have relevance for efforts to alleviate the adverse effects of PTSD, early life trauma, or sensory deprivation; it may also provide a cautionary note against repeated use of psychedelic drugs within a short time frame, as the plasticity changes induced by psychedelic administration under our model are not guaranteed to be good or useful in and of themselves without subsequent re-learning and compensation.

      We should also note that we have deliberately avoided connecting the oneirogen hypothesis model to fear extinction experimental results that have been observed through recordings of the hippocampus or the amygdala (Bombardi & Giovanni, 2013; Jiang et al., 2009; Kelly et al., 2024; Tiwari et al., 2024). Both regions receive extensive innervation directly from serotonergic synapses originating in the dorsal raphe nucleus, which have been shown to play an important role in emotional learning (Lesch & Waider, 2012); because classical psychedelics may play a more direct role in modulating this serotonergic innervation, it is possible that fear conditioning results (in addition to the anxiolytic effects of psychedelics) cannot be attributed to a shift in balance between apical and basal synapses induced by psychedelic administration. We have provided a more detailed review of these results in the text, as well as more clarity regarding their relation to our model.

      Relevant modifications: Page 9, final paragraph; Page 12, final paragraph.

      Reviewer 1 Concerns:

      Is it reasonable to assign a scalar parameter ‘alpha’ to the effects of classical psychedelics? And is your proposed mechanism of action unique to classical psychedelics? E.g. Could this idea also apply to kappa opioid agonists, ketamine, or the neural mechanisms of hallucination disorders?

      We have clarified that within our model ‘alpha’ is a parameter that reflects the balance between apical and basal synapses in determining the activity of neurons in the network. For the sake of simplicity we used a single ‘alpha’ parameter, but realistically, each neuron would have its own ‘alpha’ parameter, and different layers or individual neurons could be affected differentially by the administration of any particular drug; therefore, our scalar ‘alpha’ value can be thought of as a mean parameter for all neurons, disregarding heterogeneity across individual neurons.
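
      As an illustration only (this is not code from the manuscript), the relation between a per-neuron ‘alpha’ and its scalar summary can be sketched as follows. The convex-combination form, the function name `mix_compartments`, and the distribution of per-neuron values are all assumptions made for the example; the response states only that ‘alpha’ sets the balance between apical and basal influence and that the scalar can be read as a population mean.

```python
import numpy as np

def mix_compartments(basal, apical, alpha):
    """Assumed convex mix of basal (bottom-up) and apical (top-down) drive.

    alpha may be a scalar (one value for the whole population) or an array
    with one entry per neuron; alpha=0 is purely basal, alpha=1 purely apical.
    """
    alpha = np.asarray(alpha, dtype=float)
    return (1.0 - alpha) * basal + alpha * apical

rng = np.random.default_rng(0)
n_neurons = 1000
basal = rng.normal(size=n_neurons)    # stand-in for bottom-up input
apical = rng.normal(size=n_neurons)   # stand-in for top-down input

# Hypothetical heterogeneous per-neuron values, clipped to [0, 1].
alpha_per_neuron = np.clip(rng.normal(loc=0.3, scale=0.1, size=n_neurons), 0.0, 1.0)

# The scalar 'alpha' read as the mean of the per-neuron values.
alpha_scalar = alpha_per_neuron.mean()

act_hetero = mix_compartments(basal, apical, alpha_per_neuron)
act_scalar = mix_compartments(basal, apical, alpha_scalar)
```

      The scalar version discards the neuron-to-neuron spread, so the two activity vectors differ for individual neurons even though they share the same average balance.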

      There are many different mechanisms that could theoretically affect this ‘alpha’ parameter, including: 5-HT2a receptor agonism, kappa opioid receptor binding, ketamine administration, or possibly the effects of genetic mutations underlying the pathophysiology of complex developmental hallucination disorders. We focused exclusively on 5-HT2a receptor agonism for this study because the mechanism is comparatively simple and extensively characterized, but similar mechanisms may well be responsible for the hallucinatory symptoms of a variety of drugs and disorders.

      Relevant modifications: Page 4, first paragraph; Page 13, first paragraph.

      Can you clarify the role of 5-HT2a receptor expression on interneurons within your model?

      While we mostly focused on the effects of 5-HT2a receptors on the apical dendrites of pyramidal neurons, these receptors are also expressed on soma-targeting parvalbumin (PV) interneurons. This expression on PV interneurons is consistent with our proposed psychedelic mechanism of action, because it could lead to a coordinated decrease in the influence of somatic and proximal dendritic inputs while increasing the influence of apical dendritic inputs. We have elaborated on this point, and moved the discussion earlier in the text.

      Relevant modifications: Page 1, 1st paragraph; Page 4, 2nd paragraph.

      Discussions of indigenous use of psychedelics over millennia may amount to over-romanticization.

      We ultimately decided to remove these discussions from the main text, as they had little bearing on the content of our work. Within the Ethics Declarations section we softened our claims from “millennia” to “centuries,” as indigenous psychedelic use over this latter period of time is well-substantiated.

      Relevant modifications: removed from introduction; modified Ethics Declarations

      You isolate the 5-HT2a agonism as the mechanism of action underlying ‘alpha’ in your model, but there exist 5-HT2a agonists that do not have hallucinatory effects (e.g. lisuride). How do you explain this?

      Lisuride has much-reduced hallucinatory effects compared to other psychedelic drugs at clinical doses (though it does indeed induce hallucinations at high doses; Marona-Lewicka et al., 2002), and we should note that serotonin (5-HT) itself is pervasive in the cortex without inducing hallucinatory effects during natural function. Similarly, MDMA is a partial agonist for 5-HT2a receptors, but it has much-reduced perceptual hallucination effects relative to classical psychedelics (Green et al., 2003) in addition to many other effects not induced by classical psychedelics.

      Therefore, while we argue that 5-HT2a agonism induces an increase in influence of apical dendritic compartments and a decrease in influence of basal/somatic compartments, and that this change induces hallucinations, we also note that there are many other factors that control whether or not hallucinations are ultimately produced, so that not all 5-HT2a agonists are hallucinogenic. There are two possible additional factors that could contribute to this phenomenon: 5-HT receptor binding affinity and cellular membrane permeability.

      Importantly, many 5-HT2a receptor agonists are also 5-HT1a receptor agonists (e.g. serotonin itself and lisuride), while MDMA has also been shown to increase serotonin, norepinephrine, and dopamine release (Green et al., 2003). While 5-HT2a receptor agonism has been shown to reduce sensory stimulus responses (Michaiel et al., 2019), 5-HT1a receptor agonism inhibits spontaneous cortical activity (Azimi et al., 2020); thus one might expect the net effect of administering serotonin or a nonselective 5-HT receptor agonist to be widespread inhibition of a circuit, as has been observed in visual cortex (Azimi et al., 2020). Therefore, selective 5-HT2a agonism is critical for the induction of hallucinations according to our model, though any intervention that jointly excites pyramidal neurons’ apical dendrites and inhibits their basal/somatic compartments across a broad enough area of cortex would be predicted to have a similar effect. Lisuride has a much higher binding affinity for 5-HT1a receptors than, for instance, LSD (Marona-Lewicka et al., 2002).

      Secondly, it has recently been shown that both the head-twitch effect (a coarse behavioral readout of hallucinations in animals) and the plasticity effects of psychedelics are abolished when administering 5-HT2a agonists that are impermeable to the cellular membrane because of high polarity, and that these effects can be rescued by temporarily rendering the cellular membrane permeable (Vargas et al., 2023). This suggests that the critical hallucinatory effects of psychedelics (apical excitation according to our model) may be mediated by intracellular 5-HT2a receptors. Notably, serotonin itself is not membrane permeable in the cortex.

      Therefore, either of these two properties could play a role in whether a given 5-HT2a agonist induces hallucinatory effects. We have provided an extended discussion of these nuances in our revision.

      Relevant modifications: Page 1, paragraph 2.

      Your model proposes that an increase in top-down influence on neural activity underlies the hallucinatory effects of psychedelics. How do you explain experimental results that show increases in bottom-up functional connectivity (either from early sensory areas or the thalamus)?

      Firstly, we should note that our proposed increase in top-down influence is a causal, biophysical property, not necessarily a statistical/correlative one. As such, we will stress that the best way to test our model is via direct intervention in cortical microcircuitry, as opposed to correlative approaches taken by most fMRI studies, which have shown mixed results with regard to this particular question. Correlative approaches can be misleading due to dense recurrent coupling in the system, and due to the coarse temporal and spatial resolution provided by noninvasive recording technologies (changes in statistical/functional connectivity do not necessarily correspond to changes in causal/mechanistic connectivity, i.e. correlation does not imply causation).

      Two experimental results that appear to contradict our hypothesis deserve special consideration. The first shows an increase in directional thalamic influence on distributed cortical networks after psychedelic administration (Preller et al., 2018). To explain this, we note that this study does not distinguish between lower-order sensory thalamic nuclei (e.g. the lateral and medial geniculate nuclei receiving visual and auditory stimuli respectively) and the higher-order thalamic nuclei that participate in thalamocortical connectivity loops (Whyte et al., 2024). Subsequent more fine-grained studies have noted an increase in influence of higher-order thalamic nuclei on the cortex (Pizzi et al., 2023; Gaddis et al., 2022), and in fact extensive causal intervention research has shown that classical psychedelics (and 5-HT2a agonism) decrease the influence of incoming sensory stimuli on the activity of early sensory cortical areas, indicating decoupling from the sensory thalamus (Evarts et al., 1955; Azimi et al., 2020; Michaiel et al., 2019). The increased influence of higher-order thalamic nuclei is consistent with both the cortico-striatal-thalamo-cortical (CSTC) model of psychedelic action and the oneirogen hypothesis, since higher-order thalamic inputs modulate the apical dendrites of pyramidal neurons in cortex (Whyte et al., 2024).

      The second experimental result notes that DMT induces traveling waves during resting state activity that propagate from early visual cortex to deeper cortical layers (Alamia et al., 2020). There are several possibilities that could explain this phenomenon: 1) it could be due to the aforementioned difficulties associated with directed functional connectivity analyses, 2) it could be due to a possible high binding affinity for DMT in the visual cortex relative to other brain areas, or 3) it could be due to increases in apical influence on activity caused by local recurrent connectivity within the visual cortex which, in the absence of sensory input, could lead to propagation of neural activity from the visual cortex to the rest of the brain. This last possibility is closest to the model proposed by Ermentrout & Cowan (1979), which we believe would be best explained within our framework by a topographically connected recurrent network architecture trained on video data; this is a potentially fruitful direction for future research.

      Relevant modifications: Page 9, paragraph 1; Page 10, final paragraph; Page 11, final paragraph.

      Shouldn’t the hallucinations generated by your model look more ‘psychedelic,’ like those produced by the DeepDream algorithm?

      We believe that the differences in hallucination visualization quality between our Wake-Sleep-trained models and DeepDream are mostly due to differences in the scale and power of the models used across these two studies. We are confident that with more resources (and potentially theoretical innovations to improve the Wake-Sleep algorithm’s performance) the produced hallucination visualizations could become more realistic.

      We note that more powerful generative models trained with backpropagation are able to produce surreal images of comparable quality (Rezende et al., 2014; Goodfellow et al., 2020; Vahdat & Kautz, 2020), though these have not yet been used as a model of psychedelic hallucinations. However, the DeepDream model operates on top of large pretrained image processing models, and does not provide a biologically mechanistic/testable interpretation of its hallucination effects. When training smaller models with a local synaptic plasticity rule (as opposed to backpropagation), the hallucination effects are less visually striking due to the reduced quality of our trained generative model, though they are still strongly tied to the statistics of sensory inputs, as quantified by our correlation similarity metric (Fig. 5b).

      To demonstrate that our proposed hallucination mechanism is capable of producing more complex hallucinations in larger, more powerful models, we employed our same hallucination generation mechanism in a pretrained Very Deep Variational Autoencoder (VDVAE) (Child et al., 2021), which is a hierarchical variational autoencoder with a nearly identical structure compared to our Wake-Sleep-trained networks, with both a bottom-up inference pathway and a top-down generative pathway that maps cleanly onto our multicompartmental neuron model. VDVAEs are trained on the same objective function as our Wake-Sleep-trained networks, but using the backpropagation algorithm. The VDVAE models were able to generate much more complex hallucinations (emergence of complex geometric patterns, smooth deformations of objects and faces), whose complexity arguably exceeds those produced by the DeepDream algorithm. Therefore while the VDVAEs are less biologically realistic (they do not learn via local synaptic plasticity), they function as a valuable high-level model of hallucination generation that complements our Wake-Sleep-trained approach. As further validation, we were also able to replicate our key results and testable predictions with these models.

      Relevant modifications: Results section “Modeling hallucinations in large-scale pretrained networks”; Figure 6, S7, S8; Page 12, paragraph 3; Methods section “Generating hallucinations in hierarchical variational autoencoders.”

      Your model assumes domination by entirely bottom-up activity during the ‘wake’ phase, and domination entirely by top-down activity during ‘sleep,’ despite experimental evidence indicating that a mixture of top-down and bottom-up inputs influence neural activity during both stages in the brain. How do you explain this?

      Our use of the Wake-Sleep algorithm, in which top-down inputs (Sleep) or bottom-up inputs (Wake) dominate network activity, is an over-simplification made within our model for computational and theoretical reasons. Models that receive a mixture of top-down and bottom-up inputs during ‘Wake’ activity do exist (in particular the closely related Boltzmann machine (Ackley et al., 1985)), but these models are considerably more computationally costly to train due to a need to run extensive recurrent network relaxation dynamics for each input stimulus. Further, these models do not generalize as cleanly to processing temporal inputs. For this reason, we focused on the Wake-Sleep algorithm, at the cost of some biological realism, though we note that our model should certainly be extended to support mixed apical-basal waking regimes. We have added a discussion of this in our ‘Model Limitations’ section.

      Relevant modifications: Page 12, paragraph 4.

      Your model proposes that 5-HT2a agonism enhances glutamatergic transmission, but this is not true in the hippocampus, which shows decreases in glutamate after psychedelic administration.

      We should note that our model suggests only compartment-specific increases in glutamatergic transmission; as such, our model does not predict any particular directionality for measures of glutamatergic transmission that include signaling at both apical and basal compartments in aggregate, as was measured in the provided study (Mason et al., 2020).

      You claim that your model is consistent with the Entropic Brain theory, but you report increases in variance, not entropy. In fact, it has been shown that variance decreases while entropy increases under psychedelic administration. How do you explain this discrepancy?

      Unfortunately, ‘entropy’ and ‘variance’ are heavily overloaded terms in the noninvasive imaging literature, and the particularities of the method employed can exert a strong influence on the reported effects. The reduction in variance reported by (Carhart-Harris et al., 2016) is a very particular measure: they are reporting the variance of resting state synchronous activity, averaged across a functional subnetwork that spans many voxels; as such, the reduction in variance in this case is a reduction in broad, synchronous activity. We do not have any resting state synchronous activity in our network due to the simplified nature of our model (particularly an absence of recurrent temporal dynamics), so we see no reduction in variance in our model due to these effects.
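
      To make the first measure concrete, the toy simulation below (an illustration, not an analysis from the manuscript or from the cited studies) computes the variance of a subnetwork-averaged signal. It assumes each voxel is a shared synchronous component plus independent fluctuations; the names `simulate` and `subnetwork_signal_variance` and all gain values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_time = 50, 2000

def simulate(shared_gain, private_gain=1.0):
    """Toy data: each voxel = shared synchronous signal + its own noise."""
    shared = rng.normal(size=n_time)               # synchronous component
    private = rng.normal(size=(n_voxels, n_time))  # independent per-voxel part
    return shared_gain * shared + private_gain * private

def subnetwork_signal_variance(x):
    """Variance over time of the signal averaged across voxels (rows)."""
    return x.mean(axis=0).var()

strong_sync = simulate(shared_gain=1.0)  # pronounced synchronous activity
weak_sync = simulate(shared_gain=0.3)    # reduced synchronous activity

var_strong = subnetwork_signal_variance(strong_sync)
var_weak = subnetwork_signal_variance(weak_sync)
```

      Averaging across voxels suppresses the independent fluctuations, so this measure tracks the shared component almost exclusively: it falls when synchrony is reduced even though the per-voxel fluctuations are unchanged.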

      Other studies estimate ‘entropy’ or network state disorder via three different methods that we have been able to identify. 1) (Carhart-Harris et al., 2014) uses a different measure of variance: in this case, they subtract out synchronous activity within functional subnetworks, and calculate variability across units in the network. This measure reports increases in variance (Fig. 6), and is the closest measure to the one we employ in this study. 2) (Lebedev et al., 2016) uses sample entropy, which is a measure of temporal sequence predictability. It is specifically designed to disregard highly predictable signals, and so one might imagine that it is a measure that is robust to shared synchronous activity (e.g. resting state oscillations). 3) (Mediano et al., 2024) uses Lempel-Ziv complexity, which, like sample entropy, is a measure of sequence diversity; in this case the signal is binarized before calculation, which makes this method considerably different from ours. All three of the preceding methods report increases in sequence diversity, in agreement with our quantification method. Our strongest explanation for why the variance calculation in (Carhart-Harris et al., 2016) produces a variance reduction is therefore a reduction in low-rank synchronous activity in subnetworks during resting state.
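
      Two of these measures can be sketched on toy data. The code below is illustrative only and is not the pipeline of any cited study: `across_unit_variance` removes the across-unit mean before computing variability, and `lz_phrase_count` binarizes a signal at its median and counts phrases in a simple LZ78-style parse, which is one of several Lempel-Ziv variants and may differ from the one used by Mediano et al. All function names and simulation parameters are assumptions.

```python
import numpy as np

def across_unit_variance(x):
    """Mean over time of the variance across units (rows), after removing
    the across-unit mean (the synchronous component) at each time point."""
    residual = x - x.mean(axis=0, keepdims=True)
    return residual.var(axis=0).mean()

def lz_phrase_count(signal):
    """Binarize at the median, then count phrases in an LZ78-style parse."""
    bits = (np.asarray(signal) > np.median(signal)).astype(int)
    seen, phrase = set(), ""
    for b in bits:
        phrase += str(b)
        if phrase not in seen:   # new phrase: store it and start a fresh one
            seen.add(phrase)
            phrase = ""
    return len(seen) + (1 if phrase else 0)  # count a trailing partial phrase

rng = np.random.default_rng(0)
n_units, n_time = 50, 2000
shared = rng.normal(size=n_time)  # synchronous component common to all units

# Same shared signal, different amounts of unit-specific variability.
low_diversity = shared + 0.5 * rng.normal(size=(n_units, n_time))
high_diversity = shared + 1.5 * rng.normal(size=(n_units, n_time))

t = np.arange(n_time)
regular = np.sin(2 * np.pi * t / 100)  # highly predictable sequence
irregular = rng.normal(size=n_time)    # unpredictable sequence
```

      In this toy setting `across_unit_variance` is insensitive to the shared signal and grows with unit-specific variability, while `lz_phrase_count` is larger for the irregular sequence than for the regular one.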

      As for whether the entropy increase is meaningful: we share Reviewer 1’s concern that increases in entropy could simply be due to a higher degree of cognitive engagement during resting state recordings, due to the presence of sensory hallucinations or due to an inability to fall asleep. This could explain why entropy increases are much more minimal relative to non-hallucinating conditions during audiovisual task performance (Siegel et al., 2024; Mediano et al., 2024). However, we can say that our model is consistent with the Entropic Brain Theory without including any form of ‘cognitive processing’: we observe increases in variability during resting state in our model, but we observe highly similar distributions of activity when averaging over a wide variety of sensory stimulus presentations (Fig. 5b-c). This is because variability in our model is not due to unstructured noise: it corresponds to an exploration of network states that would ordinarily be visited by some stimulus. Therefore, when averaging across a wide variety of stimuli, the distribution of network states under hallucinating or non-hallucinating conditions should be highly similar.

      One final point of clarification: here we are distinguishing Entropic Brain Theory from the REBUS model. The oneirogen hypothesis is consistent with the increase in entropy observed experimentally, but in our model this entropy increase is not due to increased influence of bottom-up inputs (it is due instead to an increase in top-down influence). Therefore, one could view the oneirogen hypothesis as consistent with EBT, but inconsistent with REBUS.

      Relevant modifications: Page 10, paragraph 1.

      You relate your plasticity rule to behavioral-timescale plasticity (BTSP) in the hippocampus, but plasticity has been shown to be reduced in the hippocampus after psychedelic administration. Could you elaborate on this connection?

      When we were establishing a connection between our ‘Wake-Sleep’ plasticity rule and BTSP learning, the intended connection was exclusively to the mathematical form of the plasticity rule, in which activity in the apical dendrites of pyramidal neurons functions as an instructive signal for plasticity in basal synapses (and vice versa): we will clarify this in the text. Similarly, we point out that such a plasticity rule tends to result in correlated tuning between apical and basal dendritic compartments, which has been observed in hippocampus and cortex: this is intended as a sanity check of our mapping of the Wake-Sleep algorithm to cortical microcircuitry, and has limited further bearing on the effects of psychedelics specifically.

      Reduction in plasticity in the hippocampus after psychedelic administration could be due to a complementary learning systems-type model, in which the hippocampus becomes partly decoupled from the cortex during REM sleep (Singh et al., 2022); were this to be the case, it would not be incompatible with our model, which is mostly focused on the cortex. Notably, potentiating 5-HT2a receptors in the ventral hippocampus does not induce the head-twitch response, though it does produce anxiolytic effects (Tiwari et al., 2024), indicating that the hallucinatory and anxiolytic effects of classical psychedelics may be partly decoupled.

      Reviewer 2 Concerns:

      Could you provide visualizations of the ‘ripple’ phenomenon that you’re referring to?

      In our revised submission, ‘ripple’ phenomena are now visible in two places: Fig 2c-d, and Fig 6 (rows 2 and 3). Because the VDVAE models used to generate Figure 6 produce higher quality generated images, the ripples appearing in these plots are likely more prototypical, but it is not easy to evaluate the quality of these visualizations relative to subjective hallucination phenomena.

      Could you provide a more nuanced description of alternative roles for top-down feedback, beyond being used exclusively for learning as depicted in your model?

      For the sake of simplicity, we only treat top-down inputs in our model as a source of an instructive teaching signal, the originator of generative replay events during the Sleep phase, and as the mechanism of hallucination generation. However, as discussed in a response to a previous question, in the cortex pyramidal neurons receive and respond to a mixture of top-down and bottom-up processing.

      There are a variety of theories for what role top-down inputs could play in determining network activity. To name several, top-down input could function as: 1) a denoising/pattern completion signal (Kadkhodaie & Simoncelli, 2021), 2) a feedback control signal (Podlaski & Machens, 2020), 3) an attention signal (Lindsay, 2020), 4) ordinary inputs for dynamic recurrent processing that play no specialized role distinct from bottom-up or lateral inputs except to provide inputs from higher-order association areas or other sensory modalities (Kar et al., 2019; Tugsbayar et al., 2025). Though our model does not include these features, they are perfectly consistent with our approach.

      In particular, denoising/pattern completion signals in the predictive coding framework (closely related to the Wake-Sleep algorithm) also play a role as an instructive learning signal (Salvatori et al., 2021); and top-down control signals can play a similar role in some models (Gilra & Gerstner, 2017; Meulemans et al., 2021). Thus, options 1 and 2 are heavily overlapping with our approach, and are a natural consequence of many biologically plausible learning algorithms that minimize a variational free energy loss (Rao & Ballard, 1997; Ackley et al., 1985). Similarly, top-down attentional signals can exist alongside top-down learning signals, and some models have argued that such signals can be heavily overlapping or mutually interchangeable (Roelfsema & van Ooyen, 2005). Lastly, generic recurrent connectivity (from any source) can be incorporated into the Wake-Sleep algorithm (Dayan & Hinton, 1996), though we avoided doing this in the present study due to an absence of empirical architecture exploration in the literature and the computational complexity associated with training on time series data.

      To conclude, there are a variety of alternative functions proposed for top-down inputs onto pyramidal neurons in the cortex, and we view these additional features as mutually compatible with our approach; for simplicity we did not include them in our Wake-Sleep-trained model, but we believe that these features are unlikely to interfere with our testable predictions or empirical results. In fact, the pretrained VDVAE models that we worked with do include top-down influence during the Wake-stage inference process, and these models recapitulated our key results and testable predictions (Fig. S8).

      Relevant modifications: Fig. S8; Page 12, paragraph 4.

    1. Author response:

      The following is the authors’ response to the original reviews.

      We thank the editor and reviewers for their constructive questions, valuable feedback, and for approving our manuscript. We truly appreciate the opportunity to improve our work based on their insightful comments. Before addressing the editor’s and each referee’s remarks individually, we provide below a point-by-point response summarizing the revisions made.

      Duplication of control groups across experiments

      We appreciate the reviewers’ concern regarding the potential duplication of control groups. In the revised manuscript, we have explicitly clarified that independent groups of control mice were used for each experiment. These details are now clearly indicated in the Materials and Methods section to avoid any ambiguity and to reinforce the rigor of our experimental design (Page 15, Lines 453-455): “Furthermore, knockout animals and those treated with pharmacological inhibitors or neutralizing antibodies shared the same control groups (chow and HFCD), as required by the animal ethics committee.”

      Validation of the MASLD model

      To strengthen the metabolic characterization of our MASLD model, we have now included additional parameters, including liver weight, Picrosirius staining and blood glucose measurements. These data are presented as new graphs in the revised manuscript and support the metabolic relevance of the HFCD diet model (Supplementary Figure S1). The corresponding description has been added to the Results section (Page 5, Lines 116-117) as follows: “Mice fed HFCD showed no increase in liver weight and collagen deposition as evidenced by Picrosirius staining (Fig. S1A and Fig. S1C)”.

      Assessment of liver injury in RagKO and anti-NK1.1 mice

      We fully agree that assessment of liver injury is essential for these models. For mice treated with anti-NK1.1, ALT levels are shown in Figure 4G, confirming increased liver injury after treatment. Regarding Rag⁻/⁻ mice, the animals exhibit exacerbation of liver injury when fed a HFCD diet and challenged with LPS (Page 7, Lines 183–184). The corresponding description has been added to the Results section (Page 7, Lines 175-176) as follows: “Interestingly, Rag1-deficient animals under the HFCD remained susceptible to the LPS challenge (Fig. 4C) with exacerbation of liver injury (Fig. 4D)”.

      Discussion of limitations

      We have expanded the Discussion section to provide a more comprehensive and balanced perspective on the limitations of our model and experimental approach (Pages 13-14, Lines 401–414): “Our study presents several limitations that should be acknowledged and discussed. First, we cannot entirely rule out the possibility that our mice deficient in pro-inflammatory components exhibit reduced responsiveness to LPS. However, our ex vivo analyses using splenocytes from these animals revealed a preserved cytokine production following LPS stimulation. These results suggest that the in vivo differences observed are primarily driven by the MAFLD condition rather than by intrinsic defects in LPS sensitivity. Second, the absence of publicly available single-cell RNA-seq datasets from MAFLD subjects under endotoxemic or septic conditions limited our ability to perform direct translational comparisons. To overcome this, we analyzed existing MAFLD patients and experimental MAFLD datasets, which consistently demonstrated upregulation of IFN-γ and TNF-α inflammatory pathways in MAFLD. In line with these findings, our murine model revealed TNF-α⁺ myeloid and IFN-γ⁺ NK cell populations, thereby reinforcing the validity and translational relevance of our results.” This revision highlights the constraints of the MASLD model, the inherent variability among in vivo experiments, and the interpretative limitations related to immunodeficient mouse strains.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) In Figure 4 the authors are showing the number of IFN+ positive CD4, CD8, and NK 1.1+ cells. Could they show from total IFNg production, how much it goes specifically on NK cells and how much on other cell populations since NK1.1 is NK but also NKT and gamma delta T cell marker? Also, in Figure 2E the authors see a substantial increase in IFNg signal in T cells.

      While we did not specifically assess IFNγ production in NKT cells or other minor populations, our data indicate that the NK1.1+CD3+ cells (NKT cells) cited on Page 7, Lines 188-192 were essentially absent in the liver tissue of LPS-challenged animals, as shown in Supplementary Figures 3C and S10. The corresponding description has been added to the Results section (Page 7, Lines 188-192) as follows: “We observed that the number of NK cells increased in the liver tissue of PBS-treated MAFLD mice compared with mice fed a control diet (Fig. 4E). LPS challenge increased the accumulation of NK1.1+CD3− NK cells in the liver tissue of MAFLD mice and the absence of NK1.1+CD3+ NKT cells (Fig. S3C and 4E)”.

      This absence was consistent across all experimental conditions, corroborating our focus on NK1.1+CD3− cells as the primary source of NK1.1-associated IFNγ production. Furthermore, data demonstrated in Figure 2E illustrate the presence of IFNγ primarily in NK cells. Therefore, the observed IFNγ signal, attributed to NK1.1+ cells, predominantly reflects conventional NK cells, with minimal contribution from NKT or γδ T cells.

      (2) In Figure 4C, the authors state that the results suggest that T and B cells do not contribute to susceptibility to LPS challenge. However, they observe a drop in survival compared to chow+LPS. Are the authors certain there is no statistical significance there?

      The observed decrease in survival is consistent with our expectations, as T and B cells are not the primary source of interferon-gamma (IFNγ) in this context. Even in their absence, animals remain susceptible to LPS challenge due to the presence of other IFNγ-producing cells that drive the observed lethality. We have carefully re-examined the statistical analysis and confirm that it was correctly performed.  

      (3) Since the survival curve and rate are exactly the same (60%) in Figures 3F, 3G, 4C, 4F, 5G, and 5H I would just like to double-check that the authors used different controls for each experiment.

      The number of mice used in each experiment was carefully determined to ensure sufficient statistical power while fully complying with the limits established by our institutional Animal Ethics Committee. To minimize animal use, the same control group was shared across multiple survival experiments. Despite using shared controls, the total number of animals per experimental group was adequate to produce robust and reproducible survival outcomes. All groups were properly randomized, and the shared control data were rigorously incorporated into statistical analyses. This strategy allowed us to maintain both ethical standards and the scientific rigor of our findings.

      (4) In Figure 5 the authors are saying that it is neutrophils but not monocytes mediate susceptibility of animals with NAFLD to endotoxemia. However, CXCR2i depletion and CCR2 knock out mice affect both monocytes/macrophages and neutrophils. And in Figures 5E, 5G, and 5H they see that a) LPS+CXCR2i decreases liver damage more than LPS+anti Ly6G, b) HFCD mice challenged with LPS and treated with anti-LY6G do not rescue survival to levels of CHOW LPS and c) anti Ly6G treatment helps less than CXCR2i. Therefore, from both knock out mice and depletion experiments the authors can conclude that most likely monocytes (but potentially also other cells) together with neutrophils are substantial for the development of endotoxemic shock in choline-deficient high-fat diet model.

      While neutrophils express CCR2, our data clearly show that CCR2 deficiency does not impair neutrophil migration, as demonstrated in Supplemental Figures 5A and 5B (added to the manuscript, page 8, lines 213–217). The corresponding description has been added to the Results section (Page 8, Lines 213–217) as follows: “Interestingly, animals deficient in monocyte migration (CCR2-/-) showed a high mortality rate compared to wild type after LPS challenge and neutrophil migration is not altered (Fig. 5SA and Fig. 5SB)”. In contrast, CCR2 deficiency primarily affects monocyte recruitment, yet in our experimental conditions, monocyte depletion or CCR2 knockout did not significantly alter the severity of endotoxemic shock, indicating that monocytes play a minimal role in mediating susceptibility in HFCD-fed mice.

      To specifically investigate neutrophils, we used pharmacological blockade of CXCR2 to inhibit migration and antibody-mediated neutrophil depletion. Both approaches have consistently demonstrated that neutrophils are critical factors in endotoxemic shock.

      These findings support our conclusion that neutrophils are the primary cellular contributors to susceptibility in HFCD-fed mice during endotoxemia, with monocytes making a negligible contribution under the tested conditions.

      (6) In Figure 6A (but also others with PD-L1) did the authors do isotype control? And can they show how much of PD1+ population goes on neutrophils, and how much on all the other populations?

      To address this issue, we performed additional analyses to assess the distribution of PD-L1 expression on CD45+CD11B+ leukocytes. These new results, detailed on Page 9, lines 245-250, and now presented in Supplemental Figure 6, demonstrate that PD-L1 expression is predominantly enriched in neutrophils compared to other immune subsets. This observation further reinforces our conclusion that neutrophils represent a major source of PD-L1 in our experimental model.

      To ensure the robustness of these findings, we also included FMO controls for PD-L1 staining in the newly added Supplemental Figure S6. These controls validate the specificity of our gating strategy and confirm the reliability of the detected PD-L1 signal. The corresponding description has been added to the Results section (Page 9, Lines 245-250) as follows: “First, we observed that only the MAFLD diet caused a significant increase in PD-L1 expression in CD45+CD11b+ leukocytes after LPS challenge (Fig. S6C). We observed that within this population, neutrophils predominate in their expression when compared to monocytes (Fig. 6SA, Fig. 6SB, and Fig. 6SD). Furthermore, PD-L1+ neutrophils showed an exacerbated migration of PD-L1+ neutrophils towards the liver (Fig. 6A and 6B)”.

      (7) In Figure 6D it is interesting that there is not an increase in PD-L1+ neutrophils in LPS HFCD IFNg+/+ mice in comparison to LPS chow IFNg+/+ mice, since those should be like WT mice (Figure 6A going from 50% to 97%) and so an increase should be seen?

      The apparent difference between Figures 6A and 6D likely reflects inter-experimental variability rather than a biological discrepancy. Although the absolute percentages of PD-L1⁺ neutrophils varied slightly among independent experiments, the overall phenotype and trend were consistently maintained, namely that PD-L1 expression on neutrophils is enhanced in response to LPS stimulation and modulated by IFNγ signaling. Thus, the data shown in Figure 6D are representative of this consistent phenotype despite minor quantitative variation.

      (8) In Figure 7 do the authors have isotype control for TNFa because gating seems a bit random so an isotype control graph would help a lot as supplementary information, in order to make the figure more persuasive

      To address the concern regarding gating in Figure 7, we have included the FMO control for TNFα as a histogram in Supplementary Figure 8G. These controls reaffirm the accuracy and reliability of our gating strategy for TNFα, further supporting the robustness of our data. The corresponding description has been added to the Results section (Page 9, Lines 272-274) as follows: “We observed an exacerbated TNF-α expression by PD-L1+ neutrophils from MAFLD when compared to control chow animals (Fig. 7A, Fig. 7B, Fig. 7D, and Fig. 8SG)”.

      (9) Figure 6C IFNg+/+ mice on CHOW +LPS is same as Figure 8E mice chow +LPS but just with different numbers. Can the authors explain this?

      Although the data points in Figures 6C and 8E may appear similar, we confirm that they originate from entirely independent experiments and represent distinct datasets. To enhance clarity and avoid any potential confusion, we have adjusted the figure presentation and sizing in the revised manuscript. These changes make it clear that the datasets, while comparable, are derived from separate experimental replicates.

      (10) Figure 1E chow B6+LPS is the same as Figure 5D B6+LPS but should they be different since those should be two different experiments?

      We confirm that Figures 1E and 5D correspond to data obtained from independent experiments. Although the experimental conditions were similar, each dataset was generated and analyzed separately to ensure the reproducibility and robustness of our results.

      Reviewer #2 (Recommendations for the authors):

      (1) Why did you look at kidney injury in Figure 1D? I think this should be explained a little.

      We assessed kidney injury alongside ALT, a marker of liver damage, because both the liver and kidneys are among the primary organs affected during sepsis and endotoxemia. This rationale has been added to the manuscript (page 5, lines 129–131): “Remarkably, compared to the Chow group, HFCD mice exposed to LPS did not show greater changes in other organs commonly affected by endotoxemia, such as the kidneys (Figure 1D).” By evaluating markers of injury in both organs, we aimed to determine whether our physiopathological condition was liver-specific or indicative of broader systemic injury.

      (2) I know Figure 2C isn't your data, but why are there so few NK cells, considering NK cells are a resident liver cell type? Doesn't that also bring into question some of your data if there are so few NK cells? And the IFNG expression (2E) looks to mostly come from T-cells (CD8?).

      The data shown in Figure 2C were reanalyzed from a separate NAFLD model based on a 60% high-fat diet. Although this model differs from ours, the observed low number of NK cells is consistent with expectations for animals subjected solely to a hyperlipidic diet, which primarily provides an inflammatory stimulus that promotes recruitment rather than maintaining high baseline NK cell numbers.

      In our experimental model, these observations align with published data. Specifically, liver tissue from NAFLD animals typically exhibits low baseline NK cell numbers, but upon LPS challenge, there is a marked increase in NK cell recruitment to the liver. This dynamic illustrates the interplay between diet-induced inflammation and immune cell recruitment in our experimental context and supports the interpretation of our IFNγ data.

      (3) In your methods, I think you didn't explain something. You said LPS was administered to 5-6 week old mice, but that HFCD diet was started in 5-6 week old mice and lasted 2 weeks, then LPS was administered. So LPS administration happened when the mice were 7-8 weeks old, right?

      We thank the reviewer for pointing out this inconsistency in our Methods section. The reviewer is correct: the HFCD diet was initiated in 5–6-week-old mice, and LPS was administered after 2 weeks on the diet, such that LPS challenge occurred when the mice were 7–8 weeks old.

      We have revised the Methods section (page 15-16, lines 474–480) to clarify this timeline and ensure it is accurately described in the manuscript. The corresponding description has been added to the Materials and Methods section (Page 14, Lines 436-442) as follows: “Lipopolysaccharide (LPS; Escherichia coli (O111:B4), L2630, Sigma-Aldrich, St. Louis, MO, USA) was administered intraperitoneally (i.p.; 10 mg/kg) in C57BL/6, CCR2 -/-, IFN-/-, and TNFR1R2 -/- mice. The HFCD was initiated in 5–6 week-old mice, and LPS was administered after 2 weeks on the diet, meaning that LPS administration occurred when the mice were 7–8 weeks old, with body weights ranging from 22 to 26 g. LPS was previously solubilized in sterile saline and frozen at -70°C. The animals were euthanized 6 hours after LPS administration”.

      (4) Throughout the manuscript, I would consider changing the term NAFLD to something else. I think HFCD diet is a closer model to NASH, so there needs to be some discussion on that. And the field is changing these terms, so NAFLD is now MASLD and NASH is now MASH.

      We appreciate the reviewer’s comment regarding the terminology and disease classification. In our experimental conditions, the animals were subjected to a high-fat, choline-deficient (HFCD) diet for only two weeks, a period considered very early in the progression of diet-induced liver disease. At this stage, histological analysis revealed lipid accumulation in hepatocytes without evidence of hepatocellular injury, inflammation, or fibrosis. Therefore, our model more closely resembles the metabolic-associated fatty liver disease (MAFLD, formerly NAFLD) stage rather than the more advanced metabolic-associated steatohepatitis (MASH, formerly NASH).

      Indeed, prolonged exposure to HFCD diets, typically 8 to 16 weeks, is required to induce the inflammatory and fibrotic features characteristic of MASH. Since our objective was to study the initial metabolic and immune alterations preceding overt liver injury, we believe that using the term MAFLD more accurately reflects the pathological stage represented in our model. Accordingly, we have revised the text to align with the updated nomenclature and disease context.

      (6) I am concerned about over interpretation of the publicly available RNA-seq data in Figure 2. This data comes from human NAFLD patients with unknown endotoxemia and mouse models using a traditional high-fat diet model. So it is hard to compare these very disparate datasets to yours. Also, if these datasets have elevated IFNG, why does your model require LPS injection?

      We thank the reviewer for their thoughtful comments regarding the interpretation of the RNA-seq data presented in Figure 2. We would like to clarify that the human NAFLD datasets referenced in our study do not specifically include patients with endotoxemia; rather, they focus on individuals with NAFLD alone.

      Comparing data from human and murine MAFLD models, we observed that NK cells, T cells, and neutrophils are present and contribute to the hepatic inflammatory environment. Our reanalysis indicates that the elevations of IFNγ and TNF in NAFLD are primarily derived from NK cells, T cells, and myeloid cells, respectively.

      In our experimental model, LPS administration was used to evaluate whether these immune populations, particularly NK cells, are further potentiated under a hyperinflammatory state, leading to exacerbated IFNγ production. This approach allows us to determine whether increased IFNγ contributes to worsening outcomes in NAFLD, providing mechanistic insights that cannot be obtained from static human or traditional mouse datasets alone.

      (7) The zoom-ins for the histology (for example, Figure 1E) don't look right compared to the dotted square. The shape and area expanded don't match. And the cells in the zoom-in don't look exactly the same either.

      We have thoroughly re-examined the histological sections and the corresponding zoom-ins, including the example in Figure 1E. Upon verification, we confirm that the zoom-ins accurately represent the highlighted areas indicated by the dotted squares. The apparent discrepancies in shape or cellular appearance are likely due to minor differences in orientation or cropping during figure preparation. Nevertheless, the content and regions depicted are consistent with the original sections.  

      (8) Did the authors measure myeloid infiltration in the CCR2-/- mice? Did you measure Neutrophil infiltration in the TNF-Receptor KO mice?

      Analysis of CD45+ cell migration in CCR2 knockout mice, as shown in Supplemental Figure 5C and 5D, demonstrates that the absence of CCR2 does not impair overall leukocyte migration. Similarly, assessment of neutrophil migration in TNF receptor (TNFR1/2) knockout mice, presented in Supplemental Figure 8A, shows that neutrophil trafficking is not affected in these animals. These results indicate that the respective knockouts do not compromise the migration of the analyzed immune populations, supporting the interpretations presented in our study.

      (9) Regarding Methods for RNA-seq Analysis. Was the Mitochondrial percentage cutoff 0.8%, because that seems low. And was there not a Padj or FDR cutoff for the differential expression?

      The mitochondrial percentage in our scRNA-seq analysis reflects the proportion of mitochondrial gene expression per cell, which serves as a quality control metric. A low mitochondrial gene expression percentage, such as the 0.8% cutoff used here, is indicative of highly viable cells.

      For differential gene expression analysis, we employed the FindMarkers function in Seurat with standard parameters: adjusted p-value (Padj) < 0.05 and log2 fold change > 0.25 for upregulated genes, and adjusted p-value < 0.05 with log2 fold change < -0.25 for downregulated genes. These thresholds ensure robust identification of differentially expressed genes while balancing sensitivity and specificity.
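
The thresholds stated above can be illustrated with a minimal Python sketch. This is not the authors' Seurat code: the function name `classify_genes`, the tuple layout, and the example rows are hypothetical placeholders, and only the cutoffs (adjusted p-value < 0.05, log2 fold change above 0.25 or below -0.25) are taken from the response.

```python
from typing import Iterable, List, Tuple

# Each row is (gene, log2 fold change, adjusted p-value).
# This layout is an assumption made for illustration only.
Row = Tuple[str, float, float]


def classify_genes(
    rows: Iterable[Row],
    padj_max: float = 0.05,
    lfc_min: float = 0.25,
) -> Tuple[List[str], List[str]]:
    """Split genes into up- and downregulated lists using the stated cutoffs."""
    up: List[str] = []
    down: List[str] = []
    for gene, log2fc, padj in rows:
        if padj >= padj_max:
            continue  # fails the adjusted p-value filter
        if log2fc > lfc_min:
            up.append(gene)
        elif log2fc < -lfc_min:
            down.append(gene)
    return up, down


# Placeholder values, not data from the study.
example_rows = [
    ("geneA", 1.20, 0.001),   # significant, upregulated
    ("geneB", -0.80, 0.010),  # significant, downregulated
    ("geneC", 0.10, 0.001),   # significant, but effect size too small
    ("geneD", 2.00, 0.200),   # large effect, but not significant
]
up_genes, down_genes = classify_genes(example_rows)
```

Applying both filters together is what separates a gene such as `geneC` (significant but with a small effect) from the reported differentially expressed set.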

      (10) Regarding Methods for Flow Cytometry. How were IFNG and TNF staining performed? Was this an intracellular stain? Did you need to block secretion? TNF and IFNG antibodies have the same fluorophore (PE), so were these stainings and analyses performed separately?

      Six hours after LPS challenge, non-parenchymal liver cells were isolated using Percoll gradient centrifugation. Because the animals were in a hyperinflammatory state induced by LPS, no in vitro stimulation was performed; all staining was carried out immediately after cell isolation. Detection of IFNγ and TNF was performed via intracellular staining using the Foxp3 staining kit (eBioscience). Due to both antibodies being conjugated to PE, IFN-γ and TNF-α staining and analyses were conducted in separate experiments. These distinct staining protocols and analyses are detailed in Supplemental Figures 10 and 11. The corresponding description has been added to the Materials and Methods section (Page 16, Lines 490-493) as follows: “As animals were already in a hyperinflammatory state, no additional in vitro stimulation was required. Intracellular detection of IFN-γ and TNF-α was conducted using the Foxp3 staining kit (eBioscience). Since both antibodies were conjugated to PE, staining and analyses were performed in separate experiments”.

      Reviewer #3 (Recommendations for the authors):

      (1) Achieving an NAFLD model/disease is the starting point of this study. I understand that a two-week HFCD diet period was applied due to the decrease in lymphocyte numbers. Was it enough to initiate NAFLD then? Or is it a milder metabolic disease? Which parameters have been evaluated to accept this model as a NAFLD model?

      Indeed, the two-week HFCD diet induces an early-stage form of NAFLD, characterized by initial fat accumulation in the liver without significant hepatic injury. While this represents a milder metabolic phenotype, it is sufficient to study the inflammatory and immune responses associated with NAFLD. To validate this model, we assessed multiple parameters: liver weight, blood glucose levels, and collagen deposition. These measurements confirmed the presence of early-stage NAFLD features in the animals, providing a relevant and reliable context for investigating susceptibility to endotoxemia and immune cell dynamics. They are shown in Supplementary Figure 1, and the text was included in the manuscript (Page 5, Lines 116-117): “Mice fed HFCD showed no increase in liver weight and collagen deposition as evidenced by Picrosirius staining (Fig. S1A and Fig. S1C)”.

      (2) It is true that the CD274 gene (encoding PD-L1) and the IFNGR2 gene, corresponding to the IFNγ receptor, are among the upregulated genes when authors analyzed the publicly available RNAseq data but they are not the most significantly elevated genes. What is the reasoning behind this cherrypicking? Why are other high DEGs not analyzed but these two are analyzed?

      We highlighted the expression of the IFN-γ receptor (IFNGR2) and CD274 (encoding PD-L1) in the publicly available RNA-seq data to align and corroborate these findings with the key results observed later in our study. To avoid redundancy, we chose to present these genes in the initial figures as they are directly relevant to the subsequent analyses. Regarding the broader analysis of human RNA-seq data, our primary objective was to identify enriched biological processes and pathways, which served as a foundation for the focus and direction of this study.

      (3) Figures 3C-3G: I understand that IFNg-/- and NFR1R2a-/- mice are not showing elevated liver damage but it may simply be because of the non-responsiveness to the LPS challenge. I suggest using a different challenge or recovery experiments with the cytokines to show that the challenge is successful and results are caused by NAFLD, truly. The same goes for Figure 6: Looking at Figure 6D one may think that IFNg deficiency alters the LPS response independent of the diet condition (or NAFLD condition).

      We appreciate the reviewer’s insightful comment and fully understand the concern regarding the potential non-responsiveness of IFN-γ⁻/⁻ and TNFR1R2a⁻/⁻ mice to the LPS challenge. To address this point and confirm that these knockout animals are indeed responsive to LPS stimulation, we conducted an additional set of ex vivo experiments.

      Specifically, WT and cytokine-deficient (IFN-γ⁻/⁻) mice were fed either Chow or HFCD for two weeks, after which spleens were collected, and splenocytes were challenged in vitro with LPS. We then quantified TNF, IFN, and IL-6 production to confirm that these mice are capable of mounting cytokine responses upon LPS stimulation.

      Due to current breeding limitations and a temporary issue in colony maintenance of TNF-deficient mice, we were unable to include TNFR1R2a⁻/⁻ animals in this additional experiment. Nevertheless, we prioritized performing the analysis with the available knockout line to avoid leaving this important point unaddressed.

      These additional data demonstrate that IFN-γ-deficient mice remain responsive to LPS, reinforcing that the differences observed in vivo are related to the NAFLD condition rather than a lack of LPS responsiveness.

      (4) Figure 1 vs Figure 4: Rag-/- mice seem more susceptible to LPS-derived death even after normal conditions. But If I compare the survival data between Figure 1 and Figure 4, Rag-/- HFCD diet mice seem to be doing better than wt mice after LPS treatment. (1 day survival vs 2 days survival). How do you explain these different outcomes?

      We thank the reviewer for this insightful question regarding the survival data in Figures 1 and 4. Although there is a one-day difference in survival outcomes, Rag-/- mice consistently exhibit increased susceptibility to LPS-induced mortality, and inter-experimental variability can influence the exact survival timing. Nonetheless, across all experiments, Rag-/- mice display a reproducible phenotype of heightened sensitivity to LPS challenge, which is supported by multiple independent observations in our study.

      (5) How do you explain Figure 4J in connection to the observation presented with Figure 7: TNFa tissue levels, even though significant, seem very similar between the conditions?

      We would like to clarify that the animals in this study are in a metabolic syndrome state, with early-stage NAFLD characterized by hepatic fat accumulation without significant tissue injury, as shown in Figure 1C.

      Under these conditions, the LPS challenge triggers an exacerbated inflammatory response, leading to increased secretion of IFN-γ and TNF-α, primarily from NK cells and neutrophils. While TNFα levels may appear visually similar across conditions, the HFCD mice exhibit a heightened predisposition for an amplified immune response compared to chow-fed mice. This difference is consistent with the functional outcomes observed in our study and highlights the diet-specific sensitization of the immune system.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:  

      Reviewer #1 (Public review):  

      Summary:  

      The image analysis pipeline is tested in analysing microscopy imaging data of gastruloids of varying sizes, for which an optimised protocol for in toto image acquisition is established based on whole mount sample preparation using an optimal refractive index matched mounting media, opposing dual side imaging with two-photon microscopy for enhanced laser penetration, dual view registration, and weighted fusion for improved in toto sample data representation. For enhanced imaging speed in a two-photon microscope, parallel imaging was used, and the authors performed spectral unmixing analysis to avoid issues of signal cross-talk.  

      In the image analysis pipeline, different pre-treatments are done depending on the analysis to be performed (for nuclear segmentation - contrast enhancement and normalisation; for quantitative analysis of gene expression - corrections for optical artifacts inducing signal intensity variations). Stardist3D was used for the nuclear segmentation. The study analyses properties of gastruloid nuclear density, patterns of cell division, morphology, deformation, and gene expression.

      Strengths:  

      The methods developed are sound, well described, and well-validated, using a sample challenging for microscopy, gastruloids. Many of the established methods are very useful (e.g. registration, corrections, signal normalisation, lazy loading bioimage visualisation, spectral decomposition analysis), facilitate the development of quantitative research, and would be of interest to the wider scientific community.

      We thank the reviewer for this positive feedback.

      Weaknesses:  

      A recommendation should be added on when or under which conditions to use this pipeline. 

      We thank the reviewer for this valuable feedback. We added the following text in the revised version, lines 418 to 474: “In general, the pipeline is applicable to any tissue, but it is particularly useful for large and dense 3D samples (such as organoids, embryos, explants, spheroids, or tumors) that are typically composed of multiple cell layers and have a thickness greater than 50 µm”.

      “The processing and analysis pipeline are compatible with any type of 3D imaging data (e.g. confocal, 2 photon, light-sheet, live or fixed)”.

      “Spectral unmixing to remove signal cross-talk of multiple fluorescent targets is typically more relevant in two-photon imaging due to the broader excitation spectra of fluorophores compared to single-photon imaging. In confocal or light-sheet microscopy, alternating excitation wavelengths often circumvents the need for unmixing. Spectral decomposition performs even better with true spectral detectors; however, these are usually not non-descanned detectors, which are more appropriate for deep tissue imaging. Our approach demonstrates that simultaneous cross-talk-free four-color two-photon imaging can be achieved in dense 3D specimen with four non-descanned detectors and co-excitation by just two laser lines. Depending on the dispersion in optically dense samples, depth-dependent apparent emission spectra need to be considered”.

      “Nuclei segmentation using our trained StarDist3D model is applicable to any system under two conditions: (1) the nuclei exhibit a star-convex shape, as required by the StarDist architecture, and (2) the image resolution is sufficient in XYZ to allow resampling. The exact sampling required is object- and system-dependent, but the goal is to achieve nearly isotropic objects with diameters of approximately 15 pixels while maintaining image quality. In practice, images containing objects that are natively close to or larger than 15 pixels in diameter should segment well after resampling. Conversely, images with objects that are significantly smaller along one or more dimensions will require careful inspection of the segmentation results”.
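
The resampling criterion described above (nearly isotropic objects about 15 pixels in diameter) can be turned into per-axis zoom factors. The sketch below is an illustration under stated assumptions, not the pipeline's own code: the function name `isotropic_zoom`, the (z, y, x) axis order, and the voxel size and nucleus diameter in the example are all hypothetical.

```python
from typing import Sequence, Tuple


def isotropic_zoom(
    voxel_size_um: Sequence[float],
    object_diameter_um: float,
    target_diameter_px: float = 15.0,
) -> Tuple[float, ...]:
    """Return per-axis zoom factors that make voxels isotropic and give
    objects a diameter of roughly `target_diameter_px` pixels.

    voxel_size_um: physical voxel size along each axis, e.g. (z, y, x).
    object_diameter_um: typical physical diameter of the objects (nuclei).
    """
    if object_diameter_um <= 0 or target_diameter_px <= 0:
        raise ValueError("diameters must be positive")
    # Voxel edge length (in µm) at which one object spans the target pixel count.
    target_voxel_um = object_diameter_um / target_diameter_px
    # A zoom above 1 upsamples an axis, below 1 downsamples it.
    return tuple(v / target_voxel_um for v in voxel_size_um)


# Hypothetical acquisition: 2.0 µm steps in z, 0.5 µm pixels in xy,
# and nuclei about 7.5 µm across.
zoom_zyx = isotropic_zoom((2.0, 0.5, 0.5), object_diameter_um=7.5)
# zoom_zyx == (4.0, 1.0, 1.0): only z needs upsampling in this example.
```

Factors of this kind could be passed to a generic resampling routine such as `scipy.ndimage.zoom`. A large zoom along one axis is a sign of the strong anisotropy for which the text advises inspecting the segmentation results carefully.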

      “Normalization is broadly applicable to multicolor data when at least one channel is expected to be ubiquitously expressed within its domain. Wavelength-dependent correction requires experimental calibration using either an ubiquitous signal at each wavelength. Importantly, this calibration only needs to be performed once for a given set of experimental conditions (e.g., fluorophores, tissue type, mounting medium)”.

      “Multi-scale analysis of gene expression and morphometrics is applicable to any 3D multicolor image. This includes both the 3D visualization tools (Napari plugins) and the various analytical plots (e.g., correlation plots, radial analysis). Multi-scale analysis can be performed even with imperfect segmentation, as long as segmentation errors tend to cancel out when averaged locally at the relevant spatial scale. However, systematic errors—such as segmentation uncertainty along the Z-axis due to strong anisotropy—may accumulate and introduce bias in downstream analyses. Caution is advised when analyzing hollow structures (e.g., curved epithelial monolayers with large cavities), as the pipeline was developed primarily for 3D bulk tissues, and appropriate masking of cavities would be needed”.

      Reviewer #2 (Public review):  

      Summary:  

      This study presents an integrated experimental and computational pipeline for high-resolution, quantitative imaging and analysis of gastruloids. The experimental module employs dual-view two-photon spectral imaging combined with optimized clearing and mounting techniques to image whole-mount immunostained gastruloids. This approach enables the acquisition of comprehensive 3D images that capture both tissue-scale and single-cell level information.  

      The computational module encompasses both pre-processing of acquired images and downstream analysis, providing quantitative insights into the structural and molecular characteristics of gastruloids. The pre-processing pipeline, tailored for dual-view two-photon microscopy, includes spectral unmixing of fluorescence signals using depth-dependent spectral profiles, as well as image fusion via rigid 3D transformation based on content-based block-matching algorithms. Nuclei segmentation was performed using a custom-trained StarDist3D model, validated against 2D manual annotations, and achieving an F1 score of 85 ± 3% at a 50% intersection-over-union (IoU) threshold. Another custom-trained StarDist3D model enabled accurate detection of proliferating cells and the generation of 3D spatial maps of nuclear density and proliferation probability. Moreover, the pipeline facilitates detailed morphometric analysis of cell density and nuclear deformation, revealing pronounced spatial heterogeneities during early gastruloid morphogenesis.
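
For readers unfamiliar with the metric, an F1 score at an IoU threshold can be sketched as follows. This is a generic greedy-matching illustration, not the evaluation code used in the study: the function name `f1_at_iou` and the IoU matrix in the example are hypothetical, and the matrix is assumed to be precomputed with one row per annotated nucleus and one column per predicted nucleus.

```python
from typing import List


def f1_at_iou(iou: List[List[float]], threshold: float = 0.5) -> float:
    """F1 score from a ground-truth x prediction IoU matrix.

    iou[i][j] is the IoU between ground-truth object i and predicted object j.
    A pair counts as a true positive if its IoU reaches the threshold and
    neither object has already been matched.
    """
    n_gt = len(iou)
    n_pred = len(iou[0]) if n_gt else 0
    # Candidate pairs above the threshold, best overlap first.
    candidates = sorted(
        (
            (iou[i][j], i, j)
            for i in range(n_gt)
            for j in range(n_pred)
            if iou[i][j] >= threshold
        ),
        reverse=True,
    )
    matched_gt, matched_pred = set(), set()
    for _, i, j in candidates:
        if i not in matched_gt and j not in matched_pred:
            matched_gt.add(i)
            matched_pred.add(j)
    tp = len(matched_gt)
    fp = n_pred - tp  # predictions with no matching annotation
    fn = n_gt - tp    # annotations with no matching prediction
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical example: 3 annotated and 3 predicted nuclei, 2 good matches.
example_iou = [
    [0.8, 0.1, 0.0],
    [0.0, 0.6, 0.0],
    [0.0, 0.0, 0.3],
]
score = f1_at_iou(example_iou)  # 2 TP, 1 FP, 1 FN -> 4 / 6
```

With a threshold of 0.5, one well-segmented nucleus rarely matches more than one annotation, so the greedy matching used here is usually adequate for a sketch.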

      All computational tools developed in this study are released as open-source, Python-based software.  

      Strengths:  

      The authors applied two-photon microscopy to whole-mount deep imaging of gastruloids, achieving in toto visualization at single-cell resolution. By combining spectral imaging with an unmixing algorithm, they successfully separated four fluorescent signals, enabling spatial analysis of gene expression patterns.  

      The entire computational workflow, from image pre-processing to segmentation with a custom-trained StarDist3D model and subsequent quantitative analysis, is made available as open-source software. In addition, user-friendly interfaces are provided through the open-source, community-driven Napari platform, facilitating interactive exploration and analysis.

      We thank the reviewer for this positive feedback.

      Weaknesses:  

      The computational module appears promising. However, the analysis pipeline has not been validated on datasets beyond those generated by the authors, making it difficult to assess its general applicability.

      We agree that applying our analysis pipeline to published datasets—particularly those acquired with different imaging systems—would be valuable. However, only a few high-resolution datasets of large organoid samples are publicly available, and most of these either lack multiple fluorescence channels or represent 3D hollow structures. Our computational pipeline consists of several independent modules: spectral filtering, dual-view registration, local contrast enhancement, 3D nuclei segmentation, image normalization based on a ubiquitous marker, and multiscale analysis of gene expression and morphometrics. We added the following sentences to the Discussion, lines 418 to 474, and completed the discussion on applicability with a table showing the purpose, requirements, applicability and limitations of each step of the processing and analysis pipeline.

      “Spectral filtering has already been applied in other systems (e.g. [7] and [8]), but is here extended to account for imaging depth-dependent apparent emission spectra of the different fluorophores. In our pipeline, we provide code to run spectral filtering on multichannel images, integrated in Python. In order to apply the spectral filtering algorithm utilized here, spectral patterns of each fluorophore need to be calibrated as a function of imaging depth, which depend on the specific emission windows and detector settings of the microscope”.
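
      As a rough illustration of what depth-dependent spectral filtering involves, the sketch below performs a plain least-squares linear unmixing in which the mixing matrix changes with the depth index. It is not the code released with the manuscript: the function name, the array layout, and the clipping of negative abundances are assumptions made for this example.

```python
import numpy as np

def unmix_depth_dependent(image, spectra_by_depth):
    """Least-squares linear unmixing with one mixing matrix per depth.

    image: detected channels, shape (Z, C, Y, X).
    spectra_by_depth: calibrated apparent spectra, shape (Z, C, F); column f
        is the signature of fluorophore f across the C detection channels
        at depth index z (hypothetical layout).
    Returns estimated fluorophore abundances, shape (Z, F, Y, X).
    """
    n_z, n_c, n_y, n_x = image.shape
    n_f = spectra_by_depth.shape[2]
    unmixed = np.empty((n_z, n_f, n_y, n_x), dtype=float)
    for z in range(n_z):
        pixels = image[z].reshape(n_c, -1)  # (C, Y*X)
        # Solve spectra @ abundances = pixels for every pixel of the plane.
        coeffs, *_ = np.linalg.lstsq(spectra_by_depth[z], pixels, rcond=None)
        # Crude simplification: negative abundances are clipped to zero.
        unmixed[z] = np.clip(coeffs, 0, None).reshape(n_f, n_y, n_x)
    return unmixed
```

      The per-plane loop is what makes the calibration as a function of imaging depth necessary: each plane needs its own measured spectra.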

      “Image normalization using a wavelength-dependent correction also requires calibration on a given imaging setup to measure the difference in signal decay among the different fluorophore species. To our knowledge, the calibration procedures for spectral filtering and our image-normalization approach have not been performed previously in 3D samples, which is why validation on published datasets is not readily possible. Nevertheless, they are described in detail in the Methods section, and the code used, from the calibration measurements to the corrected images, is available open-source at the Zenodo link in the manuscript”.

      Dual-view registration, local contrast enhancement, and multiscale analysis of gene expression and morphometrics are not limited to organoid data or our specific imaging modalities. To evaluate our 3D nuclei segmentation model, we tested it on diverse systems, including gastruloids stained with the nuclear marker Draq5 from Moos et al. [1]; breast cancer spheroids, primary ductal adenocarcinoma organoids, human colon organoids and HCT116 monolayers from Ong et al. [2]; and zebrafish tissues imaged by confocal microscopy from Li et al. [3]. These datasets were acquired using either light-sheet or confocal microscopy, with varying imaging parameters (e.g., objective lens, pixel size, staining method). The results have been added to the manuscript (Fig. S9b).

      Besides, the nuclei segmentation component lacks benchmarking against existing methods.  

      We agree with the reviewer that a benchmark against existing segmentation methods would be very useful. We tried different pre-trained models:

      CellPose, which we tested in a previous paper ([4]) and which performed poorly compared to our trained StarDist3D model.

      DeepStar3D ([2]) is only available in the software 3DCellScope. We could not benchmark the model on our data, because the free and accessible version of the software is limited to small datasets. An image of a single whole-mount gastruloid with one channel, with dimensions (347,467,477), was too large to be processed (see screenshot below). The segmentation model could not be extracted from the source code and tested externally because the trained DeepStar3D weights are encrypted.

      Author response image 1.

      Screenshot of the 3DCellScope software. We could not perform 3D nuclei segmentation of a whole-mount gastruloid because the image size was too large to be processed.

      AnyStar ([5]), which is a model trained from the StarDist3D architecture, did not perform well on our data because of the heterogeneous staining. Basic pre-processing such as median and Gaussian filtering did not improve the results and led to wrong segmentation of touching nuclei. AnyStar was shown to segment colon organoids well in Ong et al., 2025 ([2]), but the nuclei were more homogeneously stained. Our Hoechst staining displays bright chromatin spots that are incorrectly labeled as individual nuclei.

      Cellos ([6]), another model trained from StarDist3D, also did not perform well. The objects used for training and for validating the results are sparse and not touching, so the predicted segmentation has many false negatives, even when lowering the probability threshold to detect more objects. Additionally, the network was trained with an anisotropy of (9,1,1), based on images with low z resolution, so it performed poorly on almost isotropic images. Adapting our images to the network’s anisotropy results in an imprecise segmentation that cannot be used to measure 3D nuclei deformations.

      We tried both Cellos and AnyStar predictions on a gastruloid image from Fig. S2 of our main manuscript. The results have been added to the manuscript (Fig. S9b). Fig3 displays the results qualitatively, compared with our trained model Stardist-tapenade.

      Author response image 2.

      Qualitative comparison of two published segmentation models versus our model. We show one slice from the XY plane for simplicity. Segmentations are displayed with their contours only. (Top left) Gastruloid stained with Hoechst, image extracted from Fig S2 of our manuscript. (Top right) Same image overlaid with the prediction from the Cellos model, showing many false negatives. (Bottom left) Same image overlaid with the prediction from our Stardist-tapenade model. (Bottom right) Same image overlaid with the prediction from the AnyStar model; false positives are indicated with a red arrow.

      CellPose-SAM, which is a recent model built on the CellPose framework. The pre-trained model performs well on gastruloids imaged using our pipeline, and performs better than StarDist3D at segmenting elongated objects such as deformed nuclei. The performances are qualitatively compared in Fig. S9a and S10. We also demonstrate how using local contrast enhancement improves the results of CellPose-SAM (Fig. S10a), showing the versatility of the Tapenade pre-processing module. Tissue-scale, packing-related metrics from CellPose-SAM labels qualitatively match those from Stardist-tapenade, as shown in Fig.10c and d.

      Appraisal:  

      The authors set out to establish a quantitative imaging and analysis pipeline for gastruloids using dual-view two-photon microscopy, spectral unmixing, and a custom computational framework for 3D segmentation and gene expression analysis. This aim is largely achieved. The integration of experimental and computational modules enables high-resolution in toto imaging and robust quantitative analysis at the single-cell level. The data presented support the authors' conclusions regarding the ability to capture spatial patterns of gene expression and cellular morphology across developmental stages.  

      Impact and utility:  

      This work presents a compelling and broadly applicable methodological advance. The approach is particularly impactful for the developmental biology community, as it allows researchers to extract quantitative information from high-resolution images to better understand morphogenetic processes. The data are publicly available on Zenodo, and the software is released on GitHub, making them highly valuable resources for the community.  

      We thank the reviewer for this positive feedback.

      Reviewer #3 (Public review):

      Summary  

      The paper presents an imaging and analysis pipeline for whole-mount gastruloid imaging with two-photon microscopy. The presented pipeline includes spectral unmixing, registration, segmentation, and a wavelength-dependent intensity normalization step, followed by quantitative analysis of spatial gene expression patterns and nuclear morphometry on a tissue level. The utility of the approach is demonstrated by several experimental findings, such as establishing spatial correlations between local nuclear deformation and tissue density changes, as well as the radial distribution pattern of mesoderm markers. The pipeline is distributed as a Python package, notebooks, and multiple napari plugins.  

      Strengths  

      The paper is well-written with detailed methodological descriptions, which I think would make it a valuable reference for researchers performing similar volumetric tissue imaging experiments (gastruloids/organoids). The pipeline itself addresses many practical challenges, including resolution loss within tissue, registration of large volumes, nuclear segmentation, and intensity normalization. Especially the intensity decay measurements and wavelength-dependent intensity normalization approach using nuclear (Hoechst) signal as reference are very interesting and should be applicable to other imaging contexts. The morphometric analysis is equally well done, with the correlation between nuclear shape deformation and tissue density changes being an interesting finding. The paper is quite thorough in its technical description of the methods (which are a lot), and their experimental validation is appropriate. Finally, the provided code and napari plugins seem to be well done (I installed a selected list of the plugins and they ran without issues) and should be very helpful for the community.

      We thank the reviewer for his positive feedback and appreciation of our work.

      Weaknesses  

      I don't see any major weaknesses, and I would only have two issues that I think should be addressed in a revision:  

      (1) The demonstration notebooks lack accompanying sample datasets, preventing users from running them immediately and limiting the pipeline's accessibility. I would suggest to include (selective) demo data set that can be used to run the notebooks (e.g. for spectral unmixing) and or provide easily accessible demo input sample data for the napari plugins (I saw that there is some sample data for the processing plugin, so this maybe could already be used for the notebooks?).  

      We thank the reviewer for this relevant suggestion. The 7 notebooks were updated to automatically download sample tests. The different parts of the pipeline can now be run immediately:

      https://github.com/GuignardLab/tapenade/tree/chekcs_on_notebooks/src/tapenade/notebooks

      (2) The results for the morphometric analysis (Figure 4) seem to be only shown in lateral (xy) views without the corresponding axial (z) views. I would suggest adding this to the figure and showing the density/strain/angle distributions for those axial views as well.

      A morphometric analysis based on the axial views was added as Fig. S6a of the manuscript, complementary to the XY views.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):  

      In lines 64 and 65, it is mentioned that confocal and light-sheet microscopy remain limited to samples under 100μm in diameter. I would recommend revising this sentence. In the paper of Moos and colleagues (also cited in this manuscript; PMID: 38509326), gastruloid samples larger than 100μm are imaged in toto with an open-top dual-view and dual-illumination light-sheet microscope, and live cell behaviour is analysed. Another example, if considering also multi-angle systems, is the impressive work of McDole and colleagues (PMID: 30318151), in which one of the authors of this manuscript is a corresponding author. There, multi-angle light sheet microscopy is used for in toto imaging and reconstruction of post-implantation mouse development (samples much larger than 100μm). Some multi-sample imaging strategies have been developed for this type of imaging system, though not to the sample number extent allowed by the Viventis LS2 system or the Bruker TruLive3D imager, which have higher image quality limitations.

      We thank the reviewer for this remark. As reported in their paper, Moos et al. used dual-view light-sheet microscopy to image gastruloids, which are particularly dense and challenging tissues, with whole-mount samples of approximately 250 µm in diameter. Nevertheless, their image quality metric (DCT) shows a rapid twofold decrease within 50 µm depth (Extended Fig 5.h), whereas with two-photon microscopy, our image quality metric (FRC-QE) decreases by a factor of two over 150 µm in non-cleared samples (PBS) (see Fig. 2 c). While these two measurements (FRC-QE versus DCT) are not directly comparable, the observed difference reflects the superior depth performance of two-photon microscopy, owing in part to the use of non-descanned detectors. In our case, imaging was performed with Hoechst, a blue fluorophore suboptimal for deep imaging, whereas in the Moos dataset (Draq5, far-red), the configuration was more favorable for imaging in depth, which further supports our conclusion.

      In McDole et al., tissues reaching 250 µm were imaged from four views but, to our knowledge, the images do not reach a cellular-scale resolution in deeper layers that is compatible with cell segmentation.

      We replaced the sentence ‘However, light-sheet and confocal imaging approaches remain limited to relatively small organoids typically under 100 micrometers in diameter’ with the following (line 64):

      “While advances in light-sheet microscopy have extended imaging depth in organoids, maintaining high image quality throughout thick samples remains challenging. In practice, quantitative analyses are still largely restricted to organoids under roughly 100 µm in diameter”.

      It is worth mentioning that two-photon microscopes are much more widely available than light sheet microscopes, and light sheet systems with 2-photon excitation are even less accessible, which makes the described workflow of Gros and colleagues have a wide community interest.  

      We thank the reviewer for this remark, and added this suggestion at line 74:

      “Finally, two-photon microscopes are typically more accessible than light-sheet systems and allow for straightforward sample mounting, as they rely on procedures comparable to standard confocal imaging”.

      Reviewer #2 (Recommendations for the authors):  

      Suggestions:  

      A comparison with established pre-trained models for 3D organoid image segmentation (e.g., Cellos[1], AnyStar[2], and DeepStar3D[3], all based on StarDist3D) would help highlight the advantages of the authors' custom StarDist3D model, which has been specifically optimized for two-photon microscopy images.  

      (1)  Cellos: https://doi.org/10.1038/s41467-023-44162-6

      (2)  AnyStar: https://doi.org/10.1109/WACV57701.2024.00742

      (3)  DeepStar3D: https://doi.org/10.1038/s41592-025-02685-4

      We agree with the reviewer that a benchmark against existing segmentation methods is very useful. This is addressed in the revised version, as detailed above (Figure 3).

      Recommendations:  

      Please clarify the following point. In line 195, the authors state, "This allowed us to detect all mitotic nuclei in whole-mount samples for any stage and size." Does this mean that the custom-trained StarDist3D model can detect 100% of mitotic nuclei? It was not clear from the manuscript, figures, or videos how this was validated. Given the reported performance scores of the StarDist3D model for detecting all nuclei, claiming 100% detection of mitotic nuclei seems surprisingly high.

      We thank the reviewer for this comment. As detailed in the Methods section, the detection score reaches 82%, and only the complete pipeline (detection + minimal manual curation) allows us to detect all mitotic nuclei. To make this clearer, the following clarifications were added to the Results section:

      “To detect division events, we stained gastruloids with phosphohistone H3 (ph3) and trained a separate custom Stardist3D model using 3D annotations of nuclei expressing ph3 (see Methods III H). This model together allowed us to detect nearly all mitotic nuclei in whole-mount samples for any stage and size (Fig.3f and Suppl.Movie 4), and we used minimal manual curation to correct remaining errors.”

      Minor corrections:  

      It appears that Figures 4-6 are missing from the submitted version, but they can be found in the manuscript available on bioRxiv.

      We thank the reviewer for this remark. This was corrected immediately by adding Figures 4 to 6.

      In line 185, is the intended phrase "by comparing the 2D predictions and the 2D sliced annotated segments..."? 

      To gain some clarity, we replaced the initial sentence:

      “The f1 score obtained by comparing the 3D prediction and the 3D ground-truth is well approximated by the f1 score obtained by comparing the 2D annotations and the 2D sliced annotated segments, with at most a 5% difference between the two scores.” by

      “The f1 score obtained in 3D (3D prediction compared with the 3D ground-truth) is well approximated by the f1 score obtained in 2D (2D predictions compared with the 2D sliced annotated segments). The difference between the 2 scores was at most 5%.”
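
      For readers unfamiliar with the metric, the following sketch shows one common way to compute an F1 score at a given IoU threshold from two label images. It works for 2D slices and 3D volumes alike, which is what makes the 2D versus 3D comparison above possible. The greedy one-to-one matching and the function name are assumptions of this example; the authors' evaluation code may match objects differently.

```python
import numpy as np

def f1_at_iou(gt, pred, iou_threshold=0.5):
    """F1 score between two instance label images (0 = background).

    A ground-truth object counts as a true positive when its best
    still-unmatched predicted object overlaps it with IoU >= iou_threshold.
    """
    gt_ids = np.unique(gt[gt > 0])
    pred_ids = np.unique(pred[pred > 0])
    used_pred = set()
    tp = 0
    for g in gt_ids:
        g_mask = gt == g
        best_iou, best_p = 0.0, None
        # Only predicted labels that overlap this object can match it.
        for p in np.unique(pred[g_mask]):
            p = int(p)
            if p == 0 or p in used_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            iou = inter / union
            if iou > best_iou:
                best_iou, best_p = iou, p
        if best_p is not None and best_iou >= iou_threshold:
            used_pred.add(best_p)
            tp += 1
    fp = len(pred_ids) - tp  # predictions without a ground-truth match
    fn = len(gt_ids) - tp    # ground-truth objects that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

      Applying the same function to a 3D prediction and to its 2D slices at the annotated planes gives the two scores being compared.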

      Reviewer #3 (Recommendations for the authors):

      (1) How is the "local neighborhood volume" defined, and how was it computed?

      The reviewer is referring to this paragraph (the term is underlined):

      “To probe quantities related to the tissue structure at multiple scales, we smooth their signal with a Gaussian kernel of width σ, with σ defined as the spatial scale of interest. From the segmented nuclei instances, we compute 3D fields of cell density (number of cells per unit volume), nuclear volume fraction (ratio of nuclear volume to local neighborhood volume), and nuclear volume at multiple scales.”

      To improve clarity, the phrasing has been revised: the term local neighborhood volume has been replaced by local averaging volume, and a reference to the Methods section has been added.

      From the segmented nuclei instances, we compute 3D fields of cell density (number of cells per unit volume), nuclear volume fraction (ratio of space occupied by nuclear volume within the local averaging volume, as defined in the Methods III I), and nuclear volume at multiple scales.
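
      A minimal sketch of how such fields can be obtained from a label image by Gaussian smoothing is given below. It assumes isotropic voxels and a sigma expressed in voxels; the function name and the `voxel_volume` parameter are hypothetical, and the precise definition of the local averaging volume used by the authors is the one given in their Methods, not this one.

```python
import numpy as np
from scipy.ndimage import center_of_mass, gaussian_filter

def multiscale_fields(labels, sigma, voxel_volume=1.0):
    """Return (cell density, nuclear volume fraction) smoothed at scale sigma.

    labels: integer instance segmentation, 0 = background.
    sigma: Gaussian width in voxels (the spatial scale of interest).
    voxel_volume: physical volume of one voxel (hypothetical parameter).
    """
    ids = np.unique(labels[labels > 0])
    points = np.zeros(labels.shape, dtype=float)
    if ids.size:
        # Put one count at the voxel nearest to each nucleus centroid.
        for com in center_of_mass(np.ones(labels.shape), labels, ids):
            points[tuple(int(round(c)) for c in com)] += 1.0
    # The Gaussian kernel sums to one, so smoothing spreads each count
    # over a neighborhood of size ~sigma: number of cells per unit volume.
    density = gaussian_filter(points, sigma, mode="constant") / voxel_volume
    # Smoothing the binary nuclear mask gives the locally averaged
    # fraction of space occupied by nuclei.
    occupied = (labels > 0).astype(float)
    volume_fraction = gaussian_filter(occupied, sigma, mode="constant")
    return density, volume_fraction
```

      Calling the function with several values of `sigma` yields the same quantities at multiple spatial scales.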

      (2) In the definition of inertia tensor (18), isn't the inner part normally defined in the reversed way (delta_i,j - ...)?

      We thank the reviewer for noticing this error, which we fixed in the manuscript.
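
      For reference, the textbook form that the reviewer's remark appears to point to is I_ij = Σ_k (|r_k|² δ_ij − r_k,i r_k,j), with r_k the position of voxel k relative to the centroid. The sketch below computes it for a set of unit-mass points; it illustrates that standard definition only and does not reproduce equation (18) of the manuscript.

```python
import numpy as np

def inertia_tensor(coords):
    """Inertia tensor of unit-mass points, shape (N, D) -> (D, D).

    I_ij = sum_k (|r_k|^2 * delta_ij - r_k,i * r_k,j),
    with r_k measured from the centroid of the points.
    """
    r = coords - coords.mean(axis=0)
    total_sq = (r ** 2).sum()  # sum_k |r_k|^2
    return total_sq * np.eye(coords.shape[1]) - r.T @ r
```

      With this sign convention the tensor is positive semi-definite, and the eigenvector with the smallest eigenvalue points along the long axis of an elongated nucleus.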

      (3) For intensity normalization, the paper uses the Hoechst signal density as a proxy for a ubiquitous nuclei signal. I would assume that this is problematic for, e.g., dividing cells (which would overestimate it). Would using the average Hoechst signal per nucleus mask (as segmentation is available) be a better proxy?

      We agree that this idea is appealing if one assumes a clear relationship between nuclear volume and Hoechst intensity. However, since cell and nuclear volumes vary substantially with differentiation state (see Fig. 4), such a normalization approach would introduce additional biases at large spatial scales. We believe that the most robust improvement would instead consist in masking dividing cells during the normalization procedure, as these events could be detected and excluded from the computation.

      Nonetheless, we believe the method proposed by the reviewer could prove relevant for other types of data, so we will implement this recommendation in the code available in the Tapenade package.
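
      The masking idea discussed above can be sketched with a normalized convolution: the reference signal is averaged locally while voxels flagged as dividing cells get zero weight. This is only an illustration under simple assumptions (a single smoothing scale, no wavelength-dependent correction); the function and parameter names are hypothetical and are not taken from the Tapenade package.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalize_by_reference(signal, reference, sigma, exclude=None, eps=1e-6):
    """Divide `signal` by a locally averaged reference channel.

    exclude: optional boolean mask of voxels (e.g. detected dividing
        cells) that are ignored when averaging the reference.
    """
    weights = np.ones(reference.shape, dtype=float)
    if exclude is not None:
        weights[exclude] = 0.0
    # Normalized convolution: weighted local mean of the reference.
    num = gaussian_filter(reference * weights, sigma)
    den = gaussian_filter(weights, sigma)
    local_ref = num / np.maximum(den, eps)
    return signal / np.maximum(local_ref, eps)
```

      Without the `exclude` mask, a bright mitotic nucleus inflates the local reference and lowers the normalized signal around it, which is the bias the reviewer points out.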

      (4) Figures 4-6 were part of the Supplementary Material, but should be included in the main text?

      We thank the reviewer for this remark. This was corrected immediately by adding Figures 4-6.

      We also noticed a missing reference to Fig. S3 in the main text, so we added lines 302 to 307 to comment on the wavelength-dependency of the normalization method. We improved the description of Fig.6, which lacked clarity (line 316 to 321, line 327).

      (1) Moos, F., Suppinger, S., de Medeiros, G., Oost, K.C., Boni, A., Rémy, C., Weevers, S.L., Tsiairis, C., Strnad, P. and Liberali, P., 2024. Open-top multisample dual-view light-sheet microscope for live imaging of large multicellular systems. Nature Methods, 21(5), pp.798-803.

      (2) Ong, H. T.; Karatas, E.; Poquillon, T.; Grenci, G.; Furlan, A.; Dilasser, F.; Mohamad Raffi, S. B.; Blanc, D.; Drimaracci, E.; Mikec, D.; Galisot, G.; Johnson, B. A.; Liu, A. Z.; Thiel, C.; Ullrich, O.; OrgaRES Consortium; Racine, V.; Beghin, A. (2025). Digitalized organoids: integrated pipeline for high-speed 3D analysis of organoid structures using multilevel segmentation and cellular topology.  Nature Methods, 22(6), pp.1343-1354

      (3) Li, L., Wu, L., Chen, A., Delp, E.J. and Umulis, D.M., 2023. 3D nuclei segmentation for multi-cellular quantification of zebrafish embryos using NISNet3D. Electronic Imaging, 35, pp.1-9.

      (4) Vanaret, J., Dupuis, V., Lenne, P. F., Richard, F., Tlili, S., & Roudot, P. (2023). A detector-independent quality score for cell segmentation without ground truth in 3D live fluorescence microscopy. IEEE Journal of Selected Topics in Quantum Electronics, 29(4:Biophotonics), 1-12.

      (5) Dey, N., Abulnaga, M., Billot, B., Turk, E. A., Grant, E., Dalca, A. V., & Golland, P. (2024). AnyStar: Domain randomized universal star-convex 3D instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 7593-7603).

      (6) Mukashyaka, P., Kumar, P., Mellert, D. J., Nicholas, S., Noorbakhsh, J., Brugiolo, M., ... & Chuang, J. H. (2023). High-throughput deconvolution of 3D organoid dynamics at cellular resolution for cancer pharmacology with Cellos. Nature Communications, 14(1), 8406.

      (7) Rakhymzhan, A., Leben, R., Zimmermann, H., Günther, R., Mex, P., Reismann, D., ... & Niesner, R. A. (2017). Synergistic strategy for multicolor two-photon microscopy: application to the analysis of germinal center reactions in vivo. Scientific reports, 7(1), 7101.

      (8) Dunsing, V., Petrich, A., & Chiantia, S. (2021). Multicolor fluorescence fluctuation spectroscopy in living cells via spectral detection. Elife, 10, e69687.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review):

      We thank Reviewer #1 for their thoughtful and constructive feedback. We found the suggestions particularly helpful in refining the conceptual framework and clarifying key aspects of our interpretations.

      Summary:

      This paper investigates the potential link between amygdala volume and social tolerance in multiple macaque species. Through a comparative lens, the authors considered tolerance grade, species, age, sex, and other factors that may contribute to differing brain volumes. They found that amygdala, but not hippocampal, volume differed across tolerance grades, such that high-tolerance species showed larger amygdala than low-tolerance species of macaques. They also found that less tolerant species exhibited increases in amygdala volume with age, while more tolerant species showed the opposite. Given their wide range of species with varied biological and ecological factors, the authors' findings provide new evidence for changes in amygdala volume in relation to social tolerance grades. Contributions from these findings will greatly benefit future efforts in the field to characterize brain regions critical for social and emotional processing across species.

      Strengths:

      (1) This study demonstrates a concerted and impressive effort to comparatively examine neuroanatomical contributions to sociality in monkeys. The authors impressively collected samples from 12 macaque species with multiple datapoints across species age, sex, and ecological factors. Species from all four social tolerance grades were present. Further, the age range of the animals is noteworthy, particularly the inclusion of individuals over 20 years old - an age that is rare in the wild but more common in captive settings. 

      (2) This work is the first to report neuroanatomical correlates of social tolerance grade in macaques in one coherent study. Given the prevalence of macaques as a model of social neuroscience, considerations of how socio-cognitive demands are impacted by the amygdala are highly important. The authors' findings will certainly inform future studies on this topic.

      (3) The methodology and supplemental figures for acquiring brain MRI images are well detailed. Clear information on these parameters is crucial for future comparative interpretations of sociality and brain volume, and the authors do an excellent job of describing this process in full.

      Weaknesses:

      (1) The nature vs. nurture distinction is an important one, but it may be difficult to draw conclusions about "nature" in this case, given that only two data points (from grades 3 and 4) come from animals under one year of age (Method Figure 1D). Most brains were collected after substantial social exposure-typically post age 1 or 1.5-so the data may better reflect developmental changes due to early life experience rather than innate wiring. It might be helpful to frame the findings more clearly in terms of how early experiences shape development over time, rather than as a nature vs. nurture dichotomy.

      We agree with the reviewer that presenting our findings through a strict nature vs. nurture dichotomy was potentially misleading. We have revised the introduction and the discussion (e.g. lines 85-95 and 363-365) to clarify that we examined how neurodevelopmental trajectories differ across social grades, with the caveat that very young individuals are absent from our samples. We now explicitly mention that our results may reflect both early species-typical biases and experience-dependent maturation.

      We positioned our study on social tolerance in a comparative neuroscience framework and introduced a tentative working model that articulates behavioral traits, cognitive dimensions, and their potential subcortical neural substrates.

      Drawing upon 18 behavioral traits identified in Thierry’s comparative analyses (Thierry, 2021, 2007), we organize these traits into three core dimensions: socio-cognitive demands, behavioral inhibition, and the predictability of the social environment (Table 1). This conceptualization does not aim to redefine social tolerance itself, but rather to provide a structured basis for testing neuroanatomical hypotheses related to social style variability. It echoes recent efforts to bridge behavioral ecology and cognitive neuroscience by linking specific mental abilities – such as executive functions or metacognition – with distinct prefrontal regions shaped by social and ecological pressures (Bouret et al., 2024).

      “Cross-fostering experiments (De Waal and Johanowicz, 1993), along with our own results, suggest that social tolerance grades reflect both early, possibly innate predispositions and later environmental shaping”.

      (2) It would be valuable to clarify how the older individuals, especially those 20+ years old, may have influenced the observed age-related correlations (e.g., positive in grades 1-2, negative in grades 3-4). Since primates show well-documented signs of aging, some discussion of the potential contribution of advanced age to the results could strengthen the interpretation.

      We thank the reviewer for highlighting this important point. In our dataset, younger and older subjects are underrepresented, but they are distributed across all subgroups. We therefore do not think that they could drive the interaction effect we are reporting. In our sample, amygdala volume tended to increase with age in intolerant species and decrease in tolerant species. We included a new analysis (Figure 4) that provides a clearer assessment of when social grades 1 vs 4 differed in terms of amygdala and hippocampus volume. While our model accounts for age continuously, we agree that age-related variation deserves cautious interpretation and requires longitudinal designs in future studies.

      We also added the following statements to the discussion (lines 386-391):

      “Due to a limited sample size of our study, this crossing trend, already accounted for by our continuous age model, should be further investigated. These results call for cautious interpretation of age-related variation and further emphasize the importance of longitudinal studies integrating both behavioral, cognitive and anatomical data in non-human primates, which would help to better understand the link between social environment and brain development (Song et al., 2021)”.
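
      To make the notion of a crossing age-by-grade trend concrete, the sketch below fits a generic linear model with an interaction term, volume ~ age × tolerant, by ordinary least squares. It is not the authors' statistical model, which is not described in this response; the binary coding of tolerance and all variable names are assumptions made for illustration.

```python
import numpy as np

def fit_age_by_grade(age, tolerant, volume):
    """OLS fit of volume = b0 + b1*age + b2*tolerant + b3*age*tolerant.

    tolerant is coded 0 (intolerant) or 1 (tolerant).
    Returns the coefficients, the age slope in each group, and the age
    at which the two fitted lines cross (nan if the slopes are equal).
    """
    age = np.asarray(age, dtype=float)
    tolerant = np.asarray(tolerant, dtype=float)
    X = np.column_stack([np.ones_like(age), age, tolerant, age * tolerant])
    beta, *_ = np.linalg.lstsq(X, np.asarray(volume, dtype=float), rcond=None)
    slope_intolerant = beta[1]
    slope_tolerant = beta[1] + beta[3]
    crossing_age = -beta[2] / beta[3] if beta[3] != 0 else float("nan")
    return beta, slope_intolerant, slope_tolerant, crossing_age
```

      A positive slope in one group and a negative slope in the other corresponds to the crossing pattern described above; with few very young or very old individuals, the estimated crossing age is poorly constrained.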

      (3) The authors categorize the behavioral traits previously described in Thierry (2021) into 3 self-defined cognitive requirements, however, they do not discuss under what conditions specific traits were assigned to categories or justify why these cognitive requirements were chosen. It is not fully clear from Thierry (2021) alone how each trait would align with the authors' categories. Given that these traits/categories are drawn on for their neuroanatomical hypotheses, it is important that the authors clarify this. It would be helpful to include a table with all behavioral traits with their respective categories, and explain their reasoning for selecting each cognitive requirement category.

      Thank you for this important suggestion. We have extensively revised the introduction to explain how we derived the three cognitive dimensions (socio-cognitive demands, behavioral inhibition, and predictability of the social environment) from the scientific literature. We now provide a complete overview of the 18 behavioral traits described in Thierry’s framework and their cognitive classification in a dedicated table, along with hypothesized neural correlates. We have also listed the traits that were not classified in our framework, with a short justification for each. We believe this addition significantly improves the transparency and intelligibility of our conceptual approach.

      “The concept of social tolerance, central to this comparative approach, has sometimes been used in a vague or unidimensional way. As Bernard Thierry (2021) pointed out, the notion was initially constructed around variations in agonistic relationships – dominance, aggressiveness, appeasement or reconciliation behaviors – before being expanded to include affiliative behaviors, allomaternal care or male–male interactions (Thierry, 2021). These traits do not necessarily align along a single hierarchical axis but rather reflect a multidimensional complexity of social style, in which each trait may have co-evolved with others (Thierry, 2021, 2000; Thierry et al., 2004). Moreover, the lack of a standardized scientific definition has sometimes led to labeling species as “tolerant” or “intolerant” without explicit criteria (Gumert and Ho, 2008; Patzelt et al., 2014). These behavioral differences are characterized by different styles of dominance (Balasubramaniam et al., 2012), severity of agonistic interactions (Duboscq et al., 2014), nepotism (Berman and Thierry, 2010; Duboscq et al., 2013; Sueur et al., 2011) and submission signals (De Waal and Luttrell, 1985; Rincon et al., 2023), among the 18 covariant behavioral traits described in Thierry's classification of social tolerance (Thierry, 2021, 2017, 2000)”.

      “To ground the investigation of social tolerance in a comparative neuroanatomical framework, we introduce a tentative working model that articulates behavioral traits, cognitive dimensions, and their potential subcortical neural substrates. Drawing upon 18 behavioral traits identified in Thierry’s comparative analyses (Thierry, 2021, 2007), we organized these traits into three core dimensions: socio-cognitive demands, behavioral inhibition, and the predictability of the social environment (Table 1). This conceptualization does not aim to redefine social tolerance itself, but rather to provide a structured basis for testing neuroanatomical hypotheses related to social style variability. It echoes recent efforts to bridge behavioral ecology and cognitive neuroscience by linking specific mental abilities – such as executive functions or metacognition – with distinct prefrontal regions shaped by social and ecological pressures (Bouret et al., 2024; Testard 2022)”.

      (4) One of the main distinctions the authors make between high social tolerance species and low tolerance species is the level of complex socio-cognitive demands, with more tolerant species experiencing the highest demands. However, socio-cognitive demands can also be very complex for less tolerant species because they need to strategically balance behaviors in the presence of others. The relationships between socio-cognitive demands and social tolerance grades should be viewed in a more nuanced and context-specific manner. 

      We fully agree. We did not mean that intolerant species live in a ‘simple’ social environment, but that the social environment of more tolerant species is markedly more demanding. Evidence supporting this statement includes their more efficient social networks (Sueur et al., 2011) and more complex communicative skills; for example, tolerant macaques displayed higher levels of vocal diversity and flexibility than intolerant macaques in social situations with high uncertainty (Rebout et al., 2020).

      In the revised version (lines 106-122), we now highlight that socio-cognitive challenges arise across the tolerance spectrum, including in less tolerant species where strategic navigation of rigid hierarchies and risk-prone interactions is required. We hope that this addition offers a more balanced and nuanced framing of socio-cognitive demands across macaque societies.

      “The first category, socio-cognitive demands, refers to the cognitive resources needed to process, monitor, and flexibly adapt to complex social environments. Linking those parameters to neurological data is at the core of the social brain theory to explain the expansion of the neocortex in primates (Dunbar). Macaque social systems require advanced abilities in social memory, perspective-taking, and partner evaluation (Freeberg et al., 2012). This is particularly true in tolerant species, where the increased frequency and diversity of interactions may amplify the demands on cognitive tracking and flexibility. Tolerant macaque species typically live in larger groups with high interaction frequencies, low nepotism, and a wider range of affiliative and cooperative behaviors, including reconciliation, coalition-building, and signal flexibility (REF). Tolerant macaque species also exhibit a more diverse and flexible vocal and facial repertoire than intolerant ones, which may help reduce ambiguity and facilitate coordination in dense social networks (Rincon et al., 2023; Scopa and Palagi, 2016; Rebout 2020). Experimental studies further show that macaques can use facial expressions to anticipate the likely outcomes of social interactions, suggesting a predictive function of facial signals in managing uncertainty (Micheletta et al., 2012; Waller et al., 2016). Even within less tolerant species, like M. mulatta, individual variation in facial expressivity has been linked to increased centrality in social networks and greater group cohesion, pointing to the adaptive value of expressive signaling across social styles (Whitehouse et al., 2024)”.

      (5) While the limitations section touches on species-related considerations, the issue of individual variability within species remains important. Given that amygdala volume can be influenced by factors such as social rank and broader life experience, it might be useful to further emphasize that these factors could introduce meaningful variation across individuals. This doesn't detract from the current findings but highlights the importance of considering life history and context when interpreting subcortical volumes-particularly in future studies.

      We have now emphasized this point in the limitations section (lines 441-456). While our current dataset does not allow us to fully control for individual-level variables across all collection centers, we recognize that factors such as rank, social exposure, and individual life history may influence subcortical volumes.

      “Although we explained some interspecies variability, adding subjects to our database will increase statistical power and will help addressing potential confounding factors such as age or sex in future studies. One will benefit from additional information about each subject. While considered in our modelling, the social living and husbandry conditions of the individuals in our dataset remain poorly documented. The living environment has been considered, and the size of social groups for certain individuals, particularly for individuals from the CdP, have been recorded. However, these social characteristics have not been determined for all individuals in the dataset. As previously stated, the social environment has a significant impact on the volumetry of certain regions. Furthermore, there is a lack of data regarding the hierarchy of the subjects under study and the stress they experience in accordance with their hierarchical rank and predictability of social outcomes position (McCowan et al., 2022)”. 

      Reviewer #2 (Public review):

      We thank Reviewer #2 for their thoughtful remarks and for acknowledging the value of our comparative approach despite its inherent constraints.

      Summary:

      This comparative study of macaque species and the type of social interaction is both ambitious and inevitably comes with a lot of caveats. The overall conclusion is that more intolerant species have a larger amygdala. There are also opposing development profiles regarding amygdala volume depending on whether it is a tolerant or intolerant species.

      To achieve any sort of power, they have combined data from 4 centres, which have all used different scanning methods, and there are some resolution differences. The authors have also had to group species into 4 classifications - again to assist with any generalisations and power. They have focused on the volumes of two structures, the amygdala and the hippocampus, which seems appropriate. Neither structure is homogeneous and so it may well be that a targeted focus on specific nuclei or subfields would help (the authors may well do this next) - but as the variables would only increase further along with the number of potential comparisons, alongside small group numbers, it seems only prudent to treat these findings as preliminary. That said, it is highly unlikely that large numbers of macaque brains will become available in the near future.

      This introduction is by way of saying that the study achieves what it sets out to do, but there are many reasons to see this study as preliminary. The main message seems to be twofold: (1) that more intolerant species have relatively larger amygdalae, and (2) that with development, there is an opposite pattern of volume change (increasing with age in intolerant species and decreasing with age in tolerant species). Finding 1 is the opposite of that predicted in Table 1 - this is fine, but it should be made clearer in the Discussion that this is the case, otherwise the reader may feel confused. As I read it, the authors have switched their prediction in the Discussion, which feels uncomfortable. 

      We thank the reviewer for this important observation. In the original version, Table 1 presented simplified direct predictions linking social tolerance grades to amygdala and hippocampus volumes. We recognize that this formulation may have created confusion. In the revised manuscript, we have thoroughly restructured the table and its accompanying rationale. Table 1 now better reflects our conceptual framework grounded in three cognitive dimensions (sociocognitive demands, behavioral inhibition, and social predictability), each linked to behavioral traits and associated neural hypotheses based on published literature. This updated framework, detailed in lines 144-169 of the introduction, provides a more nuanced basis for interpreting our results and avoids the inconsistencies previously noted. The Discussion was also revised accordingly (lines 329-255) to clarify where our findings diverge from the original predictions and to explore alternative explanations based on social complexity. Rather than directly predicting amygdala size from social tolerance grades, we propose that variation in volume emerges from differing combinations of cognitive pressures across species.

      It is inevitable that the data in a study of this complexity are all too prone to post hoc considerations, to which the authors indulge. In the case of Grade 1 species, the individuals have a lot to learn, especially if they are not top of the hierarchy, but at the same time, there are fewer individuals in the troop, making predictions very tricky. As noted above, I am concerned by the seemingly opposite predictions in Table 1 and those in the Discussion regarding tolerance and amygdala volume. (It may be that the predictions in Table 1 are the opposite of how I read them, in which case the Table and preceding text need to align.)

      To facilitate the interpretation of our Bayesian modelling, we have selected a more focused ROI in our automatic segmentation procedure for the hippocampus (from Hippocampal Formation to Hippocampus) and have added a new analysis (Figure 4) that properly tests whether the hippocampus differs significantly between species from social grades 1 and 4. The present analysis found that this is the case in adult monkeys. This is therefore consistent with our hypothesis that amygdala volumes are principally explained by heightened sociocognitive demands in more tolerant species.

      We also acknowledge the reviewer’s concerns about the limited generalizability due to our sample. The challenges of comparative neuroimaging in non-human primates—especially when using post-mortem datasets—are substantial. Given the ethical constraints and the rarity of available specimens, increasing the number of individuals or species is not feasible in the short term. However, we have made all data and code publicly available and clearly stated the limitations of our sample in the manuscript. Despite these constraints, we believe our dataset offers an unprecedented comparative perspective, particularly due to the inclusion of rare and tolerant species such as M. tonkeana, M. nigra, and M. thibetana, which have never been included in structural MRI studies before. We hope this effort will serve as a foundation for future collaborative initiatives in primate comparative neuroscience.

      Reviewer #3 (Public review):

      We thank Reviewer #3 for their thoughtful and detailed review. Their comments helped us refine both the conceptual and interpretative aspects of the manuscript. We respond point by point below.

      Summary:

      In this study, the authors were looking at neurocorrelates of behavioural differences within the genus Macaca. To do so, they engaged in real-world dissection of dead animals (unconnected to the present study) coming from a range of different institutions. They subsequently compare different brain areas, here the amygdala and the hippocampus, across species. Crucially, these species have been sorted according to different levels of social tolerance grades (from 1 to 4). 12 species are represented across 42 individuals. The sampling process has weaknesses ("only half" of the species contained by the genus, and Macaca mulatta, the rhesus macaque, representing 13 of the total number of individuals), but also strengths (the species are decently well represented across the 4 grades) for the given purpose and for the amount of work required here. I will not judge the dissection process as I am not a neuroanatomist, and I will assume that the different interventions do not alter volume in any significant ways / or that the different conditions in which the bodies were kept led to the documented differences across species. 

      25 brains were extracted by the authors themselves, who are highly experienced with this procedure. Overall, we believe that dissection protocols did not alter the total brain volume. Despite our expertise, we had some difficulty avoiding damage to the cerebellum. Therefore, this region was not included in our analysis. We also noted that this brain region was damaged or absent in the Prime-DE dataset.

      Several protocols were used to prepare and store tissue. It could have impacted the total brain volume.

      We agree that differences in tissue preparation and storage could potentially affect total brain volume. Therefore, we explicitly included the main sample preparation variable — whether brains had been previously frozen — as a covariate in our model. This factor did not explain our results. Moreover, Figures 1D and 1I display the frozen status and its correlation with the amygdala and hippocampus ratios, respectively. Figure 2 shows the parameters of the model and the posterior distributions for the frozen status and total brain volume effects.

      There are two main results of the study. First, in line with their predictions, the authors find that more tolerant macaque species have larger amygdala, compared to the hippocampus, which remains undifferentiated across species. Second, they also identify developmental effects, although with different trends: in tolerant species, the amygdala relative volume decreases across the lifespan, while in intolerant species, the contrary occurs. The results look quite strong, although the authors could bring up some more clarity in their replies regarding the data they are working with. From one figure to the other, we switch from model-calculated ratio to model-predicted volume. Note that if one was to sample a brain at age 20 in all the grades according to the model-predicted volumes, it would not seem that the difference for amygdala would differ much across grades, mostly driven with Grade 1 being smaller (in line with the main result), but then with Grade 2 bigger than Grade 3, and then Grade 4 bigger once again, but not that different from Grade 2.

      Overall, despite this, I think the results are pretty strong, the correlations are not to be contested, but I also wonder about their real meaning and implications. This can be seen under 3 possible aspects:

      (1)  Classification of the social grade

      While it may be familiar to readers of Thierry and collaborators, or to researchers of the macaque world, there is no list included of the 18 behavioral traits used to define the three main cognitive requirements (socio-cognitive demands, predictability of the environment, inhibitory control). It would be important to know which of the different traits correspond to what, whether they overlap, and crucially, how they are realized in the 12 study species, as there could be drastic differences from one species to the next. For now, we can only see from Table S1 where the species align to, but it would be a good addition to have them individually matched to, if not the 18 behavioral traits, at least the 3 different broad categories of cognitive requirements.

      We fully agree with this observation. In the revised version of the manuscript, we now include a detailed conceptual table listing all 18 behavioral traits from Thierry’s framework. For each trait, we provide its underlying social implications, its associated cognitive dimension (when applicable), and the hypothesized neural correlate. 

      While some traits could arguably have been classified under several cognitive dimensions (e.g. reconciliation rate), we preferred to assign each to a unique dimension for clarity. Additionally, the introduction (lines 95-169 and Table 1) now explains how each trait was evaluated based on existing literature and assigned to one of the three proposed cognitive categories: socio-cognitive demands, behavioral inhibition, or social unpredictability. This structure offers a clearer and more transparent basis for the neuroanatomical hypotheses tested in the study.

      “Navigating social life in primate societies requires substantial cognitive resources: individuals must not only track multiple relationships, but also regulate their own behavior, anticipate others’ reactions, and adapt flexibly to changing social contexts. Taking advantage of databases of magnetic resonance imaging (MRI) structural scans, we conducted the first comparative study integrating neuroanatomical data and social behavioral data from closely related primate species of the same genus to address the following questions: To what extent can differences in volumes of subcortical brain structures be correlated with varying degrees of social tolerance? Additionally, we explored whether these dispositions reflect primarily innate features, shaped by evolutionary processes, or acquired through socialization within more or less tolerant social environments”.

      “The first category, socio-cognitive demands, refers to the cognitive resources needed to process, monitor, and flexibly adapt to complex social environments. Linking those parameters to neurological data is at the core of the social brain theory to explain the expansion of the neocortex in primates (Dunbar). Macaque social systems require advanced abilities in social memory, perspective-taking, and partner evaluation (Freeberg et al., 2012). This is particularly true in tolerant species, where the increased frequency and diversity of interactions may amplify the demands on cognitive tracking and flexibility. Tolerant macaque species typically live in larger groups with high interaction frequencies, low nepotism, and a wider range of affiliative and cooperative behaviors, including reconciliation, coalition-building, and signal flexibility (REF). Tolerant macaque species also exhibit a more diverse and flexible vocal and facial repertoire than intolerant ones, which may help reduce ambiguity and facilitate coordination in dense social networks (Rincon et al., 2023; Scopa and Palagi, 2016; Rebout 2020). Experimental studies further show that macaques can use facial expressions to anticipate the likely outcomes of social interactions, suggesting a predictive function of facial signals in managing uncertainty (Micheletta et al., 2012; Waller et al., 2016). Even within less tolerant species, like M. mulatta, individual variation in facial expressivity has been linked to increased centrality in social networks and greater group cohesion, pointing to the adaptive value of expressive signaling across social styles (Whitehouse et al., 2024)”.

      “The second category, inhibitory control, includes traits that involve regulating impulsivity, aggression, or inappropriate responses during social interactions. Tolerant macaques have been shown to perform better in tasks requiring behavioral inhibition and also express lower aggression and emotional reactivity in both experimental and natural contexts (Joly et al., 2017; Loyant et al., 2023). These features point to stronger self-regulation capacities in species with egalitarian or less rigid hierarchies. More broadly, inhibition – especially in its strategic form (self-control) – has been proposed to play a key role in the cohesion of stable social groups. Comparative analyses across mammals suggest that this capacity has evolved primarily in anthropoid primates, where social bonds require individuals to suppress immediate impulses in favour of longer-term group stability (Dunbar and Shultz, 2025). This view echoes the conjecture of Passingham and Wise (2012), who proposed that the emergence of prefrontal area BA10 in anthropoids enabled the kind of behavioural flexibility needed to navigate complex social environments (Passingham et al., 2012)”.

      “The third category, social environment predictability, reflects how structured and foreseeable social interactions are within a given society. In tolerant species, social interactions are more fluid and less kin-biased, leading to greater contextual variation and role flexibility, which likely imply a sustained level of social awareness. In fact, as suggested by recent research, such social uncertainty and prolonged incentives are reflected by stress-related physiology: tolerant macaques such as M. tonkeana display higher basal cortisol levels, which may be indicative of a chronic mobilization of attentional and regulatory resources to navigate less predictable social environments (Sadoughi et al., 2021)”.

      “Each behavioral trait was individually evaluated based on existing empirical literature regarding the types of cognitive operations it likely involves. When a primary cognitive dimension could be identified, the trait was assigned accordingly. However, some behaviors – such as maternal protection, allomaternal care, or delayed male dispersal – do not map neatly onto a single cognitive process. These traits likely emerge from complex configurations of affective and social-motivational systems, and may be better understood through frameworks such as attachment theory (Suomi, 2008), which emphasizes the integration of social bonding, emotional regulation, and contextual plasticity. While these dimensions fall beyond the scope of the present framework, they offer promising directions for future research, particularly in relation to the hypothalamic and limbic substrates of social and reproductive behavior”.

      “Rather than forcing these traits into potentially misleading categories, we chose to leave them unclassified within our current cognitive framework. This decision reflects both a commitment to conceptual clarity and the recognition that some behaviors emerge from a convergence of cognitive demands that cannot be neatly isolated. This tripartite framework, leaving aside reproductive-related traits, provides a structured lens through which to link behavioral diversity to specific cognitive processes and generate neuroanatomical predictions”.

      (2) Issue of nature vs nurture

      Another way to look at the debate between nature vs nurture is to look at phylogeny. For now, there is no phylogenetic tree that shows where the different grades are realized. For example, it would be illuminating to know whether more related species, independently of grades, have similar amygdala or hippocampus sizes. Then the question will go to the details, and whether the grades are realized in particular phylogenetic subdivisions. This would go in line with the general point of the authors that there could be general species differences.

      As pointed out by Thierry and collaborators, the social tolerance concept is already grounded in a phylogenetic framework: social tolerance grades match the phylogenetic tree of these macaque species, suggesting a biological basis for these behavioral observations. Given the modest sample size and uneven species representation, we opted not to adopt tools such as Phylogenetic Generalized Least Squares (PGLS) in our analysis. Our primary aim in this study was to explore neuroanatomical variation as a function of social traits, not to perform a phylogenetic comparative analysis per se. That said, we now explicitly acknowledge this limitation in the Discussion and indicate that future work using larger datasets and phylogenetic methods will be essential to disentangle social effects from evolutionary relatedness. We hope that making our dataset openly available will facilitate such future analyses.

      With respect to nurture, it is likely more complicated: one needs to take into account the idiosyncrasies of the life of the individual. For example, some of the cited literature in humans or macaques suggests that the bigger the social network, the bigger the brain structure considered. Right, but this finding is at the individual level with a documented life history. Do we have any of this information for any of the individuals considered (this is likely out of the scope of this paper to look at this, especially for individuals that did not originate from CdP)?

      We appreciate this insightful observation. Indeed, findings from studies in humans and nonhuman primates showing associations between brain structure and social network size typically rely on detailed life history and behavioral data at the individual level. Unfortunately, such fine-grained information was not consistently available across our entire sample. While some individuals from the Centre de Primatologie (CdP) were housed in known group compositions and social settings, we did not have access to longitudinal social data (such as rank, grooming rates, or network centrality) that would allow for robust individual-level analyses. We now acknowledge this limitation more clearly in the Discussion (lines 436-443), and we fully agree that future work combining neuroimaging with systematic behavioral monitoring will be necessary to explore how species-level effects interact with individual social experience.

      (3) Issue of the discussion of the amygdala's function

      The entire discussion/goal of the paper, states that the amygdala is connected to social life. Yet, before being a "social center", the amygdala has been connected to the emotional life of humans and non-humans alike. The authors state L333/34 that "These findings challenge conventional expectations of the amygdala's primary involvement in emotional processes and highlight the complexity of the amygdala's role in social cognition". First, there is no dichotomy between social cognition and emotion. Emotion is part of social cognition (unless we and macaques are robots). Second, there is nowhere in the paper a demonstration that the differences highlighted here are connected to social cognition differences per se. For example, the authors have not tested, say, if grade 4 species are more afraid of snakes than grade 1 species. If so, one could predict they would also have a bigger amygdala, and they would probably also find it in the model. My point is not that the authors should try to correlate any kind of potential aspect that has been connected to the amygdala in the literature with their data (see for example the nice review by Domínguez-Borràs and Vuilleumier, https://doi.org/10.1016/B978-0-12-823493-8.00015-8), but they should refrain from saying they have challenged a particular aspect if they have not even tested it. I would rather engage the authors to try and discuss the amygdala as a multipurpose center, that includes social cognition and emotion.

      We thank the reviewer for this important and nuanced point. We have revised the manuscript to adopt a more cautious and integrative tone regarding the function of the amygdala. In the revised Discussion (lines 341-355), we now explicitly state that the amygdala is involved in a broad range of processes—emotional, social, and affective—and that these domains are deeply intertwined. Rather than proposing a strict dissociation, we now suggest that the amygdala supports integrated socio-emotional functions that are mobilized differently across social tolerance styles. We also cite recent relevant literature (e.g., Domínguez-Borràs & Vuilleumier, 2021) to support this view and have removed any claim suggesting we challenge the emotional function of the amygdala per se. Our aim is to contribute to a richer understanding of how affective and social processes co-construct structural variation in this region.

      Strengths:

      Methods & breadth of species tested.

      Weaknesses:

      Interpretation, which can be described as 'oriented' and should rather offer additional views.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Private Comments:

      (1) Table 1 should be formatted for clarity i.e., bolded table headers, text realignment, and spacing. It was not clear at first glance how information was organized. It may also be helpful to place behavioral traits as the first column, seeing that these traits feed into the author's defined cognitive requirements.

      We have reformatted Table 1 to improve clarity and readability. Behavioral traits now appear in the first column, followed by cognitive dimensions and hypothesized neural correlates. Column headers have been bolded and alignment has been standardized.

      (2) Figures could include more detail to help with interpretations. For example, Figure 3 should define values included on the x-axis in the figure caption, and Figure 4 should explain the use of line, light color, and dark color. Figure 1 does not have a y-axis title.

      The figures have been revised and their legends completed for greater clarity.

      (3) Please proofread for typos throughout.

      The manuscript has been carefully proofread, and all typographical and grammatical errors have been corrected. These changes are visible in the tracked version.

      Reviewer #2 (Recommendations for the authors):

      Specific comments:

      (1) Given all of the variability would it not be a good idea to just compare (e.g. in the supplemental) the macaque data from just the Strasbourg centre for M. mulatta and M. tonkeana. I appreciate the ns will be lower, but other matters are more standardized.

      We fully understand the reviewer’s suggestion to restrict the comparison to data collected at a single site in order to minimize inter-site variability. However, as noted, such an analysis would come at the cost of statistical power, as the number of individuals per species within a single center is small. For example, while M. tonkeana is well represented at the Strasbourg centre, only one individual of M. mulatta is available from the same site. Thus, a restricted comparison would severely limit the interpretability of results, particularly for age-related trajectories. To address variability, we included acquisition site and brain preservation method as covariates or predictors where appropriate, and we have been cautious in our interpretations. We also now emphasize in the Methods and Discussion the value of future datasets with more standardized acquisition protocols across species and centers. We hope that by openly sharing our data and workflow, we can contribute to this broader goal.

      (2) I have various minor edits:

      (a) L 25 abstract - Specify what is meant by 'opposite trend'; the reader cannot infer what this is.

      Modified in line 25-28: “Unexpectedly, tolerant species exhibited a decrease in relative amygdala volume across the lifespan, contrasting with the age-related increase observed in intolerant species—a developmental pattern previously undescribed in primates.”

      (b) L67 - The reference 'Manyprimates' needs fixing as it does in the references section.

      After double-checking, Manyprimates studies are international collaborative efforts that are supposed to be cited this way (https://manyprimates.github.io/#pubs).

      (c) L74 - Taking not Taken.

      This typo has been corrected.

      (d) L129 - It says 'total volume', but this is corrected total volume?

      We have clarified in the figure legends that the “total brain volume” used in our analyses excludes the cerebellum and the myelencephalon, as specified in our image preprocessing protocol. This ensures consistency across individuals and institutions.

      (e) L138 - Suddenly mentions 'frozen condition' without any prior explanation - this needs explaining in the legend - also L144.

      We have added an explanation of the ‘frozen condition’ variable in the relevant figure legend.

      (f) L166 - Results - it would be helpful to remind readers what Grade 1 signifies, ie intolerant species.

      We now include a brief reminder in the Results section that Grade 1 corresponds to socially intolerant species, to help readers unfamiliar with the classification (Lines 240-251).

      (g) Figure 4 - Provide the ns for each of the 4 grades to help appreciate the meaningfulness of the curves, etc.

      The number of subjects has been added to the figure, and a novel analysis in the revised manuscript helps readers appreciate the meaningfulness of some of these curves.

      (h) L235 - 'we had assumed that species of high social tolerance grade would have presented a smaller amygdala in size compared to grade 1'. But surely this is the exact opposite of what is predicted in Table 1 - ie, the authors did not predict this as I read the paper (Unless Table 1 is misleading/ambiguous and needs clarification).

      As discussed in our response to Reviewer #2 and #3, we have restructured both Table 1 and the Discussion to ensure consistency. We now explicitly state that the findings diverge from our initial inhibitory-control-based prediction and propose alternative interpretations based on sociocognitive demands.

      (i) L270 - 'This observation' which?? Specify.

      We have replaced ‘this observation’ with a precise reference to the observed developmental decrease in amygdala volume in tolerant species.

      (j) L327 - 'groundbreaking' is just hype given that there are so many caveats - I personally do not like the word - novel is good enough.

      We have replaced the word ‘groundbreaking’ with ‘novel’ to adopt a more measured and appropriate tone in the discussion.

      (3) I might add that I am happy with the ethics regarding this study. 

      Thanks, we are also happy that we were able to study macaque brains from different species using opportunistic samplings along with already available data. We are collectively making progress on this!

      (4) Finally, I should commend the authors on all the additional information that they provide re gender/age/species. Given that there are twice as many females as males, it would be good to know if this affects the findings. I am not a primatologist, so I don't know, for example, if the females in Grade 1 monkeys are just as intolerant as the males?

      We thank the reviewer for this thoughtful comment. We now explicitly mention the female-biased sex ratio in the Methods section and report in the Results (Figure 2, Figure 3) that sex was included as a covariate in our Bayesian models. While a small effect of sex was found for hippocampal volume, no effect was observed for the amygdala. Given the strong imbalance in our dataset (2:1 female-to-male ratio), we refrained from drawing any conclusion about sex-specific patterns, as these would require larger and more balanced samples. Although we did not test for sex-by-grade interactions, we agree that this question—especially regarding whether females and males express social style differences similarly across grades—represents an important direction for future comparative work.

      Reviewer #3 (Recommendations for the authors):

      I found the article well-written and very easy to follow, so I have few improvements to propose to the authors, besides addressing the various major points when it comes to interpretation of the data.

      One list I found myself wanting was in fact the list of the social tolerance grades, and the process by which they got selected into 3 main bags of socio-cognitive skills. Then it would become interesting to see how each of the 12 species compares within both the 18 grades (maybe once again out of the scope of this paper, there are likely reviews out there that already do that, but then the authors should explicitly mention so in the paper: X, 19XX have compared 15 out of 18 traits in YY number of macaque species); and within the 3 major subcognitive requirements delineated by the authors, maybe as an annex?

      We thank the reviewer for this thoughtful suggestion. In the revised manuscript, we now include a detailed table (Table 1) that lists the 18 behavioral traits derived from Thierry’s framework, along with their associated cognitive dimension and hypothesized neuroanatomical correlate. While we did not create a matrix mapping each of the 12 species across all 18 traits due to space and data availability constraints, we agree this is an important direction that should be tackled by primatologists. We now include a sentence (lines 87-90) in the manuscript to guide readers to previous comparative reviews (e.g., Thierry, 2000; Thierry et al., 2004, 2021) that document the expression of these traits across macaque species. We also clarify that our three cognitive categories are conceptual tools intended to structure neuroanatomical predictions, and not formal clusters derived from quantitative analyses.

      In the annex, it would also be good to have a general summarizing excel/R file for the raw data, with important information like age, sex, and the relevant calculated volumes for each individual. The folders available following the links do not make it an easy task for a reader to find the raw data in one place.

      We fully agree with the reviewer on the importance of data accessibility. We have now uploaded an additional supplementary file in .csv format on our OSF repository, which includes individual-level metadata for all 42 macaques: species, sex, age, social grade, total brain volume, amygdala volume, and hippocampus volume. The link to this file is now explicitly mentioned in the Data Availability section. We hope this will facilitate comparisons with other datasets and improve usability for the community. In addition, we provide in a supplementary table the raw data that were used for our Bayesian modelling (see below).

      The availability of the raw data would also clear up one issue, which I believe results from the modelling process: it looks odd on Figure 2, that volume ratios, defined as the given brain area volume divided by the total brain volume, give values above 1 (especially for the hippocampus). As such, the authors should either modify the legend or the figure. In general, it would be nicer to have the "real values" somewhere easily accessible, so that they can be compared more broadly with: 1) other macaques species to address questions relevant to the species; 2) other primates to address other questions that are surely going to arise from this very interesting work!

      We thank the reviewer for pointing this out. The ratio values in Figure 1 correspond to the proportion of the regional volume (amygdala or hippocampus) relative to the total brain volume, excluding the cerebellum and myelencephalon. As such, values above 0.01 (i.e., above 1% of the brain volume) are expected for these structures and do not indicate an error. We have updated the figure legend to clarify this point explicitly. In addition, we have now made a cleaned .csv file available via OSF, containing all raw volumetric data and metadata in a format that facilitates cross-species or cross-study comparisons. This replaces the previous folder-based structure, which may have been less accessible.

      Typos:

      L233: delete 'in'

      L430: insert space in 'NMT template(Jung et al., 2021).'

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Recommendations for the authors):

      (1) My primary concern is that in some of the studies, there are not enough data points to be totally convincing. This is particularly apparent in the low z-force condition of Figure 1C.

      We agree that adequate sampling is essential for drawing robust conclusions. To address this concern, we performed a post hoc sensitivity analysis to assess the statistical power of our dataset. Given our sample sizes (N = 85 and 45) and observed variability, the experiment had 80% power (α = 0.05) to detect a difference in stall force of approximately 0.36 pN (Cohen’s d ≈ 0.38). The actual difference observed between conditions was 0.25 pN (d ≈ 0.26), which lies below the minimum detectable effect size. Thus, the non-significant result (p = 0.16) likely reflects that any true difference, if present, is smaller than the experimental sensitivity, rather than a lack of sufficient sampling.

      Importantly, both measured stall forces fall within the reported range for kinesin-1 in the literature, supporting that the dataset is representative and the measurements are reliable.

      (2) I'm also concerned about Figure 2B. Does each data point in the three graphs represent only a single event? If so, this should probably be repeated several more times to ensure that the data are robust.

      Each data point shown corresponds to the average of many processive runs, ranging from 32 to 167. This has been updated in the figure caption accordingly.

      (3) Figure 3. I'm surprised that the authors could not obtain a higher occupancy of the multivalent DNA tether with kinesin motors. They were adding up to a 30X higher concentration of kinesin, but still did not achieve stoichiometric labeling. The reasons for this should be discussed. This makes interpretation of the mechanical data much tougher. For instance, only 6-7% of the beads would be driven by three kinesins. Unless the movement of hundreds of beads were studied, I think it would be difficult to draw any meaningful insight, since most of the events would be reflective of beads with only one or sometimes two kinesins bound. I think more discussion is required to describe how these data were treated.

      The mass-photometry data in Figure 3B were acquired in the presence of a 3-fold molar excess of kinesin (Supplemental Figure 4) relative to the DNA chassis. In comparison, optical trapping studies were performed at a 10-20-fold molar excess of kinesin, resulting in a substantially higher percentage of chassis with multiple motors. The reason why we had to perform mass photometry measurements at lower molar excess than the optical trap is that at higher kinesin concentrations, the “kinesin-only” peak dominated and obscured 2- or 3-kinesin-bound species, preventing reliable fitting of the mass photometry data. 

      We have now used the mass photometry measurements to extrapolate occupancies under trapping conditions. We estimate 76-93% of 2-motor chassis are bound to two kinesins and ~70% of 3-motor chassis are bound to three kinesins under our trapping conditions. Moreover, the mean forces in Figures 3C–D exceed those expected for a single kinesin, consistent with occupancy substantially greater than one motor per chassis.

      We wrote: “To estimate the percentage of chassis with two and three motors bound, we performed mass photometry measurements at a 3-fold molar excess of kinesin to the chassis, as higher ratios would obscure the distinction of complexes from the kinesin-only population. Assuming there is no cooperativity among the binding sites, we modeled motor occupancy using a Binomial distribution (Figure 3_figure supplement 2). We observed 17-29% of particles corresponded to the two-motor species on the 2-motor chassis in mass photometry, indicating that 45-78% of the 2-motor chassis was bound to two kinesins. Similarly, 15% and 40% of the 3-motor chassis were bound to two and three kinesins, respectively.

      In optical trapping assays, we used 10-fold and 20-fold molar excess of kinesin for 2-motor and 3-motor chassis, respectively, to substantially increase the percentage of the chassis carried by multiple kinesins. Under these conditions, we estimate 76-93% of the 2-motor chassis were bound to two kinesins, and 30% and 70% of 3-motor chassis were bound to two and three kinesins, respectively.”

      “Multi-motor trapping assays were performed similarly using 10x and 20x kinesin for 2- and 3-motor chassis, respectively. To estimate the percentage of chassis with multiple motors, we used the probability of kinesin binding to a site on a chassis from mass photometry in the 3x excess condition to compute an effective dissociation constant where r is the molar ratio of kinesin to chassis. Single-site occupancy at higher molar excesses of kinesin was calculated using this parameter.”

      We also added Figure 3_figure supplement 2 to explain our Binomial model.
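
      As a minimal sketch of the Binomial occupancy model described above (not the analysis code used for the manuscript), the snippet below computes the probability that k of n independent binding sites carry a motor. The per-site occupancy `p_site = 0.9` is a hypothetical illustration value, not a number taken from the mass photometry data, and the function name is likewise an assumption.

```python
from math import comb


def occupancy_distribution(n_sites, p_site):
    """Return [P(0 bound), ..., P(n_sites bound)] for independent sites.

    Assumes no cooperativity: each site is occupied with probability p_site.
    """
    return [
        comb(n_sites, k) * p_site**k * (1 - p_site) ** (n_sites - k)
        for k in range(n_sites + 1)
    ]


# Hypothetical per-site occupancy, chosen only for illustration.
p_site = 0.9
two_motor = occupancy_distribution(2, p_site)
three_motor = occupancy_distribution(3, p_site)

print("2-motor chassis, fraction with 2 bound:", round(two_motor[2], 3))
print("3-motor chassis, fraction with 2 bound:", round(three_motor[2], 3))
print("3-motor chassis, fraction with 3 bound:", round(three_motor[3], 3))
```

      Under the independence assumption, the fully occupied fraction is simply `p_site ** n_sites`, which is why raising the molar excess of kinesin (and hence the per-site occupancy) increases the multi-motor fraction steeply.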

      (4) Page 5, 1st paragraph. Here, the authors are comparing time constants from stall experiments to data obtained with dynein from Ezber et al. This study used the traditional "one bead" trapping approach with dynein bound directly to the bead under conditions where it would experience high z-forces. Thus, the comparison between the behavior of kinesin at low z-forces is not necessarily appropriate. Has anyone studied dynein's mechanics under low z-force regimes?

      We thank the reviewer for catching a citation error. The text has been corrected to reference Elshenawy et al. 2020, which reported stall time constants for mammalian dynein. 

      To our knowledge, dynein’s mechanics under explicitly low z-force conditions have not yet been reported; however, given the more robust stalling behavior of dynein and greater collective force generation, the cited paper was chosen to compare low z-force kinesin to a motor that appears comparatively unencumbered by z-forces. Our study adds to growing evidence that high z-forces disproportionately limit kinesin performance. 

      For clarification, we modified that sentence as follows: “These time constants are comparable to those reported for minus-end-directed dynein under high z-forces”.

      Reviewer #2 (Recommendations for the authors):

      (1) P3 pp2, a DNA tensiometer cannot control the force, but it can measure it; get the distance between the two ends of the tensiometer, and apply WLC.

      The text has been updated to more accurately reflect the differences between optical trapping and kinesin motility against a DNA tensiometer with a fixed lattice position.

      (2) Fig. 2b, SEM is a poor estimate of error for exponentially distributed run lengths. Other methods, like bootstrapping an exponential distribution fit, may provide a more realistic estimate.

      Run lengths were plotted as an inverse cumulative distribution function and fitted to a single exponential decay (Supplementary Figure S3). The plotted value represents the fitted decay constant (characteristic run length) ± SE (standard error of the fit), not the arithmetic mean ± SEM. Velocity values are reported as mean ± SEM. Detachment rate was computed as velocity divided by run length, except at 6 and 10 pN hindering loads, where minimal forward displacement necessitated fitting run-time decays directly. In those cases, the plotted detachment rate equals the inverse of the fitted time constant. The figure caption has been updated accordingly.
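
      The fitting and bootstrapping ideas discussed in this exchange can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the manuscript: the run lengths are synthetic, the velocity is a hypothetical value, and a log-linear least-squares fit through the origin stands in for the nonlinear exponential fit of the inverse cumulative distribution.

```python
import math
import random


def fit_characteristic_length(run_lengths):
    """Fit log S(x) = -x / L to the empirical survival function S(x).

    Uses least squares through the origin; the last sorted point is
    dropped because its empirical survival is zero.
    """
    xs = sorted(run_lengths)
    n = len(xs)
    pts = [(x, math.log(1 - (i + 1) / n)) for i, x in enumerate(xs[:-1])]
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    return -sxx / sxy


def bootstrap_se(run_lengths, n_boot=200, seed=0):
    """Standard error of the fitted length from resampling with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(run_lengths) for _ in run_lengths]
        estimates.append(fit_characteristic_length(sample))
    mean = sum(estimates) / n_boot
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / (n_boot - 1))


# Synthetic run lengths (um) drawn from an exponential with mean 1 um.
rng = random.Random(1)
runs = [rng.expovariate(1.0) for _ in range(500)]

length = fit_characteristic_length(runs)
se = bootstrap_se(runs)
velocity = 0.8  # um/s, hypothetical
detachment_rate = velocity / length  # 1/s, velocity divided by run length

print(f"characteristic run length: {length:.2f} +/- {se:.2f} um")
print(f"detachment rate: {detachment_rate:.2f} 1/s")
```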

      (3) Kinesin-1 is covalently bound to a DNA oligo, which then attaches to the DNA chassis by hybridization. This oligo is 21 nt with a relatively low GC%. At what force does this oligo unhybridize? Can the authors verify that their stall force measurements are not cut short by the oligo detaching from the chassis?

      The 21-nt attachment oligo (38% GC) is predicted to have ΔG (37 °C) ≈ −25 kcal/mol, or approximately 42 kT. If we assume this is the approximate amount of work required to unhybridize the oligo, we would expect the rupture force to be >15 pN. This significantly exceeds the stall force of a single kinesin. Since the stalling events rarely exceed a few seconds, it is unlikely that our oligos quickly detach from the chassis under such low forces.
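
      The order of magnitude of this estimate can be checked with a short unit conversion. The sketch below is illustrative only: the unbinding distance of 7.14 nm (21 base pairs at an assumed rise of 0.34 nm each) is a hypothetical choice made for the example, not a value stated in the response, and dividing work by distance gives only a crude mean-force estimate.

```python
K_B = 1.380649e-23  # Boltzmann constant, J/K
N_A = 6.02214076e23  # Avogadro constant, 1/mol
KCAL_TO_J = 4184.0  # joules per kilocalorie


def kcal_per_mol_to_kT(energy_kcal_per_mol, temp_k):
    """Convert a molar free energy to units of thermal energy kT."""
    return energy_kcal_per_mol * KCAL_TO_J / (N_A * K_B * temp_k)


def kT_in_pN_nm(temp_k):
    """Thermal energy in pN*nm (1 J = 1e21 pN*nm)."""
    return K_B * temp_k * 1e21


def mean_force_pN(work_kT, distance_nm, temp_k):
    """Crude estimate: work divided by the distance over which it is done."""
    return work_kT * kT_in_pN_nm(temp_k) / distance_nm


temp_k = 310.15  # 37 degrees C
work_kT = kcal_per_mol_to_kT(25.0, temp_k)
distance_nm = 21 * 0.34  # hypothetical unbinding distance, see lead-in
force = mean_force_pN(work_kT, distance_nm, temp_k)

print(f"25 kcal/mol is about {work_kT:.1f} kT at 37 C")
print(f"work / distance is about {force:.0f} pN over {distance_nm:.2f} nm")
```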

      Furthermore, optical trapping experiments are tuned such that no more than 30% of beads display motion within several minutes after they are brought near microtubules. After stalling events, the motor dissociates from the MT, and the bead snaps back to the trap center. Most beads robustly reengage with the microtubule, typically within 10 s, suggesting that the same motor chassis reengages with the microtubule after microtubule detachment. Successive runs of the same bead typically have similar stall forces, suggesting that the motors do not disengage from the chassis under resistive forces exerted by the trap.

      (4) Figure 1, a justification or explanation should be provided for why events lower than 1.5 pN were excluded. It appears arbitrary.

      Single-motor stall-force measurements used a trap stiffness of 0.08–0.10 pN/nm. At this stiffness, a 1.5 pN force corresponds to 15–19 nm bead displacement, roughly two kinesin steps, and events below this threshold could not be reliably distinguished from Brownian noise. For this reason, forces < 1.5 pN were excluded.

      In Methods, we wrote “Only peak forces above 1.5 pN (corresponding to a 15-19 nm bead displacement) were analyzed to clearly distinguish runs from the tracking noise.”
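
      The displacement range quoted above follows from Hooke's law for the trap, x = F / k. A minimal check is sketched below; the 8 nm step size is the commonly cited kinesin-1 step and is included here as an assumption for the "roughly two steps" comparison.

```python
def displacement_nm(force_pN, stiffness_pN_per_nm):
    """Bead displacement from the trap center for a given force (Hooke's law)."""
    return force_pN / stiffness_pN_per_nm


THRESHOLD_PN = 1.5
STEP_NM = 8.0  # assumed kinesin-1 step size

for stiffness in (0.08, 0.10):
    x = displacement_nm(THRESHOLD_PN, stiffness)
    print(f"k = {stiffness:.2f} pN/nm: {x:.2f} nm, about {x / STEP_NM:.1f} steps")
```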

      (5) Figure 2b, is the difference in velocity statistically significant?

      The difference in velocity is statistically significant for most conditions. We did not compare velocities for -10 and -6 pN as these conditions resulted in little forward displacement. However, the p-values for all of the other conditions are -4 pN: 0.0026, -2 pN: 0.0001, -1 pN: 0.0446, +0.5 pN: 0.3148, +2 pN: 0.0001, +3 pN: 0.1191, +4 pN: 0.0004.

      (6) The number of measurements for each experimental datapoint should be provided in the corresponding figure caption. SEM is used, but N is not reported in the caption.

      Figure captions have now been updated to report the number of trajectories (N) for each data point.

      Reviewer #3 (Recommendations for the authors):  

      (1) The method of DNA-tethered motor trapping to enable low z-force is not entirely novel, but adapted from Urbanska (2021) for use in conventional optical trapping laboratories without reliance on microfluidics. However, I appreciate that they have fully established it here to share with the community. The authors could strengthen their methods section by being transparent about protein weight, protein labelling, and DNA ladders shown in the supplementary information. What organism is the protein from? Presumably human, but this should be specified in the methods. While the figures show beautiful data and exemplary traces, the total number of molecules analysed or events is not consistently reported. Overall, certain methodological details should be made sufficient for reproducibility.

      We appreciate the reviewer’s attention to methodological clarity. The constructs used are indeed human kinesin-1, KIF5B. The Methods now specify protein origin, molecular weights, and labeling details, and all figure captions report the number of trajectories analyzed to ensure reproducibility.

      (2) The major limitation the study presents is overarching generalisability, starting with the title. I recommend that the title be specific to kinesin-1. 

      The title has been revised to specify kinesin-1. 

      The study uses two constructs: a truncated K560 for conventional high-force assays, and full-length Kif5b for the low z-force method. However, for the multi-motor assay, the authors use K560 with the rationale of preventing autoinhibition due to binding with DNA, but that would also have limited characterisation in the single-molecule assay. Overall, the data generated are clear, high-quality, and exciting in the low z-force conditions. But why have they not compared or validated their findings with the truncated construct K560? This is especially important in the force-feedback experiments and in comparison with Andreasson et al. and Carter et al., who use Drosophila kinesin-1. Could kinesin-1 across organisms exhibit different force-detachment kinetics? It is quite possible. 

      Construct choice was guided by physiological relevance and considerations of autoinhibition: K560 was used for high z-force single-motor assays. The results of these assays are consistent with conventional bead assays performed by Andreasson et al. and Carter et al. using kinesin from a different organism. Therefore, we do not believe there are major differences between force properties of Drosophila and human kinesin-1.

      For low z-force assays, we used full-length KIF5B, which has nearly identical velocity and stall force to K560 in standard bead assays. We used this construct for low z-force assays because it has a longer and more flexible stalk than K560 and better represents the force behavior of kinesin under physiological conditions. We then used constitutively-active K560 motors for multi-motor experiments to avoid potential complications from autoinhibition of full-length kinesin.

      Similarly, the authors test backward slipping of Kif5b and K560 and measure dwell times in multi-motor assays. Why not detail the backward slippage kinetics of Kif5b and any step-size impact under low z-forces? For instance, with the traces they already have, the authors could determine slip times, distances, and frequency in horizontal force experiments. Overall, the manuscript could be strengthened by analysing both constructs more fully.

      Slip or backstep analyses were not performed on single-motor data because such events were rare; kinesin typically detached rather than slipped. In contrast, multi-motor assays exhibited frequent slip events corresponding to the detachment of individual motors, which were analyzed in detail.

      We wrote “In comparison, slipping events were rarely observed in beads driven by a single motor, suggesting that kinesin typically detaches rather than slipping back on the microtubule under hindering loads.”

      Appraisal and impact:

      This study contributes to important and debated evidence on kinesin-1 force-detachment kinetics. The authors conclude that kinesin-1 exhibits a slip-bond interaction with the microtubule under increasing forces, while other recent studies (Noell et al. and Kuo et al.), which also use low z-force setups, conclude catch-bond behaviour under hindering loads. I find the results not fully aligned with their interpretation. The first comparison of low z-forces in their setup with Noell et al. (2024), based on stall times, does not hold, because it is an apples-to-oranges comparison. Their data show a stall time constant of 2.52 s, which is comparable to the 3 s reported by Noell et al., but the comparison is made with a weighted average of 1.49 s. The authors do report that detachment rates are lower in low z-force conditions under unloaded scenarios. So, to completely rule out catch-bond-like behaviour is unfair. That said, their data quality is good and does show that higher hindering forces lead to higher detachment rates. However, on closer inspection, the range of 0-5 pN shows either a decrease or no change in detachment rate, which suggests that under a hindering force threshold, catch-bond-like or ideal-bond-like behaviour is possible, followed by slip-bond behaviour, which is amazing resolution. Under assisting loads, the slip-bond character is consistent, as expected. Overall, the study contributes to an important discussion in the biophysical community and is needed, but requires cautious framing, particularly without evidence of motor trapping in a high microtubule-affinity state rather than genuine bond strengthening.

      We are not completely ruling out catch bond behavior in our manuscript. As the reviewer pointed out, our results are consistent with the asymmetric slip bond model, whereas DNA tensiometer assays are more consistent with catch bond behavior. The advantage of our approach is the capability to directly control the magnitude and direction of load exerted on the motor in the horizontal axis and measure the rate at which the motor detaches from the microtubule as it walks under constant load. In comparison, DNA tensiometer assays cannot control the force, but measure the time it takes the motor to fall off from the microtubule after a brief stall. The extension of the DNA tether is used to estimate the force exerted on the motor during a stall in those assays. The slight disadvantage of our method is the presence of low z-forces, whereas DNA tensiometer assays are expected to have little to no z-force. We wrote that the discrepancy between our results can be attributed to the presence of low z-forces in our DNA-tethered trapping assembly, which may result in a higher-than-normal detachment rate under high hindering loads, thereby resulting in less asymmetry in the force-detachment kinetics. We also added that this discrepancy can be addressed by future studies that directly control and measure horizontal force and measure the motor detachment rate in the absence of z-forces. Optical trapping assays with small nanoparticles (Sudhakar et al. Science 2021) may be well suited to conclusively reveal the bond characteristics of kinesin under hindering loads.

      Reviewing Editor Comments:

      The reviewers are in agreement with the importance of the findings and the quality of the results. The use of the DNA tether reduces the z-force on the motor and provides biologically relevant insight into the behavior of the motor under load. The reviewers' suggestions are constructive and focus on bolstering some of the data points and clarifying some of the methodological approaches. My major suggestion would be to clarify the rationale for concluding that kinesin-1 exhibits slip-bond behavior with increasing force in light of the work of Noell (10.1101/2024.12.03.626575) and Kuo et al (2022 10.1038/s41467-022-31069-x), both of which take advantage of DNA tethers.

      Please see our response to the previous comment. In the revised manuscript, we first clarified that our results are in agreement with previous theoretical (Khataee & Howard, 2019) and experimental studies (Kuo et al., 2022; Noell et al., 2024; Pyrpassopoulos et al., 2020) that kinesin exhibits slower detachment under hindering load. This asymmetry became clear when the z-force was reduced or eliminated. 

      We clarified the differences between our results and DNA tensiometer assays and provided a potential explanation for these discrepancies. We also proposed that future studies might be required to fully distinguish between asymmetric slip, ideal, or catch bonding of kinesin under hindering loads.

      We wrote:

      “Our results agree with the theoretical prediction that kinesin exhibits higher asymmetry in force-detachment kinetics without z-forces (Khataee & Howard, 2019), and are consistent with optical trapping and DNA tensiometer assays that reported more persistent stalling of kinesin in the absence of z-forces (Kuo et al., 2022; Noell et al., 2024; Pyrpassopoulos et al., 2020).

      Force-detachment kinetics of protein-protein interactions have been modeled as either a slip, ideal, or catch bond, which exhibit an increase, no change, or a decrease in detachment rate, respectively, under increasing force (Thomas et al., 2008). Slip bonds are most commonly observed in biomolecules, but studies on cell adhesion proteins reported a catch bond behavior (Marshall et al., 2003). Although previous trapping studies of kinesin reported a slip bond behavior (Andreasson et al., 2015; Carter & Cross, 2005), recent DNA tensiometer studies that eliminated the z-force showed that the detachment rate of the motor under hindering forces is lower than that of an unloaded motor walking on the microtubule (Kuo et al., 2022; Noell et al., 2024), consistent with the catch bond behavior. Unlike these reports, we observed that the stall duration of kinesin is shorter than the motor run time under unloaded conditions, and the detachment rate of kinesin increases with the magnitude of the hindering force. Therefore, our results are more consistent with the asymmetric slip bond behavior. The difference between our results and the DNA tensiometer assays (Kuo et al., 2022; Noell et al., 2024) can be attributed to the presence of low z-forces in our DNA-tethered optical trapping assays, which may increase the detachment rate under high hindering forces. Future studies that could directly control hindering forces and measure the motor detachment rate in the absence of z-forces would be required to conclusively reveal the bond characteristics of kinesin under hindering loads.”
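
      The three bond types named in this passage can be illustrated with a toy Bell-type rate law, k(F) = k0 · exp(F·d / kT). This is a generic textbook form used here only as a sketch; it is not the model fitted in the manuscript, and the values of `k0`, `distance_nm`, and `KT_PN_NM` below are hypothetical. A negative distance parameter is a purely phenomenological stand-in for catch-bond behavior.

```python
import math

KT_PN_NM = 4.1  # approximate thermal energy near room temperature, pN*nm


def detachment_rate(force_pN, k0, distance_nm):
    """Bell-type detachment rate under load.

    distance_nm > 0: slip bond (rate increases with force)
    distance_nm = 0: ideal bond (rate independent of force)
    distance_nm < 0: catch-like bond (rate decreases with force)
    """
    return k0 * math.exp(force_pN * distance_nm / KT_PN_NM)


forces = [0.0, 2.0, 4.0, 6.0]  # pN
k0 = 1.0  # unloaded detachment rate, 1/s, hypothetical

slip = [detachment_rate(f, k0, 1.0) for f in forces]
ideal = [detachment_rate(f, k0, 0.0) for f in forces]
catch = [detachment_rate(f, k0, -1.0) for f in forces]

for name, rates in (("slip", slip), ("ideal", ideal), ("catch", catch)):
    print(name, [round(r, 2) for r in rates])
```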

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This paper undertakes an important investigation to determine whether movement slowing in microgravity is due to a strategic conservative approach or rather due to an underestimation of the mass of the arm. While the experimental dataset is unique and the coupled experimental and computational analyses comprehensive, the authors present incomplete results to support the claim that movement slowing is due to mass underestimation. Further analysis is needed to rule out alternative explanations.

      We thank the editor and reviewers for the thoughtful and constructive comments, which helped us substantially improve the manuscript. In this revised version, we have made the following key changes:

      - Directly presented the differential effect of microgravity in different movement directions, showing its quantitative match with model predictions.

      - Showed that changing cost function with the idea of conservative strategy is not a viable alternative.

      - Showed our model predictions remain largely the same after adding Coriolis and centripetal torques.

      - Discussed alternative explanations including neuromuscular deconditioning, friction, body stability, etc.

      - Detailed the model description and moved it to the main text, as suggested.

      Our point-to-point response is numbered to facilitate cross-referencing.

      We believe the revisions and the responses adequately address the reviewers’ concerns, and the new analysis results strengthen our conclusion that mass underestimation is the major contributor to movement slowing in microgravity.

      Reviewer #1 (Public review):

      Summary:

      This article investigates the origin of movement slowdown in weightlessness by testing two possible hypotheses: the first is based on a strategic and conservative slowdown, presented as a scaling of the motion kinematics without altering its profile, while the second is based on the hypothesis of a misestimation of effective mass by the brain due to an alteration of gravity-dependent sensory inputs, which alters the kinematics following a controller parameterization error.

      Strengths:

      The article convincingly demonstrates that trajectories are affected in 0g conditions, as in previous work. It is interesting, and the results appear robust. However, I have two major reservations about the current version of the manuscript that prevent me from endorsing the conclusion in its current form.

      Weaknesses:

      (1) First, the hypothesis of a strategic and conservative slow down implicitly assumes a similar cost function, which cannot be guaranteed, tested, or verified. For example, previous work has suggested that changing the ratio between the state and control weight matrices produced an alteration in movement kinematics similar to that presented here, without changing the estimated mass parameter (Crevecoeur et al., 2010, J Neurophysiol, 104 (3), 1301-1313). Thus, the hypothesis of conservative slowing cannot be rejected. Such a strategy could vary with effective mass (thus showing a statistical effect), but the possibility that the data reflect a combination of both mechanisms (strategic slowing and mass misestimation) remains open.

      Response (1): Thank you for raising this point. The basic premise of this concern is that changing the cost function to implement strategic slowing can reproduce our empirical findings, so the alternative hypothesis that we aimed to refute in the paper remains possible. At the least, it could co-exist with our hypothesis of mass underestimation. In the revision, we show that changing the cost function only, as suggested here, cannot produce the behavioral patterns observed in microgravity.

      As suggested, we modified the relative weighting of the state and control cost matrices (i.e., Q and R in the cost function Eq 15) without considering mass underestimation. While this cost function scaling can decrease peak velocity – a hallmark of strategic slowing – it also inevitably leads to later peak timings. This is opposite to our robust findings: the taikonauts consistently “advanced” their peak velocity and peak acceleration in time. Note, these model simulation patterns have also been shown in Crevecoeur et al. (2010), the paper mentioned by the reviewer (see their Figure 7B).

      We systematically changed the ratio between the state and control weight matrices in the simulation, as suggested. We divided Q and multiplied R by the same factor α, the cost-function scaling parameter defined in Crevecoeur et al. (2010). This adjustment models a shift in movement strategy in microgravity, and we tested a wide range of α to examine a reasonable parameter space. Simulation results for α = 3 and α = 0.3 are shown in Figure 1—figure supplement 2 and Figure 1—figure supplement 3 respectively. As expected, with α = 3 (higher control effort penalty), peak velocities and accelerations are reduced, but their timing is delayed. Conversely, with α = 0.3, both peak amplitude and timing increase. Hence, changing the cost function to implement a conservative strategy cannot produce the kinematic pattern observed in microgravity, which is a combination of movement slowing and peak timing advance.
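
      The effect of the α scaling can be illustrated with a toy controller. The sketch below is not the authors' finite-horizon LQG model: it assumes a one-dimensional point mass, the closed-form infinite-horizon LQR gains for that system, and illustrative values for the mass, weights, and reach distance. It only shows that dividing Q and multiplying R by α = 3 lowers peak speed and delays its timing in this simplified setting.

```python
import math

def lqr_gains(mass, q_pos, r_ctrl):
    """Closed-form continuous-time LQR gains for a point mass.

    Cost integrand: q_pos * x**2 + r_ctrl * u**2 (position error and effort).
    """
    k_pos = math.sqrt(q_pos / r_ctrl)
    k_vel = math.sqrt(2.0 * mass * k_pos)
    return k_pos, k_vel

def simulate_reach(alpha, mass=1.0, q_pos=3000.0, r_ctrl=1.0,
                   start=-0.12, dt=0.001, duration=2.0):
    """Reach from `start` (m) to a target at 0 with Q/alpha and R*alpha.

    All default values are hypothetical, chosen only for illustration.
    Returns (peak speed in m/s, time of peak speed in s).
    """
    k_pos, k_vel = lqr_gains(mass, q_pos / alpha, r_ctrl * alpha)
    x, v = start, 0.0
    peak_speed, peak_time = 0.0, 0.0
    for step in range(int(duration / dt)):
        force = -(k_pos * x + k_vel * v)   # state-feedback control
        v += force / mass * dt             # semi-implicit Euler step
        x += v * dt
        if abs(v) > peak_speed:
            peak_speed, peak_time = abs(v), (step + 1) * dt
    return peak_speed, peak_time

baseline = simulate_reach(alpha=1.0)
conservative = simulate_reach(alpha=3.0)
print("alpha = 1:", baseline)
print("alpha = 3:", conservative)
```

      In this toy system the closed-loop natural frequency falls as α grows, so the slower peak necessarily arrives later, which is the qualitative pattern the response contrasts with the earlier peaks observed in flight.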

      Therefore, we conclude that a change in optimal control strategy alone is insufficient to explain our empirical findings. Logically speaking, we cannot refute the possibility of strategic slowing, which can still exist on top of the mass underestimation we proposed here. However, our data does not support its role in explaining the slowing of goal-directed hand reaching in microgravity. We have added these analyses to the Supplementary Materials and expanded the Discussion to address this point.

      (2) The main strength of the article is the presence of directional effects expected under the hypothesis of mass estimation error. However, the article lacks a clear demonstration of such an effect: indeed, although there appears to be a significant effect of direction, I was not sure that this effect matched the model's predictions. A directional effect is not sufficient because the model makes clear quantitative predictions about how this effect should vary across directions. In the absence of a quantitative match between the model and the data, the authors' claims regarding the role of misestimating the effective mass remain unsupported.

      Response (2): First, we have to clarify that our study does not aim to quantitatively fit observed hand trajectories. The two-link arm model simulates an ideal case of moving a point mass (effective mass) on a horizontal plane without friction (Todorov, 2004; 2005). In contrast, in the experiment, participants moved their hand on a tabletop without vertical arm support, so the movement was not strictly planar and was affected by friction. Thus, this kind of model can only illustrate qualitative differences between conditions, as in the majority of similar modeling studies (e.g., Shadmehr et al., 2016). In our study, qualitative simulation means the model is intended to reproduce the directional differences between conditions, not exact numeric values, in key kinematic measures. Specifically, it should capture how the peak velocity and acceleration amplitudes and their timings differ between normal gravity and microgravity (particularly under the mass-underestimation assumption).

      Second, the reviewer rightfully pointed out that the directional effect is essential for our theorization of the importance of mass underestimation. However, the directional effect has two aspects, which were not clearly presented in our original manuscript. We now clarify both here and in the revision. The first aspect is that key kinematic variables (peak velocity/acceleration and their timing) are affected by movement direction, even before any potential microgravity effect. This is shown by the ranking order of directions for these variables (Figure 1C-H). The direction-dependent ranking, confirmed by pre-flight data, indicates that effective mass is a determining factor for reaching kinematics, which motivated us to study its role in eliciting movement slowing in space. This was what our original manuscript emphasized and clearly presented.
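
      To make the notion of direction-dependent effective mass concrete, the following sketch computes the effective mass at the hand of a planar two-link arm as 1 / (uᵀ J M⁻¹ Jᵀ u) for a unit direction u. The segment parameters, the arm posture, and the mapping of 45°, 90°, and 135° onto the arm's coordinate frame are all assumptions chosen for illustration; they are not the values used in the manuscript.

```python
import math

def effective_mass(direction_deg, shoulder_deg=45.0, elbow_deg=90.0):
    """Effective mass (kg) at the hand along a planar direction.

    Segment parameters below are hypothetical, textbook-style values.
    """
    l1, l2 = 0.30, 0.33          # link lengths (m)
    m1, m2 = 1.4, 1.0            # link masses (kg)
    r1, r2 = 0.11, 0.16          # centre-of-mass distances (m)
    i1, i2 = 0.025, 0.045        # link inertias about the centre of mass (kg m^2)
    q1, q2 = math.radians(shoulder_deg), math.radians(elbow_deg)

    # Joint-space inertia matrix M(q) of the two-link arm.
    m11 = i1 + i2 + m1 * r1**2 + m2 * (l1**2 + r2**2 + 2 * l1 * r2 * math.cos(q2))
    m12 = i2 + m2 * (r2**2 + l1 * r2 * math.cos(q2))
    m22 = i2 + m2 * r2**2
    det = m11 * m22 - m12**2

    # Hand Jacobian J(q).
    j11 = -l1 * math.sin(q1) - l2 * math.sin(q1 + q2)
    j12 = -l2 * math.sin(q1 + q2)
    j21 = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    j22 = l2 * math.cos(q1 + q2)

    # w = J^T u maps the unit hand direction into joint space.
    theta = math.radians(direction_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    w1 = j11 * ux + j21 * uy
    w2 = j12 * ux + j22 * uy

    # Mobility u^T J M^-1 J^T u, using the explicit 2x2 inverse of M.
    mobility = (m22 * w1**2 - 2 * m12 * w1 * w2 + m11 * w2**2) / det
    return 1.0 / mobility

for angle in (45, 90, 135):
    print(angle, round(effective_mass(angle), 3))
```

      Because the result depends on posture as well as direction, any ranking of directions obtained this way holds only for the assumed arm configuration.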

      The second aspect is that the hypothetical mass underestimation might also differentially affect movements in different directions. This was not clearly presented in the original manuscript. However, we would not expect a quantitative match between model predictions and empirical data, for the reasons mentioned above. We now show this directional ranking in microgravity-elicited kinematic changes in both model simulations and empirical data. The overall trend is that the microgravity effect indeed differs between directions, and the model predictions and the data showed a reasonable qualitative match (Author response image 1 below).

      As shown in Author response image 1, for amplitude changes (Δ peak speed, Δ peak acceleration) both the model and the mean of the empirical data show the same directional ordering (45° > 90° > 135°) in the pre-in and post-in comparisons. For timing (Δ peak-speed time, Δ peak-acceleration time), which we consider the most diagnostic, the same directional ranking was observed. We found only one deviation: the predicted sign (earlier peaks) was confirmed at 90° and 135°, but not at 45°. As discussed in Response (6), the absence of timing advance at 45° may reflect limitations of our simplified model, which did not consider that the 45° direction is essentially a single-joint reach. Taken together, the directional pattern is largely consistent with the model predictions based on mass underestimation. The model successfully reproduces the directional ordering of the amplitude measures (peak velocity and peak acceleration). It also captures the sign of the timing changes in two of the three directions. We added these new analysis results in the revision and expanded the Discussion accordingly.

      The details of our analysis on directional effects: We compared the model predictions (Author response image 1, left) with the experimental data (Author response image 1, right) across the three tested directions (45°, 90°, 135°). In the experimental data panels, both Δ(pre-in) (solid bars) and Δ(post-in) (semi-transparent bars) with standard error are shown. The directional trends are remarkably similar between model prediction and actual data. The post-in comparison is less aligned with model prediction; we postulate that the incomplete after-flight recovery (i.e., post data had not returned to pre-flight baselines) might obscure the microgravity effect. Incomplete recovery has also been shown in our original manuscript: peak speed and peak acceleration did not fully recover in post-flight sessions when compared to pre-flight sessions. To further quantify the correspondence between model and data, we performed repeated-measures correlation (rm-corr) analyses. We found significant within-subject correlations for three of the four metrics. For pre–in, Δ peak speed time (r<sub>rm</sub> = 0.627, t(23) = 3.858, p < 0.001), Δ peak acceleration time (r<sub>rm</sub> = 0.591, t(23) = 3.513, p = 0.002), and Δ peak acceleration (r<sub>rm</sub> = 0.573, t(23) = 3.351, p = 0.003) were significant, whereas Δ peak speed was not (r<sub>rm</sub> = 0.334, t(23) = 1.696, p = 0.103). These results thus show that the directional effect, as predicted by our model, is observed both before spaceflight and in spaceflight (the pre-in comparison).
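
      The reported rm-corr statistics can be checked for internal consistency with the usual relation t = r·√df / √(1 − r²). The sketch below assumes the standard rm-corr error degrees of freedom, N(k − 1) − 1, with N = 12 participants and k = 3 directions (one observation per participant per direction); that design detail is an assumption, although it does reproduce the reported df of 23.

```python
import math

def t_from_r(r, df):
    """t statistic implied by a correlation coefficient r with df degrees of freedom."""
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)

# Assumed design: 12 participants, 3 directions each.
n_participants, n_directions = 12, 3
df = n_participants * (n_directions - 1) - 1

# (r_rm, reported t) pairs quoted in the response for the pre-in comparison.
reported = {
    "delta peak speed time": (0.627, 3.858),
    "delta peak acceleration time": (0.591, 3.513),
    "delta peak acceleration": (0.573, 3.351),
    "delta peak speed": (0.334, 1.696),
}

for name, (r, t_reported) in reported.items():
    t_implied = t_from_r(r, df)
    print(f"{name}: implied t = {t_implied:.3f}, reported t = {t_reported:.3f}")
```

      Small differences in the third decimal place are expected because the quoted r values are themselves rounded.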

      Author response image 1.

      Directional comparison between model predictions and experimental data across the three reach directions (45°, 90°, 135°). Left: model outputs. Right: experimental data shown as Δ relative to the in-flight session; solid bars = Δ(in − pre) and semi-transparent bars = Δ(in − post). Colors encode direction consistently across panels (e.g., 45° = darker hue, 90° = medium, 135° = lighter/orange). Panels (clockwise from top-left): Δ peak speed (cm/s), Δ peak speed time (ms), Δ peak acceleration time (ms), and Δ peak acceleration (cm/s²). Bars are group means; error bars denote standard error across participants.

      Citations:

      Todorov, E. (2004). Optimality principles in sensorimotor control. Nature Neuroscience, 7(9), 907.

      Todorov, E. (2005). Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Neural Computation, 17(5), 1084–1108.

      Shadmehr, R., Huang, H. J., & Ahmed, A. A. (2016). A Representation of Effort in Decision-Making and Motor Control. Current Biology: CB, 26(14), 1929–1934.

      In general, both the hypotheses of slowing motion (out of caution) and misestimating mass have been put forward in the past, and the added value of this article lies in demonstrating that the effect depended on direction. However, (1) a conservative strategy with a different cost function can also explain the data, and (2) the quantitative match between the directional effect and the model's predictions has not been established.

      We agree that both hypotheses have been put forward before; however, they are competing hypotheses that have not been resolved. Furthermore, the mass underestimation hypothesis has so far been a conjecture without solid evidence; previous reports of mass underestimation for objects cannot directly translate to underestimation of the body's mass. As detailed in our responses above, we have shown that a conservative strategy implemented via a different cost function cannot reproduce the key findings in our dataset, thereby supporting the alternative hypothesis of mass underestimation. Moreover, we found qualitative agreement between the model predictions and the experimental data in terms of directional effects, which further strengthens our interpretation.

      Specific points:

      (1) I noted a lack of presentation of raw kinematic traces, which would be necessary to convince me that the directional effect was related to effective mass as stated.

      Response (3): We are happy to include representative speed and acceleration trajectories. Kinematic profiles from one example participant are shown in Figure 2—figure supplement 6.

      (2) The presentation and justification of the model require substantial improvement; the reason for their presence in the supplementary material is unclear, as there is space to present the modelling work in detail in the main text. Regarding the model, some choices require justification: for example, why did the authors ignore the nonlinear Coriolis and centripetal terms?

      Response (4): Great suggestion. In the revision, we have moved the model into the main text and added further justification for using this simple model.

      We initially omitted the nonlinear Coriolis and centripetal terms in order to start with a minimal model. Importantly, excluding these terms does not affect the model’s main conclusions. In the revision we added simulations that explicitly include these terms. The full explanation and simulations are provided in Supplementary Note 2 (we placed this material in the Supplement to limit the amount of main text devoted to the model). More explanation can also be found in our response to Reviewer 2 (Response (6)). The results indicate that, although these velocity-dependent forces show some directional anisotropy, their contribution is substantially smaller than that of the included inertial component; specifically, they have only a negligible impact on the predicted peak amplitudes and peak times.

      (3) The increase in the proportion of trials with subcomponents is interesting, but the explanatory power of this observation is limited, as the initial percentage was already quite high (from 60-70% during the initial study to 70-85% in flight). This suggests that the potential effect of effective mass only explains a small increase in a trend already present in the initial study. A more critical assessment of this result is warranted.

      Response (5): Thank you for your thoughtful comment. You are correct that the increase in the percentage of trials with submovements is modest, but a more critical change was observed in the timing between submovement peaks—specifically, the inter-peak interval (IPI). These intervals became longer during flight. Taken together with the percentage increase, the submovement changes significantly predicted the increase in movement duration, as shown by our linear mixed-effects model, which indicated that IPI increased.

      Reviewer #2 (Public review):

      This study explores the underlying causes of the generalized movement slowness observed in astronauts in weightlessness compared to their performance on Earth. The authors argue that this movement slowness stems from an underestimation of mass rather than a deliberate reduction in speed for enhanced stability and safety.

      Overall, this is a fascinating and well-written work. The kinematic analysis is thorough and comprehensive. The design of the study is solid, the collected dataset is rare, and the model tends to add confidence to the proposed conclusions. That being said, I have several comments that could be addressed to consolidate interpretations and improve clarity.

      Main comments:

      (1) Mass underestimation

      a) While this interpretation is supported by data and analyses, it is not clear whether this gives a complete picture of the underlying phenomena. The two hypotheses (i.e., mass underestimation vs deliberate speed reduction) can only be distinguished in terms of velocity/acceleration patterns, which should display specific changes during the flight with a mass underestimation. The experimental data generally shows the expected changes but for the 45° condition, no changes are observed during flight compared to the pre- and post-phases (Figure 4). In Figure 5E, only a change in the primary submovement peak velocity is observed for 45°, but this finding relies on a more involved decomposition procedure. It suggests that there is something specific about 45° (beyond its low effective mass). In such planar movements, 45° often corresponds to a movement which is close to single-joint, whereas 90° and 135° involve multi-joint movements. If so, the increased proportion of submovements in 90° and 135° could indicate that participants had more difficulties in coordinating multi-joint movements during flight. Besides inertia, Coriolis and centripetal effects may be non-negligible in such fast planar reaching (Hollerbach & Flash, Biol Cyber, 1982) and, interestingly, they would also be affected by a mass underestimation (thus, this is not necessarily incompatible with the authors' view; yet predicting the effects of a mass underestimation on Coriolis/centripetal torques would require a two-link arm model). Overall, I found the discrepancy between the 45° direction and the other directions under-exploited in the current version of the article. In sum, could the corrective submovements be due to a misestimation of Coriolis/centripetal torques in the multi-joint dynamics (caused specifically -or not- by a mass underestimation)?

      Response (6): Thank you for raising these important questions. We unpacked the whole paragraph into two concerns: 1) the possibility that misestimation of Coriolis and centripetal torques might lead to corrective submovements, and 2) the weak effect in the 45° direction, which was under-exploited. Both concerns are valid but addressable, and they do not change our general conclusions based on our empirical findings (see Supplementary Note 2: Coriolis and centripetal torques have minimal impact).

      Possible explanation for the 45° discrepancy

      We agree with the reviewer that the 45° direction likely involves more single-joint (elbow-dominant) movement, whereas the 90° and 135° directions require greater multi-joint (elbow + shoulder) coordination. This is particularly relevant when the workspace is near the body midline (e.g., Haggard et al., 1995), as is the case in our experimental setup. To demonstrate this, we examined the curvature of the hand trajectories across directions. Using cumulative curvature (positive = counterclockwise), we obtained average values of 6.484° ± 0.841°, 1.539° ± 0.462°, and 2.819° ± 0.538° for the 45°, 90°, and 135° directions, respectively. The significantly larger curvature in the 45° condition suggests that these movements deviate more from a straight-line path, a hallmark of more elbow-dominant movements.
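
      One way to compute a signed cumulative curvature from sampled hand positions is to sum the heading changes between successive path segments. The response does not spell out its exact definition, so the function below is an assumed implementation: it returns the net turning angle in degrees, positive for counterclockwise paths, from a hypothetical list of (x, y) samples.

```python
import math

def cumulative_curvature_deg(points):
    """Net signed turning angle (degrees) along a 2D path; counterclockwise is positive."""
    total = 0.0
    prev_heading = None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0.0 and dy == 0.0:
            continue  # skip repeated samples, which have no defined heading
        heading = math.atan2(dy, dx)
        if prev_heading is not None:
            turn = heading - prev_heading
            # Wrap into [-pi, pi) so a small turn is never counted as a near-full rotation.
            turn = (turn + math.pi) % (2.0 * math.pi) - math.pi
            total += turn
        prev_heading = heading
    return math.degrees(total)

# Illustrative paths: a straight reach and a counterclockwise quarter arc.
straight = [(0.0, 0.01 * i) for i in range(13)]
arc = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in range(0, 91)]
print(cumulative_curvature_deg(straight))
print(cumulative_curvature_deg(arc))
```

      With real recordings, the positions would normally be low-pass filtered first, since sample-to-sample noise in slow portions of the movement can dominate the heading changes.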

      Importantly, this curvature pattern was present in both the pre-flight and in-flight phases, indicating that it is a general movement characteristic rather than a microgravity-induced effect. Thus, the 45° reaches are less suitable for modeling with a simplified two-link arm model compared to the other two directions. We believe this is the main reason why the model predictions based on effective mass become less consistent with the empirical data for the 45° direction.

      We have now incorporated this new analysis in the Results and discussed it in the revised Discussion.

      Citation: Haggard, P., Hutchinson, K., & Stein, J. (1995). Patterns of coordinated multi-joint movement. Experimental Brain Research, 107(2), 254-266.

      b) Additionally, since the taikonauts are tested after 2 or 3 weeks in flight, one could also assume that neuromuscular deconditioning explains (at least in part) the general decrease in movement speed. Can the authors explain how to rule out this alternative interpretation? For instance, weaker muscles could account for slower movements within a classical time-effort trade-off (as more neural effort would be needed to generate a similar amount of muscle force, thereby suggesting a purposive slowing down of movement). Therefore, could some neuromuscular deconditioning, combined with a difficulty in coordinating multi-joint movements in weightlessness (due to a misestimation of Coriolis/centripetal torques), provide an alternative explanation for the observed results (slowing down + more submovements)?

      Response (7): Neuromuscular deconditioning is indeed an effect of spaceflight; thank you for bringing this up, as we omitted discussion of this confound in our original manuscript. A prolonged stay in microgravity can reduce muscle strength, but this is mostly limited to the lower limbs. For example, a recent well-designed large-sample study has shown that while lower leg muscles showed significant strength reductions, no change in mean upper body strength was found (Scott et al., 2023), consistent with previous propositions that muscle weakness is less pronounced for upper-limb muscles than for postural and lower-limb muscles (Tesch et al., 2005). Furthermore, muscle weakness is unlikely to play a major role here since our reaching task involves small movements (~12 cm) with joint torques on the order of ~2 N·m. Of course, we cannot completely rule out a contribution of muscle weakness; we can only postulate, based on the task itself (12 cm reaching) and the systematic microgravity effects (the increase in submovements, the increase in inter-submovement intervals, and their significant prediction of movement slowing), that muscle weakness is unlikely to be a major contributor to the movement slowing.

      The reviewer suggests that poor coordination in microgravity might contribute to slowing down + more submovements. This is also a possibility, but we did not find evidence to support it. First, there is no clear evidence or reports about poor coordination for simple upper-limb movements like reaching investigated here. Note that reaching or aiming movement is one of the most studied tasks among astronauts. Second, we further analyzed our reaching trajectories and found no sign of curvature increase, a hallmark of poor coordination of Coriolis/centripetal torques, in our large collection of reaching movements. We probably have the largest dataset of reaching movements collected in microgravity thus far, given that we had 12 taikonauts and each of them performed about 480 to 840 reaching trials during their spaceflight. We believe the probability of Type II error is quite low here.

      Citation: Tesch, P. A., Berg, H. E., Bring, D., Evans, H. J., & LeBlanc, A. D. (2005). Effects of 17-day spaceflight on knee extensor muscle function and size. European journal of applied physiology, 93(4), 463-468.

      Scott, J., Feiveson, A., English, K., et al. (2023). Effects of exercise countermeasures on multisystem function in long duration spaceflight astronauts. npj Microgravity, 9(11).

      (2) Modelling

      a) The model description should be improved as it is currently a mix of discrete time and continuous time formulations. Moreover, an infinite-horizon cost function is used, but I thought the authors used a finite-horizon formulation with the prefixed duration provided by the movement utility maximization framework of Shadmehr et al. (Curr Biol, 2016). Furthermore, was the mass underestimation reflected both in the utility model and the optimal control model? If so, did the authors really compute the feedback control gain with the underestimated mass but simulate the system with the real mass? This is important because the mass appears both in the utility framework and in the LQ framework. Given the current interpretations, the feedforward command is assumed to be erroneous, and the feedback command would allow for motor corrections. Therefore, it could be clarified whether the feedback command also misestimates the mass or not, which may affect its efficiency. For instance, if both feedforward and feedback motor commands are based on wrong internal models (e.g., due to the mass underestimation), one may wonder how the astronauts would execute accurate goal-directed movements.

      b) The model seems to be deterministic in its current form (no motor and sensory noise). Since the framework developed by Todorov (2005) is used, sensorimotor noise could have been readily considered. One could also assume that motor and sensory noise increase in microgravity, and the model could inform on how microgravity affects the number of submovements or endpoint variance due to sensorimotor noise changes, for instance.

      c) Finally, how does the model distinguish the feedforward and feedback components of the motor command that are discussed in the paper, given that the model only yields a feedback control law? Does 'feedforward' refer to the motor plan here (i.e., the prefixed duration and arguably the precomputed feedback gain)?

      Response (8): We thank the reviewer for raising these important and technically insightful points regarding our modeling framework. We first clarify the structure of the model and key assumptions, and then address the specific questions in points (a)–(c) below.

      We used Todorov’s (2005) stochastic optimal control method to compute a finite-horizon LQG policy under sensory noise and signal-dependent motor noise (state noise set to zero). The cost function is given in the updated Methods. The resulting time-varying gains {L<sub>k</sub>, K<sub>k</sub>} correspond to the feedforward mapping and the feedback correction gain, respectively.

      In the control law, the control input u<sub>k</sub> combines two terms: L<sub>k</sub>, the feedforward (nominal) control associated with the planned trajectory, acts on the nominal planned state, and K<sub>k</sub>, the time-varying feedback gain, corrects deviations of the estimated state from the plan.
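
      For reference, a cost function and control law of the kind described here can be written as follows. This is only a sketch under assumed notation: the symbols for the nominal planned state, the estimated state, the terminal weight Q<sub>N</sub>, the horizon N, and all sign conventions are assumptions and may differ from the equations in the revised manuscript. The terminal-cost, discrete-time, finite-horizon form follows the description given in answer (a) of this response.

```latex
% Assumed notation: \bar{x}_k = nominal planned state, \hat{x}_k = estimated state.
% Signs follow the common u = -L x convention and are an assumption.
J = \mathbb{E}\left[ x_N^{\top} Q_N x_N
      + \sum_{k=0}^{N-1} \left( x_k^{\top} Q x_k + u_k^{\top} R u_k \right) \right]

u_k = -L_k \bar{x}_k - K_k \left( \hat{x}_k - \bar{x}_k \right)

% With feedback disabled (\hat{x}_k = \bar{x}_k), only the nominal term remains:
u_k = -L_k \bar{x}_k
```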

      To define the motor plan for comparison with behavior, we simulate the deterministic open-loop trajectory by turning off noise and disabling feedback corrections. In this framework, “feedforward” refers to this nominal motor plan. Thus, sensory and signal-dependent noise influence the computed policy (via the gains), but are not injected when generating the nominal trajectory. This mirrors the minimum-jerk practice used to obtain nominal kinematics in prior utility-based work (Shadmehr et al., 2016), while optimal control provides a more physiologically grounded nominal plan. In the revision, we have updated the equations, provided more modeling details, and moved the model description to the main text to reduce possible confusion.

      In the implementation of the “mass underestimation” condition, the mass used to compute the policy is the underestimated mass, whereas the actual mass is used when simulating the feedforward trajectories. Corrective submovements are analyzed separately and are not required for the planning-deficit findings reported here.
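
      The idea of planning with one mass and executing with another can be illustrated with a much simpler stand-in for the authors' controller. The sketch below assumes a one-dimensional point mass and a minimum-jerk nominal plan (not the optimal control plan used in the manuscript); the 30% underestimation, the 12 cm distance, and the 350 ms duration are illustrative values. It shows only the amplitude consequence of the mismatch: the open-loop force is too small, so peak speed drops and the hand undershoots. It does not attempt to reproduce the timing shifts reported in the response.

```python
def min_jerk_accel(t, duration, distance):
    """Acceleration (m/s^2) of a minimum-jerk reach at time t."""
    s = t / duration
    return distance / duration**2 * (60.0 * s - 180.0 * s**2 + 120.0 * s**3)

def simulate_open_loop(true_mass, believed_mass,
                       distance=0.12, duration=0.35, dt=0.001):
    """Apply forces planned for `believed_mass` to a hand of `true_mass`.

    All default values are hypothetical. Returns (final position in m,
    peak speed in m/s).
    """
    x, v, peak_speed = 0.0, 0.0, 0.0
    for k in range(int(round(duration / dt))):
        # Planned force uses the internal (possibly wrong) mass estimate.
        force = believed_mass * min_jerk_accel(k * dt, duration, distance)
        v += force / true_mass * dt        # the real limb responds with its true mass
        x += v * dt
        peak_speed = max(peak_speed, abs(v))
    return x, peak_speed

accurate = simulate_open_loop(true_mass=1.0, believed_mass=1.0)
underestimated = simulate_open_loop(true_mass=1.0, believed_mass=0.7)
print("accurate estimate:", accurate)
print("underestimated mass:", underestimated)
```

      The residual distance left by the underestimated plan is the kind of error that later corrective submovements would have to cover.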

      Answers of the three specific questions:

      a) We mistakenly wrote a continuous-time infinite-horizon cost function in our original manuscript, whereas our controller is actually implemented as a discrete-time finite-horizon LQG with a terminal cost, over a horizon set by the utility-based optimal movement duration T<sub>opt</sub>. The underestimated mass is used in both the utility model (to determine T<sub>opt</sub>) and in the control computation (i.e., internal model), while the true mass is used when simulating the movement. This mismatch captures the central idea of feedforward planning based on an incorrect internal model.

      b) As described, our model includes signal-dependent motor noise and sensory noise, following Todorov (2005). We also evaluated whether increased noise levels in microgravity could account for the observed behavioral changes. Simulation results showed that increasing either source of noise did not alter the main conclusions or reverse the trends in our key metrics. Moreover, our experimental data showed no significant increase in endpoint variability in microgravity (see analyses and results in Figure 2—figure supplement 3 & 4), making it unlikely that increased sensorimotor noise alone accounts for the observed slowing and submovement changes.

      c) In our framework, the time-varying gains {L<sub>k</sub>, K<sub>k</sub>} define the feedforward and feedback components of the control policy. While both gains are computed based on a stochastic optimal control formulation (including noise), for comparison with behavior we simulate only the nominal feedforward plan, by turning off both noise and feedback. This defines a deterministic open-loop trajectory, which we use to capture planning-level effects such as peak timing shifts under mass underestimation. Feedback corrections via gains exist in the full model but are not involved in these specific analyses. We clarified this modeling choice and its behavioral relevance in the revised text.

      We have updated the equations and moved the model description into the main text in the revised manuscript to avoid confusion.

      (3) Brevity of movements and speed-accuracy trade-off

      The tested movements are much faster (average duration approx. 350 ms) than similar self-paced movements that have been studied in other works (e.g., Wang et al., J Neurophysiology, 2016; Berret et al., PLOS Comp Biol, 2021, where movements can last about 900-1000 ms). This is consistent with the instructions to reach quickly and accurately, in line with a speed-accuracy trade-off. Was this instruction given to highlight the inertial effects related to the arm's anisotropy? One may, however, wonder if the same results would hold for slower self-paced movements (are they also with reduced speed compared to Earth performance?). Moreover, a few other important questions might need to be addressed for completeness: how to ensure that astronauts did remember this instruction during the flight? (could the control group move faster because they better remembered the instruction?). Did the taikonauts perform the experiment on their own during the flight, or did one taikonaut assume the role of the experimenter?

      Response (9): Thanks for highlighting the brevity of movements in our experiment. Our intention in emphasizing fast movements is to rigorously test whether movement is indeed slowed down in microgravity. The observed prolonged movement duration clearly shows that microgravity affects people’s movement duration, even when they are pushed to move fast. The second reason for using fast movement is to highlight that feedforward control is affected in microgravity. Mass underestimation specifically affects feedforward control in the first place, shown by the microgravity-related changes in peak velocity/acceleration. Slow movement would inevitably have online corrections that might obscure the effect of mass underestimation. Note that movement slowing is not only observed in our speed-emphasized reaching task, but also in whole-arm pointing in other astronauts’ studies (Berger, 1997; Sangals, 1999), which have been quoted in our paper. We thus believe these findings are generalizable.

      Regarding the consistency of instructions: all our experiments conducted in the Tiangong space station were monitored in real time by experimenters in the control center located in Beijing. The task instructions were presented on the initial display of the data acquisition application and ample reading time was allowed. All the pre-, in-, and post-flight test sessions were administered by the same group of personnel with the same instructions. It is common for astronauts to serve as both participants and experimenters at the same time, and they were well trained for this type of role on the ground. Note that we had multiple pre-flight test sessions to familiarize them with the task. All these rigorous measures were in place to obtain high-quality data. In the revision, we included these experimental details for readers who are not familiar with space studies, and provided the rationale for emphasizing fast movements.

      Citations:

      Berger, M., Mescheriakov, S., Molokanova, E., Lechner-Steinleitner, S., Seguer, N., & Kozlovskaya, I. (1997). Pointing arm movements in short- and long-term spaceflights. Aviation, Space, and Environmental Medicine, 68(9), 781–787.

      Sangals, J., Heuer, H., Manzey, D., & Lorenz, B. (1999). Changed visuomotor transformations during and after prolonged microgravity. Experimental Brain Research, 129(3), 378–390.

      (4) No learning effect

      This is a surprising effect, as mentioned by the authors. Other studies conducted in microgravity have indeed revealed an optimal adaptation of motor patterns in a few dozen trials (e.g., Gaveau et al., eLife, 2016). Perhaps the difference is again related to single-joint versus multi-joint movements. This should be better discussed given the impact of this claim. Typically, why would a "sensory bias of bodily property" persist in microgravity and be a "fundamental constraint of the sensorimotor system"?

      Response (10): We believe that the presence or absence of adaptation between our study and Gaveau et al.’s study cannot be simply attributed to single-joint versus multi-joint movements. Their adaptation concerned incorporating microgravity into movement control to minimize effort, whereas ours concerned accurately perceiving body mass. Gaveau et al.’s task involved large-amplitude vertical reaching, a scenario in which gravity strongly affects joint torques and movement execution. Thus, adaptation to microgravity can lead to better execution, providing a strong incentive for learning. By contrast, our task consisted of small-amplitude horizontal movements, where the gravitational influence on biomechanics is minimal.

      More importantly, we believe the lack of adaptation for mass underestimation is not totally surprising. When an inertial change is perceived (such as an extra weight attached to the forearm, as in previous motor adaptation studies), people can adapt their reaching within tens of trials. In that case, sensory cues are veridical, as they correctly signal the inertial perturbation. However, in microgravity, reduced gravitational pull and proprioceptive inputs constantly inform the controller that the body mass is less than its actual magnitude. In other words, sensory cues in space are misleading for estimating body mass. The resulting sensory bias prevents the sensorimotor system from adapting. Our initial explanation on this matter was too brief; we expanded it in the revised Discussion.

      Reviewer #3 (Public review):

      Summary:

      The authors describe an interesting study of arm movements carried out in weightlessness after a prolonged exposure to the so-called microgravity conditions of orbital spaceflight. Subjects performed radial point-to-point motions of the fingertip on a touch pad. The authors note a reduction in movement speed in weightlessness, which they hypothesize could be due to either an overall strategy of lowering movement speed to better accommodate the instability of the body in weightlessness or an underestimation of body mass. They conclude for the latter, mainly based on two effects. One, slowing in weightlessness is greater for movement directions with higher effective mass at the end effector of the arm. Two, they present evidence for an increased number of corrective submovements in weightlessness. They contend that this provides conclusive evidence to accept the hypothesis of an underestimation of body mass.

      Strengths:

      In my opinion, the study provides a valuable contribution, the theoretical aspects are well presented through simulations, the statistical analyses are meticulous, the applicable literature is comprehensively considered and cited, and the manuscript is well written.

      Weaknesses:

      Nevertheless, I am of the opinion that the interpretation of the observations leaves room for other possible explanations of the observed phenomenon, thus weakening the strength of the arguments.

      First, I would like to point out an apparent (at least to me) divergence between the predictions and the observed data. Figures 1 and S1 show that the difference between predicted values for the 3 movement directions is almost linear, with predictions for 90º midway between predictions for 45º and 135º. The effective mass at 90º appears to be much closer to that of 45º than to that of 135º (Figure S1A). But the data shown in Figure 2 and Figure 3 indicate that movements at 90º and 135º are grouped together in terms of reaction time, movement duration, and peak acceleration, while both differ significantly from those values for movements at 45º.

      Furthermore, in Figure 4, the change in peak acceleration time and relative time to peak acceleration between 1g and 0g appears to be greater for 90º than for 135º, which appears to me to be at least superficially in contradiction with the predictions from Figure S1. If the effective mass is the key parameter, wouldn't one expect as much difference between 90º and 135º as between 90º and 45º? It is true that peak speed (Figure 3B) and peak speed time (Figure 4B) appear to follow the ordering according to effective mass, but is there a mathematical explanation as to why the ordering is respected for velocity but not acceleration? These inconsistencies weaken the authors' conclusions and should be addressed.

      Response (11): Indeed, the model predicts an almost equal separation between 45° and 90° and between 90° and 135°, while the data indicate that the spacing between 45° and 90° is much smaller than between 90° and 135°. We do not regard this divergence as evidence undermining our main conclusion, for two reasons. First, the model is a simplification of the actual situation: it simulates the ideal case of moving a point mass (the effective mass) without friction and without considering Coriolis and centripetal torques. Second, our study does not make quantitative predictions of all the key kinematic measures; that would require model fitting, parameter estimation, and posture-constrained reaching experiments. Instead, our study uses well-established (though simplified) models to qualitatively predict the overall behavioral pattern we would observe. For this purpose, our results are well in line with our expectations: though we did not find equal spacing between direction conditions, we do confirm that the key kinematic measures (Figures 2 and 3, as questioned) show consistent directional trends between model predictions and empirical data. We added new analysis results on this matter: the directional effect we observed (how the key measures changed in microgravity across direction conditions) is significantly correlated with our model predictions in most cases. Please see our detailed Response (2) above. These results have also been added in the revision.

      We also highlight in the revision that our modeling is not intended to quantitatively predict reaching behaviors in space, but to show qualitatively how mass underestimation, as opposed to the conservative control strategy, leads to distinct predictions about key kinematic measures of fast reaching.

      Then, to strengthen the conclusions, I feel that the following points would need to be addressed:

      (1) The authors model the movement control through equations that derive the input control variable in terms of the force acting on the hand and treat the arm as a second-order low-pass filter (Equation 13). Underestimation of the mass in the computation of a feedforward command would lead to a lower-than-expected displacement to that command. But it is not clear if and how the authors account for a potential modification of the time constants of the 2nd order system. The CNS does not effectuate movements with pure torque generators. Muscles have elastic properties that depend on their tonic excitation level, reflex feedback, and other parameters. Indeed, Fisk et al. showed variations of movement characteristics consistent with lower muscle tone, lower bandwidth, and lower damping ratio in 0g compared to 1g. Could the variations in the response to the initial feedforward command be explained by a misrepresentation of the limbs' damping and natural frequency, leading to greater uncertainty about the consequences of the initial command? This would still be an argument for unadapted feedforward control of the movement, leading to the need for more corrective movements. But it would not necessarily reflect an underestimation of body mass.

      Fisk, J., Lackner, J. R., & DiZio, P. (1993). Gravitoinertial force level influences arm movement control. Journal of Neurophysiology, 69(2), 504–511.

      Response (12): We agree that muscle properties, tonic excitation level, and proprioception-mediated reflexes all contribute to reaching control. The Fisk et al. (1993) study indeed showed that arm movement kinematics change, possibly owing to lower muscle tone and/or damping. However, reduced muscle damping and reduced spindle activity are more likely to affect feedback-based movements. In Fisk et al.’s study, people performed continuous arm movements with eyes closed; thus their movements largely relied on proprioceptive control. Our major findings concern feedforward control, i.e., the reduced and “advanced” peak velocity/acceleration in discrete, ballistic reaching movements. Note that the peak acceleration occurs as early as approximately 90-100 ms into the movement, clearly showing that feedforward control is affected, a different effect from Fisk et al.’s findings. It is unlikely that people “advanced” their peak velocity/acceleration because they anticipated the need for more corrective movements later. Thus, underestimation of body mass remains the most plausible explanation.

      (2) The movements were measured by having the subjects slide their finger on the surface of a touch screen. In weightlessness, the implications of this contact are expected to be quite different than those on the ground. In weightlessness, the taikonauts would need to actively press downward to maintain contact with the screen, while on Earth, gravity will do the work. The tangential forces that resist movement due to friction might therefore be different in 0g. This could be particularly relevant given that the effect of friction would interact with the limb in a direction-dependent fashion, given the anisotropy of the equivalent mass at the fingertip evoked by the authors. Is there some way to discount or control for these potential effects?

      Response (13): We agree that friction might play a role here, but normal interaction with a touch screen typically involves friction forces between 0.1 N and 0.5 N (e.g., Ayyildiz et al., 2018). We believe that the directional variation of the friction is even smaller than 0.1 N, which is very small compared to the force used to accelerate the arm for the reaching movement (10-15 N). Thus, friction anisotropy is unlikely to explain our data. Because our readers might have the same concern, we have added some discussion of the possible effect of friction.
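
      The order-of-magnitude argument in Response (13) can be written out as a short calculation. This is only an illustrative sketch: the force values are the ones quoted in the response, while the variable names and the worst-case pairing (largest friction term against the smallest accelerating force) are assumptions made here.

```python
# Values quoted in Response (13); names are illustrative, not from the manuscript.
friction_range_n = (0.1, 0.5)        # typical finger-touchscreen friction force (N)
directional_variation_n = 0.1        # stated upper bound on direction-dependent friction (N)
accel_force_range_n = (10.0, 15.0)   # force used to accelerate the arm (N)


def worst_case_ratio(small_force_n, large_force_range_n):
    """Ratio of a small force to the smallest force it competes with."""
    return small_force_n / min(large_force_range_n)


anisotropy_share = worst_case_ratio(directional_variation_n, accel_force_range_n)
total_friction_share = worst_case_ratio(max(friction_range_n), accel_force_range_n)

print(f"directional friction / accelerating force: at most {anisotropy_share:.1%}")
print(f"total friction / accelerating force:       at most {total_friction_share:.1%}")
```

      Even under this pessimistic pairing, the direction-dependent part of friction stays at roughly one percent of the accelerating force, which is the sense in which the response treats it as negligible.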

      Citation: Ayyildiz M, Scaraggi M, Sirin O, Basdogan C, Persson BNJ. Contact mechanics between the human finger and a touchscreen under electroadhesion. Proc Natl Acad Sci U S A. 2018 Dec 11;115(50):12668-12673.

      (3) The carefully crafted modelling of the limb neglects, nevertheless, the potential instability of the base of the arm. While the taikonauts were able to use their left arm to stabilize their bodies, it is not clear to what extent active stabilization with the contralateral limb can reproduce the stability of the human body seated in a chair in Earth gravity. Unintended motion of the shoulder could account for a smaller-than-expected displacement of the hand in response to the initial feedforward command and/or greater propensity for errors (with a greater need for corrective submovements) in 0g. The direction of movement with respect to the anchoring point could lead to the dependence of the observed effects on movement direction. Could this be tested in some way, e.g., by testing subjects on the ground while standing on an unstable base of support or sitting on a swing, with the same requirement to stabilize the torso using the contralateral arm?

      Response (14): Body stabilization is always a challenge for human movement studies in space. We minimized its potential confounding effects by using left-hand grasping and foot straps for postural support throughout the experiment. We think shoulder instability is an unlikely explanation because unexpected shoulder instability should not affect the feedforward (early) part of the ballistic reaching movement: the reduced peak acceleration and its early peak were observed at about 90-100 ms after movement initiation. This effect is too early to be explained by an unexpected stability issue. This argument is now mentioned in the revised Discussion.

      The arguments for an underestimation of body mass would be strengthened if the authors could address these points in some way.

      Recommendations for the authors:

      Reviewing Editor Comments:

      General recommendation

      Overall, the reviewers agreed this is an interesting study with an original and strong approach. Nonetheless, there were significant weaknesses identified. The main criticism is that there is insufficient evidence for the claim that the movement slowing is due to mass underestimation, rather than other explanations for the increased feedback corrections. To bolster this claim, the reviewers have requested a deeper quantitative analysis of the directional effect and comparison to model predictions. They have also suggested that a 2-dof arm model could be used to predict how mass underestimation would influence multi-joint kinematics, and this should be compared to the data. Alternatively, or additionally, a control experiment could be performed (described in the reviews). We do realize that some of these options may not be feasible or practical. Ultimately, we leave it to you to determine how best to strengthen and solidify the argument for mass underestimation, rather than other causes.

      As an alternative approach, you could consider tempering the claim regarding mass underestimation and focus more on the result that slower movements in microgravity are not simply a feedforward, rescaling of the movement trajectories, but rather, have greater feedback corrections. In this case, the reviewers feel it would still be critical to explain and discuss potential reasons for the corrections beyond mass underestimation.

      We hope that these points are addressable, either with new analyses, experiments, or with a tempering of the claims. Addressing these points would help improve the eLife assessment.

      Reviewer #1 (Recommendations for the authors):

      (1) Move model descriptions to the main text to present modelling choices in more detail

      Response (15): Thank you for the suggestion. We have moved the model descriptions to the main text to present the modeling choices in more detail and to allow readers to better cross-reference the analyses.

      (2) Perform quantitative comparisons of the directional effect with the model's predictions, and add raw kinematic traces to illustrate the effect in more detail.

      Response (16): Thanks for the suggestion. We have added a figure showing raw kinematics from a representative participant; please refer to Response (2) above for the comparisons of the directional effect.

      (3) Explore the effect of varying cost parameters in addition to mass estimation error to estimate the proportion of data explained by the underestimation hypothesis.

      Response (17): Thank you for the suggestion. This has already been done—please see Response (1) above.

      Reviewer #2 (Recommendations for the authors):

      Minor comments:

      (1) It must be justified early on why reaction times are being analyzed in this work. I understood later that it is to rule out any global slowing down of behavioral responses in microgravity.

      Response (18): Exactly, the RT results are informative about the absence of a global slowing down. Contrary to the conservative-strategy hypothesis, taikonauts did not show generalized slowing; they actually had faster reaction times during spaceflight, which is incompatible with a generalized slowing strategy. Thanks for pointing this out; we now justify the RT analysis early in the text.

      (2) Since the results are presented before the methods, I suggest stressing from the beginning that the reaching task is performed on a tablet and mentioning the instructions given to the participants, to improve the reading experience. The "beep" and "no beep" conditions also arise without obvious justification while reading the paper.

      Response (19): Great suggestions. We now provide some experimental details and rationales at the beginning of the Results.

      (3) Figure 1C: The vel profiles are not returning to 0 at the end, why? Is it because the feedback gain is computed based on the underestimated mass or because a feedforward controller is applied here? Is it compatible with the experimental velocity traces?

      Response (20): Figure 1C shows the forward simulation under the optimal control policy. In our LQG formulation the terminal velocity is softly penalized (finite weight) rather than hard-constrained to zero; with a fixed horizon the optimal solution can therefore end with a small residual velocity.

      In the behavioral data, the hand does come to rest: this is achieved by corrective submovements during the homing phase.
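
      The effect of a soft terminal-velocity penalty described in Response (20) can be illustrated with a much simpler problem than the authors' model. The sketch below is a deterministic finite-horizon LQ regulator for a frictionless point mass, with no noise and no muscle dynamics; the mass, horizon, reach distance, and cost weights are hypothetical values chosen for illustration, and the function names are not from the manuscript. The control takes the feedback form u = -K x, with the gains obtained from a backward Riccati recursion.

```python
def matmul(a, b):
    """Product of two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def simulate_reach(w_vel, w_pos=1e3, r=1e-4, mass=2.0, dt=0.01, steps=40, distance=0.12):
    """Return (final position error, final velocity) under the optimal LQ policy.

    State x = [position error, velocity]; all parameter values are hypothetical.
    """
    A = [[1.0, dt], [0.0, 1.0]]
    B = [[0.0], [dt / mass]]
    S = [[w_pos, 0.0], [0.0, w_vel]]      # terminal cost: finite weights, no hard constraint
    gains = []
    for _ in range(steps):                # backward Riccati recursion (no running state cost)
        BtS = matmul(transpose(B), S)
        denom = r + matmul(BtS, B)[0][0]
        K = [[v / denom for v in matmul(BtS, A)[0]]]
        BK = matmul(B, K)
        A_cl = [[A[i][j] - BK[i][j] for j in range(2)] for i in range(2)]
        S = matmul(matmul(transpose(A), S), A_cl)
        gains.append(K)
    gains.reverse()                       # gains[k] is the gain applied at time step k

    x = [[-distance], [0.0]]              # start at rest, one reach distance from the target
    for K in gains:
        u = -matmul(K, x)[0][0]           # feedback control law u = -K x
        x = [[x[0][0] + dt * x[1][0]], [x[1][0] + dt * u / mass]]
    return x[0][0], x[1][0]


p_soft, v_soft = simulate_reach(w_vel=1e-3)   # weak penalty on terminal velocity
p_stiff, v_stiff = simulate_reach(w_vel=1.0)  # stronger, but still finite, penalty
print(f"weak penalty:   end velocity {v_soft:.4f} m/s, position error {p_soft:.2e} m")
print(f"strong penalty: end velocity {v_stiff:.4f} m/s, position error {p_stiff:.2e} m")
```

      In this toy problem the end-point velocity shrinks as the terminal weight grows but does not reach exactly zero for any finite weight, which is the qualitative behavior the response attributes to the simulated traces in Figure 1C.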

      (4) Left-skewed -> I believe this is right-skewed since the peak velocity is earlier.

      Response (21): Yes, it should be right-skewed; thanks for pointing that out.

      (5) What was the acquisition frequency of the positional data points? (on the tablet).

      Response (22): The sampling frequency is 100 Hz. Thanks for pointing that out; we’ve added this information to the Methods.

      (6) Figure S1. The planned duration seems to be longer than in the experiment (it is more around 500 ms for the 135-degree direction in simulation versus less than 400 ms in the experiment). Why?

      Response (23): We apologize for a coding error that inadvertently multiplied the body-mass parameter by an extra factor, making the simulated mass too high. We have corrected the code, rerun the simulations, and updated Figures 1 and S1; all qualitative trends remain unchanged, and the revised movement durations (≈300–400 ms) are closer to the experimental values.

      (7) After Equation 13: "The control law is given by". This is not the control law, which should have a feedback form u=K*x in the LQ framework. This is just the dynamic equations for the auxiliary state and the force. Please double-check the model description.

      Response (24): Thank you for pointing this out. We have updated and refined all model equations and descriptions, and moved the model description from the Supplementary Materials to the main text; please see the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      (1) I have a concern about the interpretation of the anisotropic "equivalent mass". From my understanding, the equivalent mass would be what an external actor would feel as an equivalent inertia if pushing on the end effector from the outside. But the CNS does not push on the arm with a pure force generator acting at the hand to effectuate movement. It applies torque around the joints by applying forces across joints with muscles, causing the links of the arm to rotate around the joints. If the analysis is carried out in joint space, is the effective rotational inertia of the arm also anisotropic with respect to the direction of the movement of the hand? In other words, can the authors reassure me that the simulations are equivalent to an underestimation of the rotational inertia of the links when applied to the joints of the limb? It could be that these are mathematically the same; I have not delved into the mathematics to convince myself either way. But I would appreciate it if the authors could reassure me on this point.

      Response (25): Thank you for raising this point. In our work, “equivalent mass” denotes the operational-space inertia projected along the hand-movement direction u, computed as:

      This formulation describes the effective mass perceived at the end effector along a given direction, and is standard in operational-space control.

      Although the motor command can be coded as either torque or force in the CNS, the actual execution is equivalent whether it is specified as endpoint forces or as joint torques, since force and torque are related by τ = J^T F, where J is the arm Jacobian. For small excursions as investigated here, this makes the directional anisotropy in endpoint inertia consistent with the anisotropy of the effective joint-space inertia required to produce the same endpoint motion. Conceptually, therefore, our “mass underestimation” manipulation in operational space corresponds to underestimating the required joint-space inertia mapped through the Jacobian. Since our behavioral data are hand positions, using the operational-space representation is the most direct and appropriate way of modeling.
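
      The equivalence argued in Response (25) can be checked numerically on a toy arm. The sketch below assumes one common operational-space definition of effective mass along a unit direction u, namely 1 / (u^T J M^-1 J^T u); whether this matches the exact expression in the manuscript is an assumption, since the equation is not reproduced in the response above. The planar two-link arm, its link lengths and masses, and the posture are all hypothetical. The endpoint force is deliberately routed through joint torques (τ = J^T F) before the hand acceleration is computed, so the effective mass is obtained from joint-space quantities only.

```python
import math


def arm_matrices(q1, q2, l1=0.30, l2=0.35, m1=2.0, m2=1.5):
    """Jacobian J and joint-space inertia M of a planar two-link arm.

    Link lengths (m) and masses (kg) are hypothetical; links are uniform rods.
    """
    r1, r2 = l1 / 2, l2 / 2                       # centre-of-mass distances
    i1, i2 = m1 * l1**2 / 12, m2 * l2**2 / 12     # rod inertias about the centre of mass
    c2 = math.cos(q2)
    m11 = i1 + i2 + m1 * r1**2 + m2 * (l1**2 + r2**2 + 2 * l1 * r2 * c2)
    m12 = i2 + m2 * (r2**2 + l1 * r2 * c2)
    m22 = i2 + m2 * r2**2
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    J = ((-l1 * s1 - l2 * s12, -l2 * s12),
         (l1 * c1 + l2 * c12, l2 * c12))
    M = ((m11, m12), (m12, m22))
    return J, M


def hand_acceleration(J, M, force):
    """Hand acceleration from rest when an endpoint force is produced via joint torques."""
    fx, fy = force
    tau = (J[0][0] * fx + J[1][0] * fy, J[0][1] * fx + J[1][1] * fy)   # tau = J^T F
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    qdd = ((M[1][1] * tau[0] - M[0][1] * tau[1]) / det,                # qdd = M^-1 tau
           (M[0][0] * tau[1] - M[1][0] * tau[0]) / det)
    return (J[0][0] * qdd[0] + J[0][1] * qdd[1],                       # a = J qdd (at rest)
            J[1][0] * qdd[0] + J[1][1] * qdd[1])


def effective_mass(J, M, direction_deg):
    """Force per unit hand acceleration along the unit direction u."""
    angle = math.radians(direction_deg)
    u = (math.cos(angle), math.sin(angle))
    ax, ay = hand_acceleration(J, M, u)           # response to a unit force along u
    return 1.0 / (u[0] * ax + u[1] * ay)


J, M = arm_matrices(math.radians(45), math.radians(90))   # hypothetical posture
masses = {d: effective_mass(J, M, d) for d in (45, 90, 135)}
for direction, mass in masses.items():
    print(f"{direction:>3} deg: effective mass {mass:.2f} kg")
```

      The ordering and spacing of the three values depend entirely on the assumed posture and limb parameters, so they should not be read as a reproduction of Figure S1; the point is only that the directional anisotropy appears whether the computation is phrased in endpoint or joint coordinates.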

      (2) I would also like to suggest one more level of analysis to test their hypothesis. The authors decomposed the movements into submovements and measured the prevalence of corrective submovements in weightlessness vs. normal gravity. The increase in corrective submovements is consistent with the hypothesis of a misestimation of limb mass, leading to an unexpectedly smaller displacement due to the initial feedforward command, leading to the need for corrections, leading to an increased overall movement duration. According to this hypothesis, however, the initial submovement, while resulting in a smaller than expected displacement, should have the same duration as the analogous movements performed on Earth. The authors could check this by analyzing the duration of the extracted initial submovements.

      Response (26): We appreciate the reviewer’s suggestion regarding the analysis of the initial submovement duration. In our decomposition framework, each submovement is modeled as a symmetric log-normal (bell-shaped) component, such that the time to peak speed is always half of the component duration. Thus, the initial submovement duration is directly reflected in the initial submovement peak-speed time already reported in our original manuscript (Figure 5F).

      However, we respectfully disagree with the assumption that mass underestimation would necessarily yield the same submovement duration as on Earth. Under mass underestimation, the movement is effectively under-actuated, and the initial submovement can terminate prematurely, leading to a shorter duration. This is indeed what we observed in the data. Therefore, our reported metrics already address the reviewer’s proposal and support the conclusion that mass underestimation reduces the initial submovement duration in microgravity. Per your suggestion, we have now added a sentence explaining to the reader that the initial submovement peak-speed time reflects the duration of the initial submovement.

      Some additional minor suggestions:

      (1) I believe that it is important to include the data from the control subjects, in some form, in the main article. Perhaps shading behind the main data from the taikonauts to show similarities or differences between groups. It is inconvenient to have to go to the supplementary material to compare the two groups, which is the main test of the experiment.

      Response (27): Thank you for the suggestion. For all the core performance variables, the control group showed flat patterns, with no changes across test sessions at all. Thus, including these figures (together with null statistical results) in the main text would obscure our central message, especially given the expanded length of the revised manuscript (we added model details and new analysis results). Instead, following eLife’s format, we have reorganized the Supplementary Material so that each experimental figure has a corresponding supplementary figure showing the control data. This way, readers can quickly locate the control results and directly compare them with the experimental data, while keeping the main text focused.

      (2) "Importantly, sensory estimate of bodily property in microgravity is biased but evaded from sensorimotor adaptation, calling for an extension of existing theories of motor learning." Perhaps "immune from" would be a better choice of words.

      Response (28): Thanks for the suggestion; we edited our text accordingly.

      (3) "First, typical reaching movement exhibits a symmetrical bell-shaped speed profile, which minimizes energy expenditure while maximizing accuracy according to optimal control principles (Todorov, 2004)." While Todorov's analysis is interesting and well accepted, it might be worthwhile citing the original source on the phenomenon of bell-shaped velocity profiles that minimize jerk (derivative of acceleration) and therefore, in some sense, maximize smoothness. Flash and Hogan, 1985.

      Response (29): Thanks for the suggestion; we added the citation for the minimum-jerk model (Flash and Hogan, 1985).
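
      For readers unfamiliar with the minimum-jerk model mentioned in this exchange, the sketch below evaluates its speed profile for a straight point-to-point reach, using the standard fifth-order polynomial of Flash and Hogan (1985). The reach distance and duration are illustrative values, not taken from the study, and the function name is hypothetical.

```python
def minimum_jerk_speed(t, distance, duration):
    """Speed at time t of a minimum-jerk reach.

    Derivative of x(s) = distance * (10 s^3 - 15 s^4 + 6 s^5), with s = t / duration.
    """
    s = t / duration                                   # normalised time in [0, 1]
    return (distance / duration) * 30.0 * s**2 * (1.0 - s) ** 2


distance_m, duration_s, n = 0.12, 0.4, 100             # illustrative values only
times = [duration_s * k / n for k in range(n + 1)]
speeds = [minimum_jerk_speed(t, distance_m, duration_s) for t in times]

peak_index = max(range(len(speeds)), key=speeds.__getitem__)
print(f"peak speed {speeds[peak_index]:.3f} m/s "
      f"at {times[peak_index] / duration_s:.0%} of the movement")
```

      The profile is symmetric about the midpoint, and its peak equals 1.875 times the mean speed (distance divided by duration). A peak that occurs earlier than the midpoint, as discussed in the responses above, is therefore a departure from this idealized bell shape.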

      (4) "Post-hoc analyses revealed slower reaction times for the 45° direction compared to both 90° (p < 0.001, d = 0.293) and 135° (p = 0.003, d = 0.284). Notably, reactions were faster during the in-flight phase compared to pre-flight (p = 0.037, d = 0.333), with no significant difference between in-flight and post-flight phases (p = 0.127)." What can one conclude from this?

      Response (30): Although these decreases reached statistical significance, their magnitudes were small. The parallel pattern across groups suggests the effect is not driven by microgravity, but is more plausibly a mild learning/practice effect. We now mention this in the Discussion.

      (5) "In line with predictions, peak acceleration appeared significantly earlier in the 45° direction than other directions (45° vs. 90°, p < 0.001, d = 0.304; 45° vs. 135°, p < 0.001, d = 0.271)." Which predictions? Because the effective mass is greater at 45º? Could you clarify the prediction?

      Response (31): We should have been more specific here; thank you for raising this. The predictions are the ones about peak acceleration timing (shown in Figure 1H). We have modified this sentence to read:

      “In line with model predictions (Figure 1H), ….”.

      (6) Figure 2: Why do 45º movements have longer reaction times but shorter movement durations?

      Response (32): We appreciate your careful reading of the results. We believe this is possibly due to flexible motor control across conditions and trials, i.e., people tend to move faster when they react more slowly. This is reflected in the across-direction comparisons (as spotted by the reviewer here), and it also holds within and across participants: for both groups, we found a significant negative correlation between movement duration (MD) and reaction time (RT), both across and within individuals (Figure 2—figure supplement 5). This finding indicates that participants moved faster when their RT was slower, and vice versa. This flexible motor adjustment, likely due to the task requirement for rapid movements, remained consistent during spaceflight.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors present a novel usage of fluorescence lifetime imaging microscopy (FLIM) to measure NAD(P)H autofluorescence in the Drosophila brain, as a proxy for cellular metabolic/redox states. This new method relies on the fact that both NADH and NADPH are autofluorescent, with a different excitation lifetime depending on whether they are free (indicating glycolysis) or protein-bound (indicating oxidative phosphorylation). The authors successfully use this method in Drosophila to measure changes in metabolic activity across different areas of the fly brain, with a particular focus on the main center for associative memory: the mushroom body.

      Strengths:

      The authors have made a commendable effort to explain the technical aspects of the method in accessible language. This clarity will benefit both non-experts seeking to understand the methodology and researchers interested in applying FLIM to Drosophila in other contexts.

      Weaknesses:

      (1) Despite being statistically significant, the learning-induced change in f-free in α/β Kenyon cells is minimal (a decrease from 0.76 to 0.73, with a high variability). The authors should provide justification for why they believe this small effect represents a meaningful shift in neuronal metabolic state.

      We agree with the reviewer that the observed f_free shift averaged per individual, while statistically significant, is small. However, to our knowledge, this is the first study to investigate a physiological (i.e., not pharmacologically induced) variation in neuronal metabolism using FLIM. As such, there are no established expectations regarding the amplitude of the effect. In the revised manuscript, we have included an additional experiment involving the knockdown of ALAT in α/β Kenyon cells, which further supports our findings. We have also expanded the discussion to present two potential reasons why this effect may appear modest.

      (2) The lack of experiments examining the effects of long-term memory (after spaced or massed conditioning) seems like a missed opportunity. Such experiments could likely reveal more drastic changes in the metabolic profiles of KCs, as a consequence of memory consolidation processes.

      We agree with the reviewer that investigating the effects of long-term memory on metabolism represents a valuable future path of investigation. An intrinsic caveat of autofluorescence measurement, however, is the difficulty of identifying the cellular origin of the observed changes. In this respect, long-term memory formation is not an ideal case study, as its essential feature is expected to be a metabolic activation localized to Kenyon cells’ axons in the mushroom body vertical lobes (as shown in Comyn et al., 2024), where many different neuron subtypes send intricate processes. This is why we chose to first focus on middle-term memory, where changes at the level of the cell bodies could be expected from our previous work (Rabah et al., 2022). Our pioneering exploration of the applicability of NAD(P)H FLIM to in vivo monitoring of brain metabolism now paves the way to extending it to the effects of other forms of memory.

      (3) The discussion is mostly just a summary of the findings. It would be useful if the authors could discuss potential future applications of their method and new research questions that it could help address.

      The discussion has been expanded by adding interpretations of the findings and remaining challenges.

      Reviewer #2 (Public review):

      This manuscript presents a compelling application of NAD(P)H fluorescence lifetime imaging (FLIM) to study metabolic activity in the Drosophila brain. The authors reveal regional differences in oxidative and glycolytic metabolism, with a particular focus on the mushroom body, a key structure involved in associative learning and memory. In particular, they identify metabolic shifts in α/β Kenyon cells following classical conditioning, consistent with their established role in energy-demanding middle- and long-term memories.

      These results highlight the potential of label-free FLIM for in-vivo neural circuit studies, providing a powerful complement to genetically encoded sensors. This study is well-conducted and employs rigorous analysis, including careful curve fitting and well-designed controls, to ensure the robustness of its findings. It should serve as a valuable technical reference for researchers interested in using FLIM to study neural metabolism in vivo. Overall, this work represents an important step in the application of FLIM to study the interactions between metabolic processes, neural activity, and cognitive function.

      Reviewer #3 (Public review):

      This study investigates the characteristics of the autofluorescence signal excited by 740 nm 2-photon excitation, in the range of 420-500 nm, across the Drosophila brain. The fluorescence lifetime (FL) appears bi-exponential, with a short 0.4 ns time constant followed by a longer decay. The lifetime decay and the resulting parameter fits vary across the brain. The resulting maps reveal anatomical landmarks, which simultaneous imaging of genetically encoded fluorescent proteins helps to identify. Past work has shown that the autofluorescence decay time course reflects the balance of the free redox cofactor NAD(P)H vs. its protein-bound form. The ratio of free-to-bound NADPH is thought to indicate relative glycolysis vs. oxidative phosphorylation, and thus shifts in the free-to-bound ratio may indicate shifts in metabolic pathways. The basics of this measure have been demonstrated in other organisms, and this study is the first to use the FLIM module of the STELLARIS 8 FALCON microscope from Leica to measure autofluorescence lifetime in the brain of the fly. Methods include registering the brains of different flies to a common template and masking out anatomical regions of interest using fluorescent proteins.

      The analysis relies on fitting an FL decay model with two free parameters, f_free and t_bound. F_free is the fraction of the normalized curve contributed by a decaying exponential with a time constant of 0.4 ns, thought to represent the FL of free NADPH or NADH, which apparently cannot be distinguished. T_bound is the time constant of the second exponential, with scalar amplitude = (1-f_free). The T_bound fit is thought to represent the decay time constant of protein-bound NADPH but can differ depending on the protein. The study shows that across the brain, T_bound can range from 0 to >5 ns, whereas f_free can range from 0.5 to 0.9 (Figure 1a). These methods appear to be solid; the full range of fits is reported, including maximum-likelihood quality parameters, and can serve as a benchmark for future studies.
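
      The two-parameter decay model described in this paragraph can be written down compactly. The sketch below is an illustration only: it fits noiseless synthetic data with a coarse least-squares grid search, whereas the review indicates that the study used maximum-likelihood fitting, and it ignores the instrument response function. The "true" parameter values, the grid ranges, and the function names are all assumptions chosen for the example.

```python
import math

TAU_FREE_NS = 0.4  # fixed lifetime of the free-NAD(P)H component, as stated above


def decay(t_ns, f_free, tau_bound_ns):
    """Normalised bi-exponential decay: free component plus protein-bound component."""
    return (f_free * math.exp(-t_ns / TAU_FREE_NS)
            + (1.0 - f_free) * math.exp(-t_ns / tau_bound_ns))


def grid_fit(times_ns, values):
    """Least-squares grid search over (f_free, tau_bound); a stand-in for a proper fit."""
    best = None
    for i in range(41):                       # f_free from 0.50 to 0.90 in steps of 0.01
        f_free = 0.50 + 0.01 * i
        for j in range(41):                   # tau_bound from 1.0 to 5.0 ns in steps of 0.1
            tau_bound = 1.0 + 0.1 * j
            sse = sum((decay(t, f_free, tau_bound) - v) ** 2
                      for t, v in zip(times_ns, values))
            if best is None or sse < best[0]:
                best = (sse, f_free, tau_bound)
    return best[1], best[2]


# Synthetic, noiseless decay curve with made-up parameters (not data from the study).
times = [0.1 * k for k in range(100)]
synthetic = [decay(t, 0.73, 3.2) for t in times]

f_hat, tau_hat = grid_fit(times, synthetic)
print(f"recovered f_free = {f_hat:.2f}, tau_bound = {tau_hat:.1f} ns")
```

      With real photon-count data the two parameters are estimated from noisy histograms, so per-voxel estimates scatter around the underlying values; this is relevant to the concern, raised below, about how voxel-level fits are averaged.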

      The authors measure the properties of NADPH-related autofluorescence of Kenyon cells (KCs) of the fly mushroom body. The results from the three main figures are:

      (1) Somata and calyx of mushroom bodies have a longer average tau_bound than other regions (Figure 1e);

      (2) The f_free fit is higher for the calyx (input synapses) region than for KC somata (Figure 2b);

      (3) The average across flies of average f_free fits in alpha/beta KC somata decreases from 0.734 to 0.718.

      Based on the first two findings, an accurate title would be "Autofluorescence lifetime imaging reveals regional differences in NADPH state in Drosophila mushroom bodies."

      The third finding is the basis for the title of the paper and the support for this claim is unconvincing. First, the difference in alpha/beta f_free (p-value of 4.98E-2) is small compared to the measured difference in f_free between somas and calyces. It's smaller even than the difference in average soma f_free across datasets (Figure 2b vs c). The metric is also quite derived; first, the model is fit to each (binned) voxel, then the distribution across voxels is averaged and then averaged across flies. If the voxel distributions of f_free are similar to those shown in Supplementary Figure 2, then the actual f_free fits could range between 0.6-0.8. A more convincing statistical test might be to compare the distributions across voxels between alpha/beta vs alpha'/beta' vs. gamma KCs, perhaps with bootstrapping and including appropriate controls for multiple comparisons.
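
      The distribution-level comparison suggested here could look roughly like the sketch below: a percentile bootstrap of the difference in mean f_free between two KC subtypes, with a Bonferroni-adjusted level for the three pairwise comparisons. All data are synthetic stand-ins, and the function name `bootstrap_mean_diff_ci` is hypothetical. Note that voxels within one fly are not independent, so a real analysis would probably need to resample at the level of flies rather than voxels.

```python
import random
import statistics


def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))  # resample with replacement
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic stand-ins for per-voxel f_free values (not data from the study).
rng = random.Random(1)
groups = {
    "alpha/beta": [rng.gauss(0.72, 0.05) for _ in range(200)],
    "alpha'/beta'": [rng.gauss(0.73, 0.05) for _ in range(200)],
    "gamma": [rng.gauss(0.73, 0.05) for _ in range(200)],
}
pairs = [("alpha/beta", "alpha'/beta'"), ("alpha/beta", "gamma"), ("alpha'/beta'", "gamma")]
alpha_adj = 0.05 / len(pairs)  # Bonferroni correction for three comparisons
intervals = {
    pair: bootstrap_mean_diff_ci(groups[pair[0]], groups[pair[1]], alpha=alpha_adj)
    for pair in pairs
}
```

      An interval that excludes 0 at the adjusted level would indicate a difference between the two subtypes under these assumptions.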

      The difference observed is indeed modest relative to the variability of f_free measurements in other contexts. The fact that the difference observed between the somata region and the calyx is larger is not necessarily surprising. Indeed, these areas have different anatomical compositions that may result in different basal metabolic profiles. This is suggested by Figure 1b which shows that the cortex and neuropile have different metabolic signatures. Differences in average f_free values in the somata region can indeed be observed between naive and conditioned flies. However, all comparisons in the article were performed between groups of flies imaged within the same experimental batches, ensuring that external factors were largely controlled for. Naive and conditioned flies were not imaged within the same batches, and this absence of control makes it difficult to extract meaningful information from the comparison between them.

      We agree with the reviewer that the choice of the metric was indeed not well justified in the first manuscript. In the new manuscript, we have tried to illustrate the reasons for this choice with the example of the comparison of f_free in alpha/beta neurons between unpaired and paired conditioning (Dataset 8). First, the idea of averaging across voxels is supported by the fact that the distributions of decay parameters within a single image are predominantly unimodal. Examples for Dataset 8 are now provided in the new Sup. Figure 14. Second, an interpretable comparison between multiple groups of distributions is, to our knowledge, not straightforward to implement. It is now discussed in Supplementary information. To measure interpretable differences in the shapes of the distributions we computed the first three moments of distributions of f_free for Dataset 8 and compared the values obtained between conditions (see Supplementary information and new Sup. Figure 15). Third, averaging across individuals gives each experimental subject the same weight in the comparisons.
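
      As an illustration of what comparing "the first three moments" of a voxel distribution can involve, the sketch below computes the mean, variance, and skewness of a sample. It assumes population (biased) formulas and a standardized third moment; the response does not say which conventions were used, and the function name `first_three_moments` is hypothetical.

```python
import statistics


def first_three_moments(values):
    """Return (mean, variance, skewness) of a sample, using population formulas."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = sum((x - mean) ** 2 for x in values) / n
    if variance == 0:
        return mean, 0.0, 0.0  # skewness is undefined for a constant sample
    third_central = sum((x - mean) ** 3 for x in values) / n
    skewness = third_central / variance ** 1.5
    return mean, variance, skewness


# Toy values standing in for per-voxel f_free in one image (not real data).
mean, variance, skewness = first_three_moments([0.70, 0.72, 0.74, 0.71, 0.80])
```

      Computing these three numbers per image and then comparing them between conditions keeps one value per fly, which matches the equal-weight argument made in the response.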

      I recommend the authors address two concerns. First, what degree of fluctuation in autofluorescence decay can we expect over time, e.g. over circadian cycles? That would be helpful in evaluating the magnitude of changes following conditioning. And second, if the authors think that metabolism shifts to OXPHOS over glycolysis, are there further genetic manipulations they could make? They test LDH knockdown in gamma KCs, why not knock it down in alpha/beta neurons? The prediction might be that if it prevents the shift to OXPHOS, the shift in f_free distribution in alpha/beta KCs would be attenuated. The extensive library of genetic reagents is an advantage of working with flies, but it comes with a higher standard for corroborating claims.

      In the present study, we used control groups to account for broad fluctuations induced by external factors such as the circadian cycle. We agree with the reviewer that a detailed characterization of circadian variations in the decay parameters would be valuable for assessing the magnitude of conditioning-induced shifts. We have integrated this relevant suggestion in the Discussion. Conducting such an investigation lies unfortunately beyond the scope and means of the current project.

      In line with the suggestion of the reviewer, we have included a new experiment to test the influence of the knockdown of ALAT on the conditioning-induced shift measured in alpha/beta neurons. This choice is motivated in the new manuscript. The obtained result shows that no shift is detected in the mutant flies, in accordance with our hypothesis.

      FLIM as a method is not yet widely prevalent in fly neuroscience, but recent demonstrations of its potential are likely to increase its use. Future efforts will benefit from the description of the properties of the autofluorescence signal to evaluate how autofluorescence may impact measures of FL of genetically engineered indicators.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      (1) Y axes in Figures 1e, 2c, 3b,c are misleading. They must start at 0.

      Although we agree that making the Y axes start at 0 is preferable, in our case it makes it difficult to observe the dispersion of the data at the same time (your next suggestion). To make it clearer to the reader that the axes do not start at 0, a broken Y-axis is now displayed in every concerned figure.

      (2) These same plots should have individual data points represented, for increased clarity and transparency.

      Individual data points were added on all boxplots.

      Reviewer #2 (Recommendations for the authors):

      I am evaluating this paper as a fly neuroscientist with experience in neurophysiology, including calcium imaging. I have little experience with FLIM but anticipate its use growing as more microscopes and killer apps are developed. From this perspective, I value the opportunity to dig into FLIM and try to understand this autofluorescence signal. I think the effort to show each piece of the analysis pipeline is valuable. The figures are quite beautiful and easy to follow. My main suggestion is to consider moving some of the supplemental data to the main figures. eLife allows unlimited figures, moving key pieces of the pipeline to the main figures would make for smoother reading and emphasize the technical care taken in this study.

      We thank the reviewer for their feedback. Following their advice we have moved panels from the supplementary figures to the main text (see new Figure 2).

      Unfortunately, the scientific questions and biological data do not rise to the typical standard in the field to support the claims in the title, "In vivo autofluorescence lifetime imaging of the Drosophila brain captures metabolic shifts associated with memory formation". The authors also clearly state what the next steps are: "hypothesis-driven approaches that rely on metabolite-specific sensors" (Intro). The advantage of fly neuroscience is the extensive library of genetic reagents that enable perturbations. The key manipulation in this study is the electric shock conditioning paradigm that subtly shifts the distribution of a parameter fit to an exponential decay in the somas of alpha/beta KCs vs others. This feels like an initial finding that deserves follow-up; but is it a large enough result to motivate a future student to pick this project up? The larger effect appears to be the gradients in f_free across KCs overall (Figure 2b). How does this change with conditioning?

      We acknowledge that the observed metabolic shift is modest relative to the variability of f_free and agree that additional corroborating experiments would further strengthen this result. Nevertheless, we believe it remains a valid and valuable finding that will be of interest to researchers in the field. The reviewer is right in pointing out that the gradient across KCs is higher in magnitude, however, the fact that this technique can also report experience-dependent changes, in addition to innate heterogeneities across different cell types, is a major incentive for people who could be interested in applying NAD(P)H FLIM in the future. For this reason, we consider it appropriate to retain mention of the memory-induced shift in the title, while making it less assertive and adding a reference to the structural heterogeneities of f_free revealed in the study. We have also rephrased the abstract to adopt a more cautious tone and expanded the discussion to clarify why a low-magnitude shift in f_free can still carry biological significance in this context. Finally, we have added the results of a new set of data involving the knockdown of ALAT in Kenyon cells, to further support the relevance of our observation relative to memory formation, despite its small magnitude. We believe that these elements together form a good basis for future investigations and that the manuscript merits publication in its present form.

      Together, I would recommend reshaping the paper as a methods paper that asks the question, what are the spatial properties of NADPH FL across the brain? The importance of this question is clear in the context of other work on energy metabolism in the MBs. 2P FLIM will likely always have to account for autofluorescence, so this will be of interest. The careful technical work that is the strength of the manuscript could be featured, and whether conditioning shifts f_free could be a curio that might entice future work.

      By transferring panels of the supplementary figures to the main text (see new Figure 2) as suggested by Reviewer 2, we have reinforced the methodological part of the manuscript. For the reasons explained above, we however still mention the ‘biological’ findings in the title and abstract.

      Minor recommendations on science:

      Figure 2C. Plotting either individual data points or distributions would be more convincing.

      Individual data points were added on all boxplots.

      There are a few mentions of glia. What are the authors' expectations for metabolic pathways in glia vs. neurons? Are glia expected to use one more than the other? The work by Rabah suggests it should be different and perhaps complementary to neurons. Can a glial marker be used in addition to KC markers? This seems crucial to being able to distinguish metabolic changes in KC somata from those in glia.

      Drosophila cortex glia are thought to play a similar role as astrocytes in vertebrates (see Introduction). In that perspective, we expect cortex glia to display a higher level of glycolysis than neurons. The work by Rabah et al. is consistent with this hypothesis. Reviewer 2 is right in pointing out that using a glial marker would be interesting. However, current technical limitations make such experiments challenging. These limitations are now described in the discussion.

      The question of whether KC somata positions are stereotyped can probably be answered in other ways as well. For example, the KCs are in the FAFB connectomic data set and the hemibrain. How do the somata positions compare?

      The reviewer’s suggestion is indeed interesting. However, the FAFB and hemibrain connectomic datasets are based on only two individual flies, which probably limits their suitability for assessing the stereotypy of KC subtype distributions. In addition, aligning our data with the FAFB dataset would represent substantial additional work.

      The free parameter tau_bound is mysterious if it can be influenced by the identity of the protein. Are there candidate NADPH binding partners that have a spatial distribution in confocal images that could explain the difference between somas and calyx?

      There are indeed dozens of NADH- or NADPH-binding proteins. For this reason, in all studies implementing exponential fitting of metabolic FLIM data, tau_bound is considered a complex combination of the contributions from many different proteins. In addition, one should keep in mind that the number of cell types contributing to the autofluorescence signal in the mushroom body calyx (Kenyon cells, astrocyte-like and ensheathing glia, APL neurons, olfactory projection neurons, dopamine neurons) is much higher than in the somas (only Kenyon cells and cortex glia). This could also contribute to the observed difference. Hence, focusing on intracellular heterogeneities of potential NAD(P)H binding partners seems premature at this stage.

      The phrase "noticeable but not statistically significant" is misleading.

      We agree with the reviewer and have removed “noticeable but” from the sentence in the new version of the manuscript.

      Minor recommendations on presentation:

      The Introduction can be streamlined.

      We agree that some parts of the Introduction can seem a bit long for experts of a particular field. However, we think that this level of detail makes the article easily accessible for neuroscientists working on Drosophila and other animal models but not necessarily with FLIM, as well as for experts in energy metabolism that may be familiar with FLIM but not with Drosophila neuroscience.

    1. Reviewer #3 (Public review):

      This paper applies a computational model to behavior in a probabilistic operant reward learning task (a 3-armed bandit) to uncover differences between individuals with temporomandibular disorder (TMD) compared with healthy controls. Integrating computational principles and models into pain research is an important direction, and the findings here suggest that TMD is associated with subtle changes in how uncertainty is represented over time as individuals learn to make choices that maximize reward. There are a number of strengths, including the comparison of a volatile Kalman filter (vKF) model to some standard base models (Rescorla Wagner with 1 or 2 learning rates) and parameter recovery analyses suggesting that the combination of task and vKF model may be able to capture some properties of learning and decision-making under uncertainty that may be altered in those suffering from chronic pain-related conditions.
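
      For context on the base models mentioned here, a Rescorla-Wagner update with two learning rates can be sketched as below. This is a generic textbook form, assuming separate rates for positive and negative prediction errors; it is not taken from the paper under review, and the names `rw_update`, `alpha_pos`, and `alpha_neg` are hypothetical.

```python
def rw_update(value, reward, alpha_pos, alpha_neg):
    """One Rescorla-Wagner update with separate learning rates.

    alpha_pos is applied when the prediction error is non-negative,
    alpha_neg when it is negative.
    """
    delta = reward - value  # reward prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg
    return value + alpha * delta


# Toy reward sequence for a single option (illustrative only).
value = 0.5
for reward in [1, 1, 0, 1, 0]:
    value = rw_update(value, reward, alpha_pos=0.3, alpha_neg=0.1)
```

      Setting `alpha_pos` equal to `alpha_neg` recovers the single-learning-rate model. Unlike a volatile Kalman filter, neither variant adjusts its learning rate over the course of the task.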

      I've focused my comments in four areas: (1) Questions about the patient population, (2) Questions about what the findings here mean in terms of underlying cognitive/motivational processes, (3) Questions about the broader implications for understanding individuals with TMD and other chronic pain-related disorders, and (4) Technical questions about the models and results.

      (1) Patient population

      This is a computational modelling study, so it is light on characterization of the population, but the patient characteristics could matter. The paper suggests they were hospitalized, but this is not a condition that requires hospitalization per se. It would be helpful to connect and compare the patient characteristics with large-scale studies of TMD, such as the OPPERA study led by Maixner, Fillingim, and Slade.

      (2) What cognitive/motivational processes are altered in TMD

      The study finds a pattern of alterations in TMD patients that seems clear in Figure 2. Healthy controls (HC) start the task with high estimates of volatility, uncertainty, and learning rate, which drop over the course of the task session. This is consistent with a learner that is initially uncertain about the structure of the environment (i.e., which options are rewarded and how the contingencies change over time) but learns that there is a fixed or slowly changing mean and stationary variance. The TMD patients start off with much lower volatility, uncertainty, and learning rate - which are actually all near 0 - and they remain stable over the course of learning. This is consistent with a learner who believes they know the structure of the environment and ignores new information.

      What is surprising is that this pattern of changes over time was found in spite of null group differences in a number of aspects of performance: (1) stay rate, (2) switch rate, (3) win-stay/lose-switch behaviors, (4) overall performance (corrected for chance level), (5) response times, (6) autocorrelation, (7) correlations between participants' choice probability and each option's average reward rate, (8) choice consistency (though how this was operationalized is not described), (9) win-stay-lose-shift patterns over time. I'm curious about how the patterns in Figure 2 would emerge if standard aspects of performance are essentially similar across groups (though the study cannot provide evidence in favor of the null). It will be important to replicate these patterns in larger, independent samples with preregistered analyses.
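
      To make one of these model-free measures concrete, the sketch below computes win-stay and lose-shift rates from a choice and reward sequence. It assumes binary rewards and counts every transition between consecutive trials; the paper may operationalize the measure differently, and the function name `win_stay_lose_shift` is hypothetical.

```python
def win_stay_lose_shift(choices, rewards):
    """Return (P(stay | previous win), P(shift | previous loss))."""
    win_stay = wins = lose_shift = losses = 0
    # Pair each trial's choice and reward with the choice on the next trial.
    for prev_choice, prev_reward, choice in zip(choices, rewards, choices[1:]):
        if prev_reward > 0:
            wins += 1
            win_stay += choice == prev_choice
        else:
            losses += 1
            lose_shift += choice != prev_choice
    p_win_stay = win_stay / wins if wins else float("nan")
    p_lose_shift = lose_shift / losses if losses else float("nan")
    return p_win_stay, p_lose_shift


# Toy sequence from a 3-armed bandit (arms 0, 1, 2), illustrative only.
p_ws, p_ls = win_stay_lose_shift([0, 0, 1, 1, 2], [1, 0, 1, 0, 1])
```

      Two groups can have similar values on summary measures like these while still differing in trial-by-trial model estimates, which is the puzzle raised in this paragraph.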

      The authors believe that this pattern of findings reveals that TMD patients "maintain a chronically heightened sensitivity to environmental changes" and relate the findings to predictive processing, a hallmark of which (in its simplest form) is precision-weighted updating of priors. They also state that the findings are not related to reduced overall attentiveness or failure to understand the task, but describe them as deficits or impairments in calibrating uncertainty.

      The pattern of differences could, in fact, result from differences in prior beliefs, conceptualization of the task, or learning. Unpacking these will be important steps for future work, along with direct measures of priors, cognitive processes during learning, and precision-weighted updating.

      (3) Implications for understanding chronic pain

      If the findings and conclusions of the paper are correct, individuals with TMD and perhaps other pain-related disorders may have fundamental alterations in the ways in which they make decisions about even simple monetary rewards. The broader questions for the field concern (1) how generalizable such alterations are across tasks, (2) how generalizable they are across patient groups and, conversely, how specific they are to TMD or chronic pain, (3) whether they are the result of neurological dysfunction, as opposed to (e.g.) adaptive strategies or assumptions about the environment/task structure.

      It will be important to understand which features of patients' and/or controls' cognition are driving the changes. For example, could the performance differences observed here be attributable to a reduced or altered understanding of the task instructions, more uncertainty about the rules of the game, different assumptions about environments (i.e., that they are more volatile/uncertain or less so), or reduced attention or interest in optimizing performance? Are the controls OVERconfident in their understanding of the environment?

      This set of questions will not be easy to answer and will be the work of many groups for many years to come. It is a judgment call how far any one paper must go to address them, but my view is that it is a collaborative effort. Start with a finding, replicate it across labs, take the replicable phenomena and work to unpack the underlying questions. The field must determine whether it is this particular task with this model that produces case-control differences (and why), or whether the findings generalize broadly. Would we see the same findings for monetary losses, sounds, and social rewards? Tasks with painful stimuli instead of rewards?

      Another set of questions concerns the space of computational models tested, and whether their parameters are identifiable. An alteration in estimated volatility or learning rate, for example, can come from multiple sources. In one model, it might appear as a learning rate change and in another as a confirmation bias. It would be interesting in this regard to compare the "mechanisms" (parameters) of other models used in pain neuroscience, e.g., models by Seymour, Mancini, Jepma, Petzschner, Smith, Chen, and others (just to name a few).

      One immediate next step here could be to formally compare the performance of both patients and controls to normatively optimal models of performance (e.g., Bayes optimal models under different assumptions). This could also help us understand whether the differences in patients reflect deficits and what further experiments we would need to pin that down.

      In addition, the volatility parameter in the computational model correlated with apathy. This is interesting. Is there a way to distinguish apathy as a particular clinical characteristic and feature of TMD from apathy in the sense of general disinterest in optimal performance that may characterize many groups?

      If we know this, what actionable steps does it lead us to take? Could we take steps to reduce apathy and thus help TMD patients better calibrate to environmental uncertainty in their lives? Or take steps to recalibrate uncertainty (i.e., increase uncertainty adaptation), with benefits on apathy? A hallmark of a finding that the field can build off of is the questions it raises.

      (4) Technical questions about the models and results

      Clarification of some technical points would help interpret the paper and findings further:

      (a) Was the reward probability truly random? Was the random walk different for each person, or constrained?

      (b) When were self-report measures administered, and how?

      (c) Pain assessments: What types of pain? Was a body map assessed? Widespreadness? Pain at the time of the test, or pain in general?

      (d) Parameter recovery: As you point out, r = 0.47 seems very low for recovery of the true quantity, but this depends on noise levels and on how the parameter space is sampled. Is this noise-free recovery, and is it robust to noise? Are the examples of true parameters drawn from the space of participants, or do they otherwise systematically sample the space of true parameters?

      (e) What are the covariances across parameter estimates and resultant confusability of parameter estimates (e.g., confusion matrix)?

      (f) It would be helpful to have a direct statistical comparison of controls and TMD on model parameter estimates.

      (g) Null statistical findings on differences in correlations should not be interpreted as a lack of a true effect. Bayes Factors could help, but an analysis of them will show that hundreds of people are needed before it is possible to say there are no differences with reasonable certainty. Some journals enforce rules around the kinds of language used to describe null statistical findings, and I think it would be helpful to adopt them more broadly.

      (h) What is normatively optimal in this task? Are TMD patients less so, or not? The paper states "aberrant precision (uncertainty) weighting and misestimation of environmental volatility". But: are they misestimates?

      (i) It's not clear how well the choice of prior variance for all parameters (6.25) is informed by previous research, as sensible values may be task- and context-dependent. Are the main findings robust to how priors are specified in the HBI model?

    1. This means that media, which includes painting, movies, books, speech, songs, dance, etc., all communicates in some way, and thus are social. And every social thing humans do is done through various mediums. So, for example, a war is enacted through the mediums of speech (e.g., threats, treaties, battle plans), coordinated movements, clothing (uniforms), and, of course, the mediums of weapons and violence.

      The definition of bots in this chapter highlights that automation exists on a spectrum rather than as a simple bot vs. human distinction. I found it interesting that many accounts we interact with daily may be partially automated, which challenges the assumption that bots are always deceptive or malicious. This makes me think that ethical concerns should focus more on transparency and intent, not just whether automation is involved.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Here, the authors have addressed the recruitment and firing patterns of motor units (MUs) from the long and lateral heads of the triceps in the mouse. They used their newly developed Myomatrix arrays to record from these muscles during treadmill locomotion at different speeds, and they used template-based spike sorting (Kilosort) to extract units. Between MUs from the two heads, the authors observed differences in their firing rates, recruitment probability, phase of activation within the locomotor cycle, and interspike interval patterning. Examining different walking speeds, the authors find increases in both recruitment probability and firing rates as speed increases. The authors also observed differences in the relation between recruitment and the angle of elbow extension between motor units from each head. These differences indicate meaningful variation between motor units within and across motor pools and may reflect the somewhat distinct joint actions of the two heads of triceps.

      Strengths:

      The extraction of MU spike timing for many individual units is an exciting new method that has great promise for exposing the fine detail in muscle activation and its control by the motor system. In particular, the methods developed by the authors for this purpose seem to be the only way to reliably resolve single MUs in the mouse, as the methods used previously in humans and in monkeys (e.g. Marshall et al. Nature Neuroscience, 2022) do not seem readily adaptable for use in rodents.

      The paper provides a number of interesting observations. There are signs of interesting differences in MU activation profiles for individual muscles here, consistent with those shown by Marshall et al. It is also nice to see fine-scale differences in the activation of different muscle heads, which could relate to their partially distinct functions. The mouse offers greater opportunities for understanding the control of these distinct functions, compared to the other organisms in which functional differences between heads have previously been described.

      The Discussion is very thorough, providing a very nice recounting of a great deal of relevant previous results.

      We thank the Reviewer for these comments.

      Weaknesses:

      The findings are limited to one pair of muscle heads. While an important initial finding, the lack of confirmation from analysis of other muscles acting at other joints leaves the general relevance of these findings unclear.

      The Reviewer raises a fair point. While outside the scope of this paper, future studies should certainly address a wider range of muscles to better characterize motor unit firing patterns across different sets of effectors with varying anatomical locations. Still, the importance of results from the triceps long and lateral heads should not be understated as this paper, to our knowledge, is the first to capture the difference in firing patterns of motor units across any set of muscles in the locomoting mouse.

      While differences between muscle heads with somewhat distinct functions are interesting and relevant to joint control, differences between MUs for individual muscles, like those in Marshall et al., are more striking because they cannot be attributed potentially to differences in each head's function. The present manuscript does show some signs of differences for MUs within individual heads: in Figure 2C, we see what looks like two clusters of motor units within the long head in terms of their recruitment probability. However, a statistical basis for the existence of two distinct subpopulations is not provided, and no subsequent analysis is done to explore the potential for differences among MUs for individual heads.

      We agree with the Reviewer and have revised the manuscript to better examine potential subpopulations of units within each muscle as presented in Figure 2C. We performed Hartigan’s dip test on motor units within each muscle to test for multimodal distributions. For both muscles, p > 0.05, so we cannot reject the null hypothesis that the units in each muscle come from a unimodal distribution. However, Hartigan’s test and similar statistical methods have poor statistical power for the small sample sizes (n=17 and 16 for long and lateral heads, respectively) considered here, so the failure to achieve statistical significance might reflect either the absence of a true difference or a lack of statistical resolution.

      Still, the limited sample size warrants further data collection and analysis since the varying properties across motor units may lead to different activation patterns. Given these results, we have edited the text as follows:

      “A subset of units, primarily in the long head, were recruited in under 50% of the total strides and with lower spike counts (Figure 2C). This distribution of recruitment probabilities might reflect a functionally different subpopulation of units. However, the distribution of recruitment probabilities were not found to be significantly multimodal (p>0.05 in both cases, Hartigan’s dip test; Hartigan, 1985). However, Hartigan’s test and similar statistical methods have poor statistical power for the small sample sizes (n=17 and 16 for long and lateral heads, respectively) considered here, so the failure to achieve statistical significance might reflect either the absence of a true difference or a lack of statistical resolution.”

      The statistical foundation for some claims is lacking. In addition, the description of key statistical analysis in the Methods is too brief and very hard to understand. This leaves several claims hard to validate.

      We thank the Reviewer for these comments and have clarified the text related to key statistical analyses throughout the manuscript, as described in our other responses below.

      Reviewer #2 (Public review):

      The present study, led by Thomas and collaborators, aims to describe the firing activity of individual motor units in mice during locomotion. To achieve this, they implanted small arrays of eight electrodes in two heads of the triceps and performed spike sorting using a custom implementation of Kilosort. Simultaneously, they tracked the positions of the shoulder, elbow, and wrist using a single camera and a markerless motion capture algorithm (DeepLabCut). Repeated one-minute recordings were conducted in six mice at five different speeds, ranging from 10 to 27.5 cm·s⁻¹.

      From these data, the authors reported that:

      (1) a significant portion of the identified motor units was not consistently recruited across strides,

      (2) motor units identified from the lateral head of the triceps tended to be recruited later than those from the long head,

      (3) the number of spikes per stride and peak firing rates were correlated in both muscles, and

      (4) the probability of motor unit recruitment and firing rates increased with walking speed.

      The authors conclude that these differences can be attributed to the distinct functions of the muscles and the constraints of the task (i.e., speed).

      Strengths:

      The combination of novel electrode arrays to record intramuscular electromyographic signals from a larger muscle volume with an advanced spike sorting pipeline capable of identifying populations of motor units.

      We thank the Reviewer for this comment.

      Weaknesses:

      (1) There is a lack of information on the number of identified motor units per muscle and per animal.

      The Reviewer is correct that this information was not explicitly provided in the prior submission. We have therefore added Table 1 that quantifies the number of motor units per muscle and per animal.

      (2) All identified motor units are pooled in the analyses, whereas per-animal analyses would have been valuable, as motor units within an individual likely receive common synaptic inputs. Such analyses would fully leverage the potential of identifying populations of motor units.

      Please see our answer to the following point, where we address questions (2) and (3) together.

      (3) The current data do not allow for determining which motor units were sampled from each pool. It remains unclear whether the sample is biased toward high-threshold motor units or representative of the full pool.

      We thank the Reviewer for these comments. To clarify how motor unit responses were distributed across animals and muscle targets, we updated or added the following figures:  

      Figure 2C

      Figure 4–figure supplement 1

      Figure 5–figure supplement 2

      Figure 6–figure supplement 2

      These provide a more complete look at the range of activity within each motor pool, suggesting that we do measure from units with different activation thresholds within the same motor pool, rather than this variation being due to cross-animal differences. For example, Figure 2C illustrates that motor units from the same muscle and animal show a wide variety of recruitment probabilities. However, the limited number of motor units recorded from each individual animal does not allow a statistically rigorous test for examining cross-animal differences.

      (4) The behavioural analysis of the animals relies solely on kinematics (2D estimates of elbow angle and stride timing). Without ground reaction forces or shoulder angle data, drawing functional conclusions from the results is challenging.

      The Reviewer is correct that we did not measure muscular force generation or ground reaction forces in the present study. Although outside the scope of this study, future work might employ buckle force transducers as used in larger animals (Biewener et al., 1988; Karabulut et al., 2020) to examine the complex interplay between neural commands, passive biomechanics, and the complex force-generating properties of muscle tissue.

      Major comments:

      (1) Spike sorting

      The conclusions of the study rely on the accuracy and robustness of the spike sorting algorithm during a highly dynamic task. Although the pipeline was presented in a previous publication (Chung et al., 2023, eLife), a proper validation of the algorithm for identifying motor unit spikes is still lacking. This is particularly important in the present study, as the experimental conditions involve significant dynamic changes. Under such conditions, muscle geometry is altered due to variations in both fibre pennation angles and lengths.

      This issue differs from electrode drift, and it is unclear whether the original implementation of Kilosort includes functions to address it. Could the authors provide more details on the various steps of their pipeline, the strategies they employed to ensure consistent tracking of motor unit action potentials despite potential changes in action potential waveforms, and the methods used for manual inspection of the spike sorting algorithm's output?

      This is an excellent point and we agree that the dynamic behavior used in this investigation creates potential new challenges for spike sorting. In our analysis, Kilosort 2.5 provides key advantages in comparing unit waveforms across multiple channels and in detecting overlapping spikes. We modified this version of Kilosort to construct unit waveform templates using only the channels within the same muscle (Chung et al., 2023), as clarified in the revised Methods section (see “Electromyography (EMG)”):

      “A total of 33 units were identified across all animals. Each unit’s isolation was verified by confirming that no more than 2% of inter-spike intervals violated a 1 ms refractory limit. Additionally, we manually reviewed cross-correlograms to ensure that each waveform was only reported as a single motor unit.”
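
      The refractory-period criterion quoted above can be illustrated with a short sketch. This is not the authors' code: the function names are hypothetical, spike times are assumed to be in seconds, and the 1 ms limit and 2% ceiling are taken from the quoted Methods text.

```python
def refractory_violation_fraction(spike_times_s, refractory_s=0.001):
    """Fraction of inter-spike intervals (ISIs) shorter than the refractory limit."""
    times = sorted(spike_times_s)
    isis = [later - earlier for earlier, later in zip(times, times[1:])]
    if not isis:
        # Fewer than two spikes: no ISIs, hence no violations.
        return 0.0
    return sum(isi < refractory_s for isi in isis) / len(isis)


def passes_isolation_check(spike_times_s, max_fraction=0.02, refractory_s=0.001):
    """True when at most `max_fraction` of ISIs violate the refractory limit."""
    return refractory_violation_fraction(spike_times_s, refractory_s) <= max_fraction
```

      A unit whose spike train contains many sub-millisecond intervals would fail this check, which is one common sign that two units have been merged.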

      The Reviewer is correct that our ability to precisely measure a unit’s activity based on its waveform will depend on the relationship between the embedded electrode and the muscle geometry, which alters over the course of the stride. As a follow-up to the original text, we have included new analyses to characterize the waveform activity throughout the experiment and stride (also in Methods):

      “We further validated spike sorting by quantifying the stability of each unit’s waveform across time (Figure 1–figure supplement 1). First, we calculated the median waveform of each unit across every trial to capture long-term stability of motor unit waveforms. Additionally, we calculated the median waveform through the stride binned in 50 ms increments using spiking from a single trial. This second metric captures the stability of our spike sorting during the rapid changes in joint angles that occur during the burst of an individual motor unit. In doing so, we calculated each motor unit’s waveforms from the single channel in which that unit’s amplitude was largest and did not attempt to remove overlapping spikes from other units before measuring the median waveform from the data. We then calculated the correlation between a unit’s waveform over either trials or bins in which at least 30 spikes were present. The high correlation of a unit waveform over time, despite potential changes in the electrodes’ position relative to muscle geometry over the dynamic task, provides additional confidence in both the stability of our EMG recordings and the accuracy of our spike sorting.”

      (2) Yield of the spike sorting pipeline and analyses per animal/muscle

      A total of 33 motor units were identified from two heads of the triceps in six mice (17 from the long head and 16 from the lateral head). However, precise information on the yield per muscle per animal is not provided. This information is crucial to support the novelty of the study, as the authors claim in the introduction that their electrode arrays enable the identification of populations of motor units. Beyond reporting the number of identified motor units, another way to demonstrate the effectiveness of the spike sorting algorithm would be to compare the recorded EMG signals with the residual signal obtained after subtracting the action potentials of the identified motor units, using a signal-to-residual ratio.

      Furthermore, motor units identified from the same muscle and the same animal are likely not independent due to common synaptic inputs. This dependence should be accounted for in the statistical analyses when comparing changes in motor unit properties across speeds and between muscles.

      We thank the Reviewer for this comment. Regarding motor unit yield, as described above the newly-added Table 1 displays the yield from each animal and muscle.

      Regarding spike sorting, while signal-to-residual is often an excellent metric, it is not ideal for our high-resolution EMG signals since isolated single motor units are typically superimposed on a “bulk” background consisting of the low-amplitude waveforms of other motor units. Because these smaller units typically cannot be sorted, it is challenging to estimate the “true” residual after subtracting (only) the largest motor unit, since subtracting each sorted unit’s waveform typically has a very small effect on the RMS of the total EMG signal. To further address concerns regarding spike sorting quality, we added Figure 1–figure supplement 1 that demonstrates motor units’ consistency over the experiment, highlighting that the waveform maintains its shape within each stride despite muscle/limb dynamics and other possible sources of electrical noise or artifact.

      Finally, the Reviewer is correct that individual motor units in the same muscle are very likely to receive common synaptic inputs. These common inputs may result in sparsely recruited motor units being active in overlapping rather than different strides. Indeed, in the following passage added to the Results, we found that motor units are recruited with higher probability when other units are also recruited.

      “Probabilistic recruitment is correlated across motor units

      Our results show that the recruitment of individual motor units is probabilistic even within a single speed quartile (Figure 5A-C) and predicts body movements (Figure 6), raising the question of whether the recruitment of individual motor units is correlated or independent. Correlated recruitment might reflect shared input onto the population of motor units innervating the muscle (De Luca, 1985; De Luca & Erim, 1994; Farina et al., 2014). For example, two motor units, each with low recruitment probabilities, may still fire during the same set of strides. To assess the independence of motor unit recruitment across the recorded population, we compared each unit’s empirical recruitment probability across all strides to its conditional recruitment probability during strides in which another motor unit from the same muscle was recruited (Figure 7). Doing this for all motor unit pairs revealed that motor units in both muscles were biased towards greater recruitment when additional units were active (p<0.001, Wilcoxon signed-rank tests for both the lateral and long heads of triceps). This finding suggests that probabilistic recruitment reflects common synaptic inputs that covary across locomotor strides.”
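
      The comparison described in this passage can be sketched as follows. This is a minimal illustration rather than the authors' analysis: recruitment is assumed to be stored as one 0/1 flag per stride for each unit, and the function names are hypothetical. The paired differences across unit pairs could then be tested with a signed-rank test (for example `scipy.stats.wilcoxon`), which is not reproduced here.

```python
def recruitment_probability(recruited):
    """Empirical recruitment probability: fraction of strides with the unit active."""
    return sum(recruited) / len(recruited)


def conditional_recruitment_probability(unit, other):
    """Recruitment probability of `unit` restricted to strides where `other` was active.

    Returns None when `other` is never recruited, since the conditional
    probability is then undefined.
    """
    conditioned = [u for u, o in zip(unit, other) if o]
    if not conditioned:
        return None
    return sum(conditioned) / len(conditioned)


# Toy example: two units from the same muscle over eight strides.
unit_a = [1, 0, 1, 0, 0, 1, 0, 0]
unit_b = [1, 0, 1, 0, 0, 0, 0, 1]
p_a = recruitment_probability(unit_a)
p_a_given_b = conditional_recruitment_probability(unit_a, unit_b)
```

      In this toy example unit A is recruited on 3 of 8 strides overall but on 2 of the 3 strides in which unit B is active, the kind of bias the passage attributes to shared input.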

      (3) Representativeness of the sample of identified motor units

      However, to draw such conclusions, the authors should exclusively compare motor units from the same pool and systematically track violations of the recruitment order. Alternatively, they could demonstrate that the motor units that are intermittently active across strides correspond to the smallest motor units, based on the assumption that these units should always be recruited due to their low activation thresholds.

      One way to estimate the size of motor units identified within the same muscle would be to compare the amplitude of their action potentials, assuming that all motor units are relatively close to the electrodes (given the selectivity of the recordings) and that motoneurons innervating more muscle fibres generate larger motor unit action potentials.

      We thank the Reviewer for this comment. Below, we provide more detailed analyses of the relationships between motor unit spike amplitude and the recruitment probability as well as latency (relative to stride onset) of activation.

      We generated the below figures to illustrate the relationship between the amplitude of motor units and their firing properties. As suspected, units with larger-amplitude waveforms fired with lower probability and produced their first spikes later in the stride. If we were comfortable assuming that larger spike amplitudes mean higher-force units, then this would be consistent with a key prediction of the size principle (i.e. that higher-force units are recruited later). However, we are hesitant to base any conclusions on this assumption or emphasize this point with a main-text figure, since EMG signal amplitude may also vary due to the physical properties of the electrode and distance from muscle fibers. Thus it is possible that a large motor unit may have a smaller waveform amplitude relative to the rest of the motor pool.

      Author response image 1.

      Relation between motor unit amplitude and (A) recruitment probability and (B) mean first spike time within the stride. Colored lines indicate the outcome of linear regression analyses.

      Currently, the data seem to support the idea that motor units that are alternately recruited across strides have recruitment thresholds close to the level of activation or force produced during slow walking. The fact that recruitment probability monotonically increases with speed suggests that the force required to propel the mouse forward exceeds the recruitment threshold of these "large" motor units. This pattern would primarily reflect spatial recruitment following the size principle rather than flexible motor unit control.

      We thank the Reviewer for this comment. We agree with this interpretation, particularly in relation to the references suggested in later comments, and have added the following text to the Discussion to better reflect this argument:

      “To investigate the neuromuscular control of locomotor speed, we quantified speed-dependent changes in both motor unit recruitment and firing rate. We found that the majority of units were recruited more often and with larger firing rates at faster speeds (Figure 5, Figure 5–figure supplement 1). This result may reflect speed-dependent differences in the common input received by populations of motor neurons with varying spiking thresholds (Henneman et al., 1965). In the case of mouse locomotion, faster speeds might reflect a larger common input, increasing the recruitment probability as more neurons, particularly those that are larger and generate more force, exceed threshold for action potentials (Farina et al., 2014).”

      (4) Analysis of recruitment and firing rates

      The authors currently report active duration and peak firing rates based on spike trains convolved with a Gaussian kernel. Why not report the peak of the instantaneous firing rates estimated from the inverse of the inter-spike interval? This approach appears to be more aligned with previous studies conducted to describe motor unit behaviour during fast movements (e.g., Desmedt & Godaux, 1977, J Physiol; Van Cutsem et al., 1998, J Physiol; Del Vecchio et al., 2019, J Physiol).

      We thank the Reviewer for this comment. In the revised Discussion (see ‘Firing rates in mouse locomotion compared to other species’) we reference several examples of previous studies that quantified spike patterns based on the instantaneous firing rate. We chose to report the peak of the smoothed firing rate because that quantification includes strides with zero spikes or only one spike, which occur regularly in our dataset (and for which ISI rate measures, which require two spikes to define an instantaneous firing rate, cannot be computed). Regardless, in the revised Figure 4B, we present an analysis that uses inter-spike intervals as suggested, which yielded similar ranges of firing rates as the primary analysis.
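
      The distinction drawn in this response can be made concrete with a small sketch. It is illustrative only: the function names, the 10 ms kernel width, and the 1 ms evaluation step are assumptions, not values reported in the response.

```python
import math


def peak_isi_rate(spike_times_s):
    """Peak instantaneous rate (Hz) as the inverse of the shortest ISI.

    Returns None for strides with fewer than two spikes, where no ISI exists.
    """
    times = sorted(spike_times_s)
    isis = [b - a for a, b in zip(times, times[1:]) if b > a]
    if not isis:
        return None
    return 1.0 / min(isis)


def peak_smoothed_rate(spike_times_s, t_start, t_stop, sigma_s=0.01, dt=0.001):
    """Peak (Hz) of the spike train convolved with a unit-area Gaussian kernel.

    Defined for any spike count, including zero (peak 0.0) or one spike.
    """
    norm = 1.0 / (sigma_s * math.sqrt(2.0 * math.pi))
    n_steps = int(round((t_stop - t_start) / dt)) + 1
    peak = 0.0
    for i in range(n_steps):
        t = t_start + i * dt
        rate = sum(norm * math.exp(-0.5 * ((t - s) / sigma_s) ** 2)
                   for s in spike_times_s)
        peak = max(peak, rate)
    return peak
```

      With a single spike in a stride, `peak_isi_rate` has nothing to return, whereas `peak_smoothed_rate` still yields a finite value set by the kernel height, which is why the smoothed measure can include every stride.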

      (5) Additional analyses of behaviour

      The authors currently analyse motor unit recruitment in relation to elbow angle. It would be valuable to include a similar analysis using the angular velocity observed during each stride. More broadly, comparing stride-by-stride changes in firing rates with changes in elbow angular velocity would further strengthen the final analyses presented in the results section.

      We thank the Reviewer for this comment. To address this, we have modified Figure 6 and the associated Supplemental Figures to show relationships in unit activation with both the range of elbow extension and the range of elbow velocity for each stride. These new Supplemental Figures show that the trends shown in main text Figure 6C and 6E (which show data from all speed quartiles on the same axes) are also apparent in both the slower and faster quartiles individually, although single-quartile statistical tests (with smaller sample size than the main analysis) do not reach statistical significance in all cases.

      Reviewer #3 (Public review):

      Summary:

      Using the approach of Myomatrix recording, the authors report that:

      (1) Motor units are recruited differently in the two types of muscles.

      (2) Individual units are probabilistically recruited during the locomotion strides, whereas the population bulk EMG has a more reliable representation of the muscle.

      (3) The recruitment of units was proportional to walking speed.

      Strengths:

      The new technique provides a unique data set, and the data analysis is convincing and well-performed.

      We thank the Reviewer for the comment.

      Weaknesses:

      The implications of "probabilistical recruitment" should be explored, addressed, and analyzed further.

      Comments:

      One of the study's main findings (perhaps the main finding) is that the motor units are "probabilistically" recruited. The authors do not define what they mean by probabilistically recruited, nor do they present an alternative scenario to such recruitment or discuss why this would be interesting or surprising. However, on page 4, they do indicate that the recruitment of units from both muscles was only active in a subset of strides, i.e., they are not reliably active in every step.

      If probabilistic means irregular spiking, this is not new. Variability in spiking has been seen numerous times, for instance in human biceps brachii motor units during isometric contractions (Pascoe, Enoka, Exp physiology 2014) and elsewhere. Perhaps the distinction the authors are seeking is between fluctuation-driven and mean-driven spiking of motor units as previously identified in spinal motor networks (see Petersen and Berg, eLife 2016, and Berg, Frontiers 2017). Here, it was shown that a prominent regime of irregular spiking is present during rhythmic motor activity, which also manifests as a positive skewness in the spike count distribution (i.e., log-normal).

      We thank the Reviewer for this comment and have clarified several passages in response. The Reviewer is of course correct that irregular motor unit spiking has been described previously and may reflect motor neurons’ operating in a high-sensitivity (fluctuation-driven) regime. We now cite these papers in the Discussion (see ‘Firing rates in mouse locomotion compared to other species’). Additionally, the revision clarifies that “probabilistically” - as defined in our paper - refers only to the empirical observation that a motor unit spikes during only a subset of strides, either when all locomotor speeds are considered together (Figure 2) or separately (Figure 5A-C):

      “Motor units in both muscles exhibited this pattern of probabilistic recruitment (defined as a unit’s firing on only a fraction of strides), but with differing distributions of firing properties across the long and lateral heads (Figure 2).”

      “Our findings (Figure 4) highlight that even with the relatively high firing rates observed in mice, there are still significant changes in firing rate and recruitment probability across the spikes within bursts (Figure 4B) and across locomotor speeds (Figure 5F). Future studies should more carefully examine how these rapidly changing spiking patterns derive from both the statistics of synaptic inputs and intrinsic properties of motor neurons (Manuel & Heckman, 2011; Petersen & Berg, 2016; Berg, 2017).”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      As mentioned above, there are several issues with the statistics that need to be corrected to properly support the claims made in the paper.

      The authors compare the fractions of MUs that show significant variation across locomotor speeds in their firing rate and recruitment probability. However, it is not statistically sound to compare the results of separate statistical tests that are based on different kinds of measurements and thus have unconstrained differences in statistical power. The comparison of the fractional changes in firing rates and recruitment across speeds that follows is helpful, though in truth, by contemporary standards, one would like to see error bars on these estimates. These could be generated using bootstrapping.

      The Reviewer is correct, and we have revised the manuscript to better clarify which quantities should or should not be compared, including the following passage (see “Motor unit mechanisms of speed control” in Results):

      “Speed-dependent increases in peak firing rate were therefore also present in our dataset, although in a smaller fraction of motor units (22/33) than changes in recruitment probability (31/33). Furthermore, the mean (± SE) magnitude of speed-dependent increases was smaller for spike rates (mean rate<sub>fast</sub>/rate<sub>slow</sub> of 111% ± 20% across all motor units) than for recruitment probabilities (mean p(recruitment)<sub>fast</sub>/p(recruitment)<sub>slow</sub> of 179% ± 3% across all motor units). While fractional changes in rate and recruitment probability are not readily comparable given their different upper limits, these findings could suggest that while both recruitment and peak rate change across speed quartiles, increased recruitment probability may play a larger role in driving changes in locomotor speed.”

      The description in the Methods of the tests for variation in firing rates and recruitment probability across speeds is extremely hard to understand - after reading many times, it is still not clear what was done, or why the method used was chosen. In the main text, the authors quote p-values and then state "bootstrap confidence intervals," which is not a statistical test that yields a p-value. While there are mathematical relationships between confidence intervals and statistical tests such that a one-to-one correspondence between them can exist, the descriptions provided fall short of specifying how they are related in the present instance. For this reason, and those described in what follows, it is not clear what the p-values represent.

      Next, the authors refer to fitting a model ("a Poisson distribution") to the data to estimate firing rate and recruitment probability, that the model results agree with their actual data, and that they then bootstrapped from the model estimates to get confidence intervals and compute p-values. Why do this? Why not just do something much simpler, like use the actual spike counts, and resample from those? I understand that it is hard to distinguish between no recruitment and just no spikes given some low Poisson firing rate, but how does that challenge the ability to test if the firing rates or the number of spiking MUs changes significantly across speeds? I can come up with some reasons why I think the authors might have decided to do this, but reasoning like this really should be made explicit.

      In addition, the authors should provide an unambiguous description of the model, perhaps using an equation and a description of how it was fit. For the bootstrapping, a clear description of how the resampling was done should be included. The focus on peak firing rate instead of mean (or median) firing rate should also be justified. Since peaks are noisier, I would expect the statistical power to be lower compared to using the mean or median.

      We thank the Reviewer for the comments and have revised and expanded our discussion of the statistical tests employed. We expanded and clarified our description of these techniques in the updated Methods section:

      “Joint model of rate and recruitment

      We modeled the recruitment probability and firing rate based on empirical data to best characterize firing statistics within the stride. Particularly, this allowed for multiple solutions to explain why a motor unit would not spike within a stride. From the empirical data alone, strides with zero spikes would have been assumed to have no recruitment of a unit. However, to create a model of motor unit activity that includes both recruitment and rate, it must be possible that a recruited unit can have a firing rate of zero. To quantify the firing statistics that best represent all spiking and non-spiking patterns, we modeled recruitment probability and peak firing rate along the following piecewise function:

      where y denotes the observed peak firing rate on a given stride (determined by convolving motor unit spike times with a Gaussian kernel as described above), p denotes the probability of recruitment, and λ denotes the expected peak firing rate from a Poisson distribution of outcomes. Thus, an inactive unit on a given stride may be the result of either non-recruitment or recruitment with a stochastically zero firing rate. The above equations were fit by minimizing the negative log-likelihood of the parameters given the data.
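
      The piecewise function itself is not reproduced in this excerpt. As a hedged illustration only, one model consistent with the verbal description is a zero-inflated Poisson, in which a stride yields zero either through non-recruitment (probability 1 - p) or through recruitment with a Poisson outcome of zero. The sketch below assumes that form and assumes y has been converted to a non-negative integer count. It reaches the minimum of the negative log-likelihood by expectation-maximization, which may differ from the optimizer the authors used; all names are hypothetical.

```python
import math


def zip_negative_log_likelihood(counts, p, lam):
    """Negative log-likelihood of an assumed zero-inflated Poisson model.

    P(y = 0) = (1 - p) + p * exp(-lam)
    P(y = k) = p * lam**k * exp(-lam) / k!   for k >= 1
    """
    nll = 0.0
    for y in counts:
        if y == 0:
            nll -= math.log((1.0 - p) + p * math.exp(-lam))
        else:
            nll -= math.log(p) - lam + y * math.log(lam) - math.lgamma(y + 1)
    return nll


def fit_zip(counts, n_iter=500):
    """Estimate (p, lam) by EM, which converges to a likelihood maximum."""
    n = len(counts)
    total = sum(counts)
    n_zero = sum(1 for y in counts if y == 0)
    if total == 0:
        # No spikes on any stride: p and lam are not separately identifiable.
        return 0.0, 0.0
    p = (n - n_zero) / n           # start from the observed active fraction
    lam = total / (n - n_zero)     # start from the mean of non-zero strides
    for _ in range(n_iter):
        # E-step: probability that a zero stride came from a recruited unit.
        zero_resp = p * math.exp(-lam) / ((1.0 - p) + p * math.exp(-lam))
        expected_recruited = (n - n_zero) + n_zero * zero_resp
        # M-step: update both parameters from the expected recruited strides.
        p = expected_recruited / n
        lam = total / expected_recruited
    return p, lam
```

      Under this assumed form, the fitted p is never smaller than the raw fraction of strides with spikes, because some silent strides are attributed to recruited units that happened to produce no spikes.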

      “Permutation test for joint model of rate and recruitment and type 2 regression slopes

      To quantify differences in firing patterns across walking speeds, we subdivided each mouse’s total set of strides into speed quartiles and calculated rate (λ, Eq. 1 and 2, Fig. 5A-C) and recruitment probability terms (p, Eq. 1 and 2, Fig. 5D-F) for each unit in each speed quartile. Here we calculated the difference in both the rate and recruitment terms across the fastest and slowest speed quartiles (p<sub>fast</sub>-p<sub>slow</sub> and λ<sub>fast</sub>-λ<sub>slow</sub>). To test whether these model parameters were significantly different depending on locomotor speed, we developed a null model combining strides from both the fastest and slowest speed quartiles. After pooling strides from both quartiles, we randomly distributed the pooled set of strides into two groups with sample sizes equal to the original slow and fast quartiles. We then calculated the null model parameters for each new group and found the difference between like terms. To estimate the distribution of possible differences, we bootstrapped this result using 1000 random redistributions of the pooled set of strides. Following the permutation test, the 95% confidence interval of this final distribution reflects the null hypothesis of no difference between groups. Thus, the null hypothesis can be rejected if the true difference in rate or recruitment terms exceeds this confidence interval.

      We followed a similar procedure to quantify cross-muscle differences in the relationship between firing parameters. For each muscle, we estimated the slope across firing parameters for each motor unit using type 2 regression. In this case, the true difference was the difference in slopes between muscles. To test the null hypothesis that there was no difference in slopes, the null model reflected the pooled set of units from both muscles. Again, slopes were calculated for 1000 random resamplings of this pooled data to estimate the 95% confidence interval.”
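
      The pooling-and-redistribution procedure described above can be sketched generically. This is an illustration under stated assumptions rather than the authors' implementation: the statistic used in the example (fraction of strides with at least one spike) is a simple stand-in for the fitted model parameters, and the function names and seed are hypothetical.

```python
import random


def permutation_null_interval(slow, fast, statistic, n_perm=1000, alpha=0.05, seed=0):
    """Permutation test for a difference in `statistic` between two stride groups.

    Returns (observed difference, (lower, upper) null interval, reject flag).
    """
    rng = random.Random(seed)
    pooled = list(slow) + list(fast)
    n_slow = len(slow)
    null_diffs = []
    for _ in range(n_perm):
        # Redistribute pooled strides into groups of the original sizes.
        rng.shuffle(pooled)
        null_diffs.append(statistic(pooled[n_slow:]) - statistic(pooled[:n_slow]))
    null_diffs.sort()
    lower = null_diffs[int(round((alpha / 2.0) * n_perm))]
    upper = null_diffs[int(round((1.0 - alpha / 2.0) * n_perm)) - 1]
    observed = statistic(fast) - statistic(slow)
    reject = not (lower <= observed <= upper)
    return observed, (lower, upper), reject


def fraction_active(spike_counts):
    """Stand-in statistic: fraction of strides with at least one spike."""
    return sum(1 for c in spike_counts if c > 0) / len(spike_counts)


# Toy data: per-stride spike counts in the slowest and fastest quartiles.
slow_counts = [0] * 80 + [1] * 20
fast_counts = [0] * 30 + [2] * 70
observed, interval, reject = permutation_null_interval(
    slow_counts, fast_counts, fraction_active)
```

      The same function applies to the cross-muscle comparison if the inputs are per-unit values and `statistic` returns a regression slope, since only the pooled elements and the statistic change.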

      The argument for delayed activation of the lateral head is interesting, but I am not comfortable saying the nervous system creates a delay just based on observations of the mean time of the first spike, given the potential for differential variability in spike timing across muscles and MUs. One way to make a strong case for a delay would be to show aggregate PSTHs for all the spikes from all the MUs for each of the two heads. That would distinguish between a true delay and more gradual or variable activation between the heads.

      This is a good point and we agree that the claim made about the nervous system is too strong given the results. Even with the aggregate histogram that the Reviewer suggested (Author response image 2, below), there is still not enough evidence to isolate the role of the nervous system in the muscles’ activation.

      Author response image 2.

      Aggregate peristimulus time histogram (PSTH) for all motor unit spike times in the long head (top) and lateral head (bottom) within the stride.

      In the ideal case, we would have more simultaneous recordings from both muscles to make a more direct claim on the delay. Still, within the current scope of the paper, to correct this and better describe the difference in timing of muscle activity, we edited the text to the following:

      “These findings demonstrate that despite the synergistic (extensor) function of the long and lateral heads of the triceps at the elbow, the motor pool for the long head becomes active roughly 100 ms before the motor pool supplying the lateral head during locomotion (Figure 3C).”

      The results from Marshall et al. 2022 suggest that the recruitment of some MUs is not just related to muscle force, but also the frequency of force variation - some of their MUs appear to be recruited only at certain frequencies. Figure 5C could have shown signs of this, but it does not appear to. We do not really know the force or its frequency of variation in the measurements here. I wonder whether there is additional analysis that could address whether frequency-dependent recruitment is present. It may not be addressable with the current data set, but this could be a fruitful direction to explore in the future with MU recordings from mice.

      We agree that this would be a fruitful direction to explore; however, the Reviewer is correct that this is not easily addressable with the dataset. As the Reviewer points out, stride frequency increases with increased speed, potentially offering the opportunity to examine how motor unit activity varies with the frequency, phase, and amplitude of locomotor movements. However, given our lack of force data (either joint torques or ground reaction forces), we cannot dissociate the frequency/phase/amplitude of skeletal kinematics from the frequency/phase/amplitude of muscle force. Marshall et al. (2022) mitigated these issues by using an isometric force-production task. Therefore, while we agree that it would be a major contribution to extend such investigations to whole-body movements like locomotion, given the complexities described above we believe this is a project for the future, and beyond the scope of the present study.

      Minor:

      Page 5: "Units often displayed no recruitment in a greater proportion of strides than for any particular spike count when recruited (Figures 2A, B)," - I had to read this several times to understand it. I suggest rephrasing for clarity.

      We have changed the text to read:

      “Units demonstrated a variety of firing patterns, with some units producing 0 spikes more frequently than any non-zero spike count (Figure 2A, B),...”

      Figure 3 legend: "Mean phase (± SE) of motor unit burst duration across all strides.": It is unclear what this means - durations are not usually described as having a phase. Do we mean the onset phase?

      We have changed the text to read:

      “Mean phase ± SE of motor unit burst activity within each stride”

      Page 9: "suggesting that the recruitment of individual motor units in the lateral and long heads might have significant (and opposite) effects on elbow angle in strides of similar speed (see Discussion)." I wouldn't say "opposite" here - that makes it sound like the authors are calling the long head a flexor. The authors should rephrase or clarify the sense in which they are opposite.

      This is a fair point and we agree we should not describe the muscles as ‘opposite’ when both muscles are extensors. We have removed the phrase ‘and opposite’ from the text.

      Page 11: "in these two muscles across in other quadrupedal species" - typo.

      We have corrected this error.

      Page 16: This reviewer cannot decipher after repeated attempts what the first two sentences of the last paragraph mean. - “Future studies might also use perturbations of muscle activity to dissociate the causal properties of each motor unit’s activity from the complex correlation structure of locomotion. Despite the strong correlations observed between motor unit recruitment and limb kinematics (Fig. 6, Supplemental Fig. 3), these results might reflect covariations of both factors with locomotor speed rather than the causal properties of the recorded motor unit.”

      For better clarity, we have changed the text to read:

      “Although strong correlations were observed between motor unit recruitment and limb kinematics during locomotion (Figure 6, Figure 6–figure supplement 1), it remains unclear whether such correlations actually reflect the causal contributions that those units make to limb movement. To resolve this ambiguity, future studies could use electrical or optical perturbations of muscle contraction levels (Kim et al., 2024; Lu et al., 2024; Srivastava et al., 2015, 2017) to test directly how motor unit firing patterns shape locomotor movements. The short-latency effects of patterned motor unit stimulation (Srivastava et al., 2017) could then reveal the sensitivity of behavior to changes in muscle spiking and the extent to which the same behaviors can be performed with many different motor commands.”

      Reviewer #2 (Recommendations for the authors):

      Minor comments:

      Introduction:

      (1) "Although studies in primates, cats, and zebrafish have shown that both the number of active motor units and motor unit firing rates increase at faster locomotor speeds (Grimby, 1984; Hoffer et al., 1981, 1987; Marshall et al., 2022; Menelaou & McLean, 2012)." I would remove Marshall et al. (2022) as their monkeys performed pulling tasks with the upper limb. You can alternatively remove locomotor from the sentence and replace it with contraction speed.

      Thank you for the comment. While we intended to reference this specific paper to highlight the rhythmic activity in muscles, we agree that this deviates from ‘locomotion’ as it is referenced in the other cited papers which study body movement. We have followed the Reviewer’s suggestion to remove the citation to Marshall et al.

      (2) "The capability and need for faster force generation during dynamic behavior could implicate motor unit recruitment as a primary mechanism for modulating force output in mice."

      The authors could add citations to this sentence, of works that showed that recruitment speed is the main determinant of the rate of force development (see for example Dideriksen et al. (2020) J Neurophysiol; J. L. Dideriksen, A. Del Vecchio, D. Farina, Neural and muscular determinants of maximal rate of force development. J Neurophysiol 123, 149-157 (2020)).

      Thank you for pointing out this important reference. We have included this as a citation as recommended.

      Results:

      (3) "Electrode arrays (32-electrode Myomatrix array model RF-4x8-BHS-5) were implanted in the triceps brachii (note that Figure 1D shows the EMG signal from only one of the 16 bipolar recording channels), and the resulting data were used to identify the spike times of individual motor units (Figure 1E) as described previously (Chung et al., 2023)."

      This sentence can be misleading for the reader as the array used by the researchers has 4 threads of 8 electrodes. Would it be possible to specify the number of electrodes implanted per head of interest? I assume 8 per head in most mice (or 4 bipolar channels), even if that's not specifically written in the manuscript.

      Thank you for the suggestion. As described above, we have added Table 1, which includes all array locations, and we edited the statement referenced in the comment as follows:

      “Electrode arrays (32-electrode Myomatrix array model RF-4x8-BHS-5) were implanted in forelimb muscles (note that Figure 1D shows the EMG signal from only one of the 16 bipolar recording channels), and the resulting data were used to identify the spike times of individual motor units in the triceps brachii long and lateral heads (Table 1, Figure 1E) as described previously (Chung et al., 2023).”

      (4) "These findings demonstrate that despite the overlapping biomechanical functions of the long and lateral heads of the triceps, the nervous system creates a consistent, approximately 100 ms delay (Figure 3C) between the activation of the two muscles' motor neuron pools. This timing difference suggests distinct patterns of synaptic input onto motor neurons innervating the lateral and long heads."

      Both muscles don't have fully overlapping biomechanical functions, as one of them also acts on the shoulder joint. Please be more specific in this sentence, saying that both muscles are synergistic at the elbow level rather than "have overlapping biomechanical functions".

      We agree with the above reasoning and that our manuscript should be clearer on this point. We edited the above text in accordance with the Reviewer suggestion as follows:

      "These findings demonstrate that despite the synergistic (extensor) function of the long and lateral heads of the triceps at the elbow, …”  

      (5) "Together with the differences in burst timing shown in Figure 3B, these results again suggest that the motor pools for the lateral and long heads of the triceps receive distinct patterns of synaptic input, although differences in the intrinsic physiological properties of motor neurons innervating the two muscles might also play an important role."

      It is difficult to draw such an affirmative conclusion on the synaptic inputs from the data presented by the authors. The differences in firing rates may solely arise from other factors than distinct synaptic inputs, such as the different intrinsic properties of the motoneurons or the reception of distinct neuromodulatory inputs.

      To better explain our findings, we adjusted the above text in the Results (see “Motor unit firing patterns in the long and lateral heads of the triceps”):

      “Together with the differences in burst timing shown in Figure 3B, these results again suggest that the motor pools for the lateral and long heads of the triceps receive distinct patterns of synaptic input, although differences in the intrinsic physiological properties of motor neurons innervating the two muscles might also play an important role.”

      We also included the following distinction in the Discussion (see “Differences in motor unit activity patterns across two elbow extensors”) to address the other plausible mechanisms mentioned.

      “The large differences in burst timing and spike patterning across the muscle heads suggest that the motor pools for each muscle receive distinct inputs. However, differences in the intrinsic physiological properties of motor units and neuromodulatory inputs across motor pools might also make substantial contributions to the structure of motor unit spike patterns (Martínez-Silva et al., 2018; Miles & Sillar, 2011).”

      (6) "We next examined whether the probabilistic recruitment of individual motor units in the triceps and elbow extensor muscle predicted stride-by-stride variations in elbow angle kinematics."

      I'm not sure that the wording is appropriate here. The analysis does not predict elbow angle variations from parameters extracted from the spiking activity. It rather compares the average elbow angle between two conditions (motor unit active or not active).

      We thank the Reviewer for this comment and agree that the wording could be improved here to better reflect our analysis. To lower the strength of our claim, we replaced usage of the word ‘predict’ with ‘correlates’ in the above text and throughout the paper when discussing this result.
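As a rough illustration of the comparison described in comment (6), the sketch below splits strides by whether a unit fired at all and averages elbow angle within each group. The function name, the per-stride inputs, and the toy values are hypothetical and are not taken from the manuscript; the actual analysis may differ in how strides and angles are defined.

```python
from statistics import mean

def mean_angle_by_recruitment(spike_counts, elbow_angles):
    """Average elbow angle over strides where the unit fired vs. stayed silent.

    spike_counts: spikes emitted by one motor unit in each stride (hypothetical input)
    elbow_angles: one elbow-angle summary value per stride, same order
    Returns (mean_recruited, mean_silent); a group with no strides yields None.
    """
    recruited = [a for c, a in zip(spike_counts, elbow_angles) if c > 0]
    silent = [a for c, a in zip(spike_counts, elbow_angles) if c == 0]
    return (mean(recruited) if recruited else None,
            mean(silent) if silent else None)

# Toy data for five strides (illustrative values only)
counts = [0, 2, 0, 3, 1]
angles = [90.0, 100.0, 92.0, 104.0, 102.0]
recruited_mean, silent_mean = mean_angle_by_recruitment(counts, angles)
```

A difference between the two group means describes a correlation between recruitment and kinematics; it does not by itself predict elbow angle from spiking.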

      Methods:

      (7) "Using the four threads on the customizable Myomatrix array (RF-4x8-BHS-5), we implanted a combination of muscles in each mouse, sometimes using multiple threads within the same muscle. [...] Some mice also had threads simultaneously implanted in their ipsilateral or contralateral biceps brachii although no data from the biceps is presented in this study."

      A precise description of the localisation of the array (muscles and the number of arrays per muscle) for each animal would be appreciated.

      (8) "A total of 33 units were identified and manually verified across all animals." A precise description of the number of motor units concurrently identified per muscle and per animal would be appreciated. Moreover, please add details on the manual inspection. Does it involve the manual selection of missing spikes? What are the criteria for considering an identified motor unit as valid?

      As discussed earlier, we added Table 1 to the main text to provide the details mentioned in the above comments.

      Regarding spike sorting, given the very large number of spikes recorded, we did not rely on manually adjusting mislabeled spikes. Instead, as described in the revised Methods section, we verified unit isolation by ensuring units had >98% of spikes outside of 1 ms of each other. Moreover, as described above we have added new analyses (Figure 1–figure supplement 1) confirming the stability of motor unit waveforms across both the duration of individual recording sessions (roughly 30 minutes) and across the rapid changes in limb position within individual stride cycles (roughly 250 msec).
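One possible reading of the isolation criterion above is a check on inter-spike intervals: the fraction of consecutive spike pairs separated by at least 1 ms should exceed 0.98. The sketch below assumes that reading; the function name, the use of seconds, and the toy spike train are hypothetical, and the authors' actual procedure may be defined differently.

```python
def fraction_outside_window(spike_times_s, window_s=0.001):
    """Fraction of inter-spike intervals that are at least window_s long.

    spike_times_s: spike times of one sorted unit, in seconds (hypothetical input)
    window_s: minimum allowed separation; 1 ms here, following the stated criterion
    """
    times = sorted(spike_times_s)
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    if not intervals:
        return 1.0  # fewer than two spikes: no violations are possible
    valid = sum(1 for isi in intervals if isi >= window_s)
    return valid / len(intervals)

# Toy spike train with one 0.5 ms interval (illustrative values only)
spikes = [0.000, 0.010, 0.0105, 0.030, 0.050]
frac = fraction_outside_window(spikes)
passes = frac > 0.98  # assumed acceptance threshold
```

In this toy example one of four intervals is shorter than 1 ms, so the unit would fail the assumed threshold.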

      Reviewer #3 (Recommendations for the authors):

      Figure 2 (and supplement) show spike count distributions with strong positive skewness, which is in accordance with the prediction of a fluctuation-driven regime. I suggest plotting these on a logarithmic x-axis (in addition to the linear axis), which should reveal a bell-shaped distribution, maybe even Gaussian, in a majority of the units.

      We thank the Reviewer for the suggestion. We present the requested analysis below, which shows bell-shaped distributions for some (but not all) units. However, we believe that investigating why some replotted distributions are Gaussian and others are not falls beyond the scope of this paper, and likely requires a larger dataset than the one we were able to obtain.

      Author response image 3.

      Spike count distributions for each motor unit on a logarithmic x-axis.
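For readers who want to reproduce this kind of replotting, a minimal sketch of binning spike counts on a logarithmic axis follows. It assumes octave-wide (base-2) bins and drops strides with zero spikes, since the logarithm of zero is undefined; both choices, along with the function name and toy counts, are illustrative assumptions rather than the method used for the author response image.

```python
import math
from collections import Counter

def log2_histogram(spike_counts):
    """Histogram of nonzero spike counts in bins [2**k, 2**(k+1)).

    Returns a dict mapping the bin exponent k to the number of strides
    whose spike count falls in that bin. Zero-spike strides are skipped.
    """
    bins = Counter(int(math.log2(c)) for c in spike_counts if c > 0)
    return dict(sorted(bins.items()))

# Toy per-stride spike counts for one unit (illustrative values only)
counts = [0, 1, 2, 3, 4, 5, 8, 0]
hist = log2_histogram(counts)
```

A roughly symmetric histogram over the bin exponents would be consistent with a lognormal-like distribution of nonzero spike counts.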

      Why not more data? I tried to get an overview of how much data was collected.

      Supplemental Figure 1 has all the isolated units, which amounts to 38 (are the colors the two muscle types?). Given there are 16 leads in each myomatrix, in two muscles, of six mice, this seems like a low yield. Could the authors comment on the reasons for this low yield?

      Regarding motor unit yield, even with multiple electrodes per muscle and a robust sorting algorithm, we often isolated only a few units per muscle. This yield likely reflects two factors. First, because of the highly dynamic nature of locomotion and high levels of muscle contraction, isolating individual spikes reliably across different locomotor speeds is inherently challenging, regardless of the algorithm being employed. Second, because the results of spike-train analyses can be highly sensitive to sorting errors, we have only included the motor units that we can sort with the highest possible confidence across thousands of strides.

      Minor:

      Figure captions especially Figure 6: The text is excessively long. Can the text be shortened?

      We thank the Reviewer for this comment. Generally, we seek to include a description of the methods and results within the figure captions, but we concede that we can condense the information in some cases. In a number of cases, we have moved some of the descriptive text from the caption to the Methods section.

      References

      Berg, R. W. (2017). Neuronal Population Activity in Spinal Motor Circuits: Greater Than the Sum of Its Parts. Frontiers in Neural Circuits, 11. https://doi.org/10.3389/fncir.2017.00103

      Biewener, A. A., Blickhan, R., Perry, A. K., Heglund, N. C., & Taylor, C. R. (1988). Muscle Forces During Locomotion in Kangaroo Rats: Force Platform and Tendon Buckle Measurements Compared. Journal of Experimental Biology, 137(1), 191–205. https://doi.org/10.1242/jeb.137.1.191

      Chung, B., Zia, M., Thomas, K. A., Michaels, J. A., Jacob, A., Pack, A., Williams, M. J., Nagapudi, K., Teng, L. H., Arrambide, E., Ouellette, L., Oey, N., Gibbs, R., Anschutz, P., Lu, J., Wu, Y., Kashefi, M., Oya, T., Kersten, R., … Sober, S. J. (2023). Myomatrix arrays for high-definition muscle recording. eLife, 12, RP88551. https://doi.org/10.7554/eLife.88551

      De Luca, C. J. (1985). Control properties of motor units. Journal of Experimental Biology, 115(1), 125–136. https://doi.org/10.1242/jeb.115.1.125

      De Luca, C. J., & Erim, Z. (1994). Common drive of motor units in regulation of muscle force. Trends in Neurosciences, 17(7), 299–305. https://doi.org/10.1016/0166-2236(94)90064-7

      Farina, D., Negro, F., & Dideriksen, J. L. (2014). The effective neural drive to muscles is the common synaptic input to motor neurons. The Journal of Physiology, 592(16), 3427–3441. https://doi.org/10.1113/jphysiol.2014.273581

      Hartigan, P. M. (1985). Algorithm AS 217: Computation of the Dip Statistic to Test for Unimodality. Applied Statistics, 34(3), 320. https://doi.org/10.2307/2347485

      Henneman, E., Somjen, G., & Carpenter, D. O. (1965). FUNCTIONAL SIGNIFICANCE OF CELL SIZE IN SPINAL MOTONEURONS. Journal of Neurophysiology, 28(3), 560–580. https://doi.org/10.1152/jn.1965.28.3.560

      Karabulut, D., Dogru, S. C., Lin, Y.-C., Pandy, M. G., Herzog, W., & Arslan, Y. Z. (2020). Direct Validation of Model-Predicted Muscle Forces in the Cat Hindlimb During Locomotion. Journal of Biomechanical Engineering, 142(5), 051014. https://doi.org/10.1115/1.4045660

      Kim, J. J., Wyche, I. S., Olson, W., Lu, J., Bakir, M. S., Sober, S. J., & O’Connor, D. H. (2024). Myo-optogenetics: Optogenetic stimulation and electrical recording in skeletal muscles. https://doi.org/10.1101/2024.06.21.600113

      Lu, J., Zia, M., Baig, D. A., Yan, G., Kim, J. J., Nagapudi, K., Anschutz, P., Oh, S., O’Connor, D., Sober, S. J., & Bakir, M. S. (2024). Opto-Myomatrix: μLED integrated microelectrode arrays for optogenetic activation and electrical recording in muscle tissue. https://doi.org/10.1101/2024.07.01.601601

      Manuel, M., & Heckman, C. J. (2011). Adult mouse motor units develop almost all of their force in the subprimary range: A new all-or-none strategy for force recruitment? Journal of Neuroscience, 31(42), 15188–15194. https://doi.org/10.1523/JNEUROSCI.2893-11.2011

      Marshall, N. J., Glaser, J. I., Trautmann, E. M., Amematsro, E. A., Perkins, S. M., Shadlen, M. N., Abbott, L. F., Cunningham, J. P., & Churchland, M. M. (2022). Flexible neural control of motor units. Nature Neuroscience, 25(11), 1492–1504. https://doi.org/10.1038/s41593-022-01165-8

      Martínez-Silva, M. de L., Imhoff-Manuel, R. D., Sharma, A., Heckman, C. J., Shneider, N. A., Roselli, F., Zytnicki, D., & Manuel, M. (2018). Hypoexcitability precedes denervation in the large fast-contracting motor units in two unrelated mouse models of ALS. eLife, 7(2007), 1–26. https://doi.org/10.7554/eLife.30955

      Miles, G. B., & Sillar, K. T. (2011). Neuromodulation of Vertebrate Locomotor Control Networks. Physiology, 26(6), 393–411. https://doi.org/10.1152/physiol.00013.2011

      Petersen, P. C., & Berg, R. W. (2016). Lognormal firing rate distribution reveals prominent fluctuation–driven regime in spinal motor networks. eLife, 5. https://doi.org/10.7554/elife.18805

      Srivastava, K. H., Elemans, C. P. H., & Sober, S. J. (2015). Multifunctional and Context-Dependent Control of Vocal Acoustics by Individual Muscles. The Journal of Neuroscience, 35(42), 14183–14194. https://doi.org/10.1523/JNEUROSCI.3610-14.2015

      Srivastava, K. H., Holmes, C. M., Vellema, M., Pack, A. R., Elemans, C. P. H., Nemenman, I., & Sober, S. J. (2017). Motor control by precisely timed spike patterns. Proceedings of the National Academy of Sciences of the United States of America, 114(5), 1171–1176. https://doi.org/10.1073/pnas.1611734114

    1. Act I, Scene 1 Verona. A public place. [Enter SAMPSON and GREGORY, of the house of Capulet, armed with swords and bucklers] Sampson. Gregory, o' my word, we'll not carry coals. Gregory. No, for then we should be colliers. Sampson. I mean, an we be in choler, we'll draw. Gregory. Ay, while you live, draw your neck out o' the collar. Sampson. I strike quickly, being moved. Gregory. But thou art not quickly moved to strike. Sampson. A dog of the house of Montague moves me. Gregory. To move is to stir; and to be valiant is to stand: therefore, if thou art moved, thou runn'st away. Sampson. A dog of that house shall move me to stand: I will take the wall of any man or maid of Montague's. Gregory. That shows thee a weak slave; for the weakest goes to the wall. Sampson. True; and therefore women, being the weaker vessels, are ever thrust to the wall: therefore I will push Montague's men from the wall, and thrust his maids to the wall. Gregory. The quarrel is between our masters and us their men. Sampson. 'Tis all one, I will show myself a tyrant: when I have fought with the men, I will be cruel with the maids, and cut off their heads. Gregory. The heads of the maids? Sampson. Ay, the heads of the maids, or their maidenheads; take it in what sense thou wilt. Gregory. They must take it in sense that feel it. Sampson. Me they shall feel while I am able to stand: and 'tis known I am a pretty piece of flesh. Gregory. 'Tis well thou art not fish; if thou hadst, thou hadst been poor John. Draw thy tool! here comes two of the house of the Montagues. Sampson. My naked weapon is out: quarrel, I will back thee. Gregory. How! turn thy back and run? Sampson. Fear me not. Gregory. No, marry; I fear thee! Sampson. Let us take the law of our sides; let them begin. Gregory. I will frown as I pass by, and let them take it as they list. Sampson. Nay, as they dare. I will bite my thumb at them; which is a disgrace to them, if they bear it.
[Enter ABRAHAM and BALTHASAR] Abraham. Do you bite your thumb at us, sir? Sampson. I do bite my thumb, sir. Abraham. Do you bite your thumb at us, sir? Sampson. [Aside to GREGORY] Is the law of our side, if I say ay? Gregory. No. Sampson. No, sir, I do not bite my thumb at you, sir, but I bite my thumb, sir. Gregory. Do you quarrel, sir? Abraham. Quarrel sir! no, sir. Sampson. If you do, sir, I am for you: I serve as good a man as you. Abraham. No better. Sampson. Well, sir. Gregory. Say 'better:' here comes one of my master's kinsmen. Sampson. Yes, better, sir. Abraham. You lie. Sampson. Draw, if you be men. Gregory, remember thy swashing blow. [They fight] [Enter BENVOLIO] Benvolio. Part, fools! Put up your swords; you know not what you do. [Beats down their swords] [Enter TYBALT] Tybalt. What, art thou drawn among these heartless hinds? Turn thee, Benvolio, look upon thy death. Benvolio. I do but keep the peace: put up thy sword, Or manage it to part these men with me. Tybalt. What, drawn, and talk of peace! I hate the word, As I hate hell, all Montagues, and thee: Have at thee, coward! [They fight] [Enter, several of both houses, who join the fray; then enter Citizens, with clubs] First Citizen. Clubs, bills, and partisans! strike! beat them down! Down with the Capulets! down with the Montagues! [Enter CAPULET in his gown, and LADY CAPULET] Capulet. What noise is this? Give me my long sword, ho! Lady Capulet. A crutch, a crutch! why call you for a sword? Capulet. My sword, I say! Old Montague is come, And flourishes his blade in spite of me. [Enter MONTAGUE and LADY MONTAGUE] Montague. Thou villain Capulet,—Hold me not, let me go. Lady Montague. Thou shalt not stir a foot to seek a foe. [Enter PRINCE, with Attendants] Prince Escalus. Rebellious subjects, enemies to peace, Profaners of this neighbour-stained steel,— Will they not hear? What, ho!
you men, you beasts, That quench the fire of your pernicious rage With purple fountains issuing from your veins, On pain of torture, from those bloody hands Throw your mistemper'd weapons to the ground, And hear the sentence of your moved prince. Three civil brawls, bred of an airy word, By thee, old Capulet, and Montague, Have thrice disturb'd the quiet of our streets, And made Verona's ancient citizens Cast by their grave beseeming ornaments, To wield old partisans, in hands as old, Canker'd with peace, to part your canker'd hate: If ever you disturb our streets again, Your lives shall pay the forfeit of the peace. For this time, all the rest depart away: You Capulet; shall go along with me: And, Montague, come you this afternoon, To know our further pleasure in this case, To old Free-town, our common judgment-place. Once more, on pain of death, all men depart. [Exeunt all but MONTAGUE, LADY MONTAGUE, and BENVOLIO] Montague. Who set this ancient quarrel new abroach? Speak, nephew, were you by when it began? Benvolio. Here were the servants of your adversary, And yours, close fighting ere I did approach: I drew to part them: in the instant came The fiery Tybalt, with his sword prepared, Which, as he breathed defiance to my ears, He swung about his head and cut the winds, Who nothing hurt withal hiss'd him in scorn: While we were interchanging thrusts and blows, Came more and more and fought on part and part, Till the prince came, who parted either part. Lady Montague. O, where is Romeo? saw you him to-day? Right glad I am he was not at this fray. Benvolio.
Madam, an hour before the worshipp'd sun Peer'd forth the golden window of the east, A troubled mind drave me to walk abroad; Where, underneath the grove of sycamore That westward rooteth from the city's side, So early walking did I see your son: Towards him I made, but he was ware of me And stole into the covert of the wood: I, measuring his affections by my own, That most are busied when they're most alone, Pursued my humour not pursuing his, And gladly shunn'd who gladly fled from me. Montague. Many a morning hath he there been seen, With tears augmenting the fresh morning dew. Adding to clouds more clouds with his deep sighs; But all so soon as the all-cheering sun Should in the furthest east begin to draw The shady curtains from Aurora's bed, Away from the light steals home my heavy son, And private in his chamber pens himself, Shuts up his windows, locks far daylight out And makes himself an artificial night: Black and portentous must this humour prove, Unless good counsel may the cause remove. Benvolio. My noble uncle, do you know the cause? Montague. I neither know it nor can learn of him. Benvolio. Have you importuned him by any means? Montague. Both by myself and many other friends: But he, his own affections' counsellor, Is to himself—I will not say how true— But to himself so secret and so close, So far from sounding and discovery, As is the bud bit with an envious worm, Ere he can spread his sweet leaves to the air, Or dedicate his beauty to the sun. Could we but learn from whence his sorrows grow. We would as willingly give cure as know. [Enter ROMEO] Benvolio. See, where he comes: so please you, step aside; I'll know his grievance, or be much denied. Montague. I would thou wert so happy by thy stay, To hear true shrift. Come, madam, let's away. [Exeunt MONTAGUE and LADY MONTAGUE] Benvolio. Good-morrow, cousin. Romeo. Is the day so young? Benvolio. But new struck nine. Romeo. Ay me! sad hours seem long.
Was that my father that went hence so fast? Benvolio. It was. What sadness lengthens Romeo's hours? Romeo. Not having that, which, having, makes them short. Benvolio. In love? Romeo. Out— Benvolio. Of love? Romeo. Out of her favour, where I am in love. Benvolio. Alas, that love, so gentle in his view, Should be so tyrannous and rough in proof! Romeo. Alas, that love, whose view is muffled still, Should, without eyes, see pathways to his will! Where shall we dine? O me! What fray was here? Yet tell me not, for I have heard it all. Here's much to do with hate, but more with love. Why, then, O brawling love! O loving hate! O any thing, of nothing first create! O heavy lightness! serious vanity! Mis-shapen chaos of well-seeming forms! Feather of lead, bright smoke, cold fire, sick health! Still-waking sleep, that is not what it is! This love feel I, that feel no love in this. Dost thou not laugh? Benvolio. No, coz, I rather weep. Romeo. Good heart, at what? Benvolio. At thy good heart's oppression. Romeo. Why, such is love's transgression. Griefs of mine own lie heavy in my breast, Which thou wilt propagate, to have it prest With more of thine: this love that thou hast shown Doth add more grief to too much of mine own. Love is a smoke raised with the fume of sighs; Being purged, a fire sparkling in lovers' eyes; Being vex'd a sea nourish'd with lovers' tears: What is it else? a madness most discreet, A choking gall and a preserving sweet. Farewell, my coz. Benvolio. Soft! I will go along; An if you leave me so, you do me wrong. Romeo. Tut, I have lost myself; I am not here; This is not Romeo, he's some other where. Benvolio. Tell me in sadness, who is that you love. Romeo. What, shall I groan and tell thee? Benvolio. Groan! why, no. But sadly tell me who. Romeo. Bid a sick man in sadness make his will: Ah, word ill urged to one that is so ill! In sadness, cousin, I do love a woman. Benvolio. I aim'd so near, when I supposed you loved.
Romeo. A right good mark-man! And she's fair I love. Benvolio. A right fair mark, fair coz, is soonest hit. Romeo. Well, in that hit you miss: she'll not be hit With Cupid's arrow; she hath Dian's wit; And, in strong proof of chastity well arm'd, From love's weak childish bow she lives unharm'd. She will not stay the siege of loving terms, Nor bide the encounter of assailing eyes, Nor ope her lap to saint-seducing gold: O, she is rich in beauty, only poor, That when she dies with beauty dies her store. Benvolio. Then she hath sworn that she will still live chaste? Romeo. She hath, and in that sparing makes huge waste, For beauty starved with her severity Cuts beauty off from all posterity. She is too fair, too wise, wisely too fair, To merit bliss by making me despair: She hath forsworn to love, and in that vow Do I live dead that live to tell it now. Benvolio. Be ruled by me, forget to think of her. Romeo. O, teach me how I should forget to think. Benvolio. By giving liberty unto thine eyes; Examine other beauties. Romeo. 'Tis the way To call hers exquisite, in question more: These happy masks that kiss fair ladies' brows Being black put us in mind they hide the fair; He that is strucken blind cannot forget The precious treasure of his eyesight lost: Show me a mistress that is passing fair, What doth her beauty serve, but as a note Where I may read who pass'd that passing fair? Farewell: thou canst not teach me to forget. Benvolio. I'll pay that doctrine, or else die in debt. [Exeunt]
Act I, Scene 2 A street. [Enter CAPULET, PARIS, and Servant] Capulet. But Montague is bound as well as I, In penalty alike; and 'tis not hard, I think, For men so old as we to keep the peace. Paris. Of honourable reckoning are you both; And pity 'tis you lived at odds so long. But now, my lord, what say you to my suit? Capulet.
But saying o'er what I have said before: My child is yet a stranger in the world; She hath not seen the change of fourteen years, Let two more summers wither in their pride, Ere we may think her ripe to be a bride. Paris. Younger than she are happy mothers made. Capulet. And too soon marr'd are those so early made. The earth hath swallow'd all my hopes but she, She is the hopeful lady of my earth: But woo her, gentle Paris, get her heart, My will to her consent is but a part; An she agree, within her scope of choice Lies my consent and fair according voice. This night I hold an old accustom'd feast, Whereto I have invited many a guest, Such as I love; and you, among the store, One more, most welcome, makes my number more. At my poor house look to behold this night Earth-treading stars that make dark heaven light: Such comfort as do lusty young men feel When well-apparell'd April on the heel Of limping winter treads, even such delight Among fresh female buds shall you this night Inherit at my house; hear all, all see, And like her most whose merit most shall be: Which on more view, of many mine being one May stand in number, though in reckoning none, Come, go with me. [To Servant, giving a paper] Go, sirrah, trudge about Through fair Verona; find those persons out Whose names are written there, and to them say, My house and welcome on their pleasure stay. [Exeunt CAPULET and PARIS] Servant. Find them out whose names are written here! It is written, that the shoemaker should meddle with his yard, and the tailor with his last, the fisher with his pencil, and the painter with his nets; but I am sent to find those persons whose names are here writ, and can never find what names the writing person hath here writ. I must to the learned.—In good time. [Enter BENVOLIO and ROMEO] Benvolio.
Tut, man, one fire burns out another's burning, One pain is lessen'd by another's anguish; Turn giddy, and be holp by backward turning; One desperate grief cures with another's languish: Take thou some new infection to thy eye, And the rank poison of the old will die. Romeo. Your plaintain-leaf is excellent for that. Benvolio. For what, I pray thee? Romeo. For your broken shin. Benvolio. Why, Romeo, art thou mad? Romeo. Not mad, but bound more than a mad-man is; Shut up in prison, kept without my food, Whipp'd and tormented and—God-den, good fellow. Servant. God gi' god-den. I pray, sir, can you read? Romeo. Ay, mine own fortune in my misery. Servant. Perhaps you have learned it without book: but, I pray, can you read any thing you see? Romeo. Ay, if I know the letters and the language. Servant. Ye say honestly: rest you merry! Romeo. Stay, fellow; I can read. [Reads] 'Signior Martino and his wife and daughters; County Anselme and his beauteous sisters; the lady widow of Vitravio; Signior Placentio and his lovely nieces; Mercutio and his brother Valentine; mine uncle Capulet, his wife and daughters; my fair niece Rosaline; Livia; Signior Valentio and his cousin Tybalt, Lucio and the lively Helena.' A fair assembly: whither should they come? Servant. Up. Romeo. Whither? Servant. To supper; to our house. Romeo. Whose house? Servant. My master's. Romeo. Indeed, I should have ask'd you that before. Servant. Now I'll tell you without asking: my master is the great rich Capulet; and if you be not of the house of Montagues, I pray, come and crush a cup of wine. Rest you merry! [Exit] Benvolio. At this same ancient feast of Capulet's Sups the fair Rosaline whom thou so lovest, With all the admired beauties of Verona: Go thither; and, with unattainted eye, Compare her face with some that I shall show, And I will make thee think thy swan a crow. Romeo.
When the devout religion of mine eye Maintains such falsehood, then turn tears to fires; And these, who often drown'd could never die, Transparent heretics, be burnt for liars! One fairer than my love! the all-seeing sun Ne'er saw her match since first the world begun. Benvolio. Tut, you saw her fair, none else being by, Herself poised with herself in either eye: But in that crystal scales let there be weigh'd Your lady's love against some other maid That I will show you shining at this feast, And she shall scant show well that now shows best. Romeo. I'll go along, no such sight to be shown, But to rejoice in splendor of mine own. [Exeunt]
Act I, Scene 3 A room in Capulet’s house. [Enter LADY CAPULET and Nurse] Lady Capulet. Nurse, where's my daughter? call her forth to me. Nurse. Now, by my maidenhead, at twelve year old, I bade her come. What, lamb! what, ladybird! God forbid! Where's this girl? What, Juliet! [Enter JULIET] Juliet. How now! who calls? Nurse. Your mother. Juliet. Madam, I am here. What is your will? Lady Capulet. This is the matter:—Nurse, give leave awhile, We must talk in secret:—nurse, come back again; I have remember'd me, thou's hear our counsel. Thou know'st my daughter's of a pretty age. Nurse. Faith, I can tell her age unto an hour. Lady Capulet. She's not fourteen. Nurse. I'll lay fourteen of my teeth,— And yet, to my teeth be it spoken, I have but four— She is not fourteen. How long is it now To Lammas-tide? Lady Capulet. A fortnight and odd days. Nurse. Even or odd, of all days in the year, Come Lammas-eve at night shall she be fourteen. Susan and she—God rest all Christian souls!— Were of an age: well, Susan is with God; She was too good for me: but, as I said, On Lammas-eve at night shall she be fourteen; That shall she, marry; I remember it well.
'Tis since the earthquake now eleven years; And she was wean'd,—I never shall forget it,— Of all the days of the year, upon that day: 410For I had then laid wormwood to my dug, Sitting in the sun under the dove-house wall; My lord and you were then at Mantua:— Nay, I do bear a brain:—but, as I said, When it did taste the wormwood on the nipple 415Of my dug and felt it bitter, pretty fool, To see it tetchy and fall out with the dug! Shake quoth the dove-house: 'twas no need, I trow, To bid me trudge: And since that time it is eleven years; 420For then she could stand alone; nay, by the rood, She could have run and waddled all about; For even the day before, she broke her brow: And then my husband—God be with his soul! A' was a merry man—took up the child: 425'Yea,' quoth he, 'dost thou fall upon thy face? Thou wilt fall backward when thou hast more wit; Wilt thou not, Jule?' and, by my holidame, The pretty wretch left crying and said 'Ay.' To see, now, how a jest shall come about! 430I warrant, an I should live a thousand years, I never should forget it: 'Wilt thou not, Jule?' quoth he; And, pretty fool, it stinted and said 'Ay.' Lady Capulet. Enough of this; I pray thee, hold thy peace. Nurse. Yes, madam: yet I cannot choose but laugh, 435To think it should leave crying and say 'Ay.' And yet, I warrant, it had upon its brow A bump as big as a young cockerel's stone; A parlous knock; and it cried bitterly: 'Yea,' quoth my husband,'fall'st upon thy face? 440Thou wilt fall backward when thou comest to age; Wilt thou not, Jule?' it stinted and said 'Ay.' Juliet. And stint thou too, I pray thee, nurse, say I. Nurse. Peace, I have done. God mark thee to his grace! Thou wast the prettiest babe that e'er I nursed: 445An I might live to see thee married once, I have my wish. Lady Capulet. Marry, that 'marry' is the very theme I came to talk of. Tell me, daughter Juliet, How stands your disposition to be married? 450 Juliet. It is an honour that I dream not of. Nurse. 
An honour! were not I thine only nurse, I would say thou hadst suck'd wisdom from thy teat. Lady Capulet. Well, think of marriage now; younger than you, Here in Verona, ladies of esteem, 455Are made already mothers: by my count, I was your mother much upon these years That you are now a maid. Thus then in brief: The valiant Paris seeks you for his love. Nurse. A man, young lady! lady, such a man 460As all the world—why, he's a man of wax. Lady Capulet. Verona's summer hath not such a flower. Nurse. Nay, he's a flower; in faith, a very flower. Lady Capulet. What say you? can you love the gentleman? This night you shall behold him at our feast; 465Read o'er the volume of young Paris' face, And find delight writ there with beauty's pen; Examine every married lineament, And see how one another lends content And what obscured in this fair volume lies 470Find written in the margent of his eyes. This precious book of love, this unbound lover, To beautify him, only lacks a cover: The fish lives in the sea, and 'tis much pride For fair without the fair within to hide: 475That book in many's eyes doth share the glory, That in gold clasps locks in the golden story; So shall you share all that he doth possess, By having him, making yourself no less. Nurse. No less! nay, bigger; women grow by men. 480 Lady Capulet. Speak briefly, can you like of Paris' love? Juliet. I'll look to like, if looking liking move: But no more deep will I endart mine eye Than your consent gives strength to make it fly. [Enter a Servant] Servant. Madam, the guests are come, supper served up, you called, my young lady asked for, the nurse cursed in the pantry, and every thing in extremity. I must hence to wait; I beseech you, follow straight. Lady Capulet. We follow thee. 490[Exit Servant] Juliet, the county stays. Nurse. Go, girl, seek happy nights to happy days. [Exeunt] previous scene       Act I, Scene 4 A street.       
next scene [Enter ROMEO, MERCUTIO, BENVOLIO, with five or six [p]Maskers, Torch-bearers, and others] Romeo. What, shall this speech be spoke for our excuse? Or shall we on without a apology? Benvolio. The date is out of such prolixity: We'll have no Cupid hoodwink'd with a scarf, 500Bearing a Tartar's painted bow of lath, Scaring the ladies like a crow-keeper; Nor no without-book prologue, faintly spoke After the prompter, for our entrance: But let them measure us by what they will; 505We'll measure them a measure, and be gone. Romeo. Give me a torch: I am not for this ambling; Being but heavy, I will bear the light. Mercutio. Nay, gentle Romeo, we must have you dance. Romeo. Not I, believe me: you have dancing shoes 510With nimble soles: I have a soul of lead So stakes me to the ground I cannot move. Mercutio. You are a lover; borrow Cupid's wings, And soar with them above a common bound. Romeo. I am too sore enpierced with his shaft 515To soar with his light feathers, and so bound, I cannot bound a pitch above dull woe: Under love's heavy burden do I sink. Mercutio. And, to sink in it, should you burden love; Too great oppression for a tender thing. 520 Romeo. Is love a tender thing? it is too rough, Too rude, too boisterous, and it pricks like thorn. Mercutio. If love be rough with you, be rough with love; Prick love for pricking, and you beat love down. Give me a case to put my visage in: 525A visor for a visor! what care I What curious eye doth quote deformities? Here are the beetle brows shall blush for me. Benvolio. Come, knock and enter; and no sooner in, But every man betake him to his legs. 530 Romeo. A torch for me: let wantons light of heart Tickle the senseless rushes with their heels, For I am proverb'd with a grandsire phrase; I'll be a candle-holder, and look on. The game was ne'er so fair, and I am done. 535 Mercutio. 
Tut, dun's the mouse, the constable's own word: If thou art dun, we'll draw thee from the mire Of this sir-reverence love, wherein thou stick'st Up to the ears. Come, we burn daylight, ho! Romeo. Nay, that's not so. 540 Mercutio. I mean, sir, in delay We waste our lights in vain, like lamps by day. Take our good meaning, for our judgment sits Five times in that ere once in our five wits. Romeo. And we mean well in going to this mask; 545But 'tis no wit to go. Mercutio. Why, may one ask? Romeo. I dream'd a dream to-night. Mercutio. And so did I. Romeo. Well, what was yours? 550 Mercutio. That dreamers often lie. Romeo. In bed asleep, while they do dream things true. Mercutio. O, then, I see Queen Mab hath been with you. She is the fairies' midwife, and she comes In shape no bigger than an agate-stone 555On the fore-finger of an alderman, Drawn with a team of little atomies Athwart men's noses as they lie asleep; Her wagon-spokes made of long spiders' legs, The cover of the wings of grasshoppers, 560The traces of the smallest spider's web, The collars of the moonshine's watery beams, Her whip of cricket's bone, the lash of film, Her wagoner a small grey-coated gnat, Not so big as a round little worm 565Prick'd from the lazy finger of a maid; Her chariot is an empty hazel-nut Made by the joiner squirrel or old grub, Time out o' mind the fairies' coachmakers. 
And in this state she gallops night by night 570Through lovers' brains, and then they dream of love; O'er courtiers' knees, that dream on court'sies straight, O'er lawyers' fingers, who straight dream on fees, O'er ladies ' lips, who straight on kisses dream, Which oft the angry Mab with blisters plagues, 575Because their breaths with sweetmeats tainted are: Sometime she gallops o'er a courtier's nose, And then dreams he of smelling out a suit; And sometime comes she with a tithe-pig's tail Tickling a parson's nose as a' lies asleep, 580Then dreams, he of another benefice: Sometime she driveth o'er a soldier's neck, And then dreams he of cutting foreign throats, Of breaches, ambuscadoes, Spanish blades, Of healths five-fathom deep; and then anon 585Drums in his ear, at which he starts and wakes, And being thus frighted swears a prayer or two And sleeps again. This is that very Mab That plats the manes of horses in the night, And bakes the elflocks in foul sluttish hairs, 590Which once untangled, much misfortune bodes: This is the hag, when maids lie on their backs, That presses them and learns them first to bear, Making them women of good carriage: This is she— 595 Romeo. Peace, peace, Mercutio, peace! Thou talk'st of nothing. Mercutio. True, I talk of dreams, Which are the children of an idle brain, Begot of nothing but vain fantasy, 600Which is as thin of substance as the air And more inconstant than the wind, who wooes Even now the frozen bosom of the north, And, being anger'd, puffs away from thence, Turning his face to the dew-dropping south. 605 Benvolio. This wind, you talk of, blows us from ourselves; Supper is done, and we shall come too late. Romeo. I fear, too early: for my mind misgives Some consequence yet hanging in the stars Shall bitterly begin his fearful date 610With this night's revels and expire the term Of a despised life closed in my breast By some vile forfeit of untimely death. But He, that hath the steerage of my course, Direct my sail! 
On, lusty gentlemen. 615 Benvolio. Strike, drum. [Exeunt] previous scene       Act I, Scene 5 A hall in Capulet’s house.         [Musicians waiting. Enter Servingmen with napkins] First Servant. Where's Potpan, that he helps not to take away? He shift a trencher? he scrape a trencher! 620 Second Servant. When good manners shall lie all in one or two men's hands and they unwashed too, 'tis a foul thing. First Servant. Away with the joint-stools, remove the court-cupboard, look to the plate. Good thou, save me a piece of marchpane; and, as thou lovest me, let 625the porter let in Susan Grindstone and Nell. Antony, and Potpan! Second Servant. Ay, boy, ready. First Servant. You are looked for and called for, asked for and sought for, in the great chamber. 630 Second Servant. We cannot be here and there too. Cheerly, boys; be brisk awhile, and the longer liver take all. [Enter CAPULET, with JULIET and others of his house, meeting the Guests and Maskers] Capulet. Welcome, gentlemen! ladies that have their toes Unplagued with corns will have a bout with you. 635Ah ha, my mistresses! which of you all Will now deny to dance? she that makes dainty, She, I'll swear, hath corns; am I come near ye now? Welcome, gentlemen! I have seen the day That I have worn a visor and could tell 640A whispering tale in a fair lady's ear, Such as would please: 'tis gone, 'tis gone, 'tis gone: You are welcome, gentlemen! come, musicians, play. A hall, a hall! give room! and foot it, girls. [Music plays, and they dance] 645More light, you knaves; and turn the tables up, And quench the fire, the room is grown too hot. Ah, sirrah, this unlook'd-for sport comes well. Nay, sit, nay, sit, good cousin Capulet; For you and I are past our dancing days: 650How long is't now since last yourself and I Were in a mask? Second Capulet. By'r lady, thirty years. Capulet. What, man! 
'tis not so much, 'tis not so much: 'Tis since the nuptials of Lucentio, 655Come pentecost as quickly as it will, Some five and twenty years; and then we mask'd. Second Capulet. 'Tis more, 'tis more, his son is elder, sir; His son is thirty. Capulet. Will you tell me that? 660His son was but a ward two years ago. Romeo. [To a Servingman] What lady is that, which doth enrich the hand Of yonder knight? Servant. I know not, sir. 665 Romeo. O, she doth teach the torches to burn bright! It seems she hangs upon the cheek of night Like a rich jewel in an Ethiope's ear; Beauty too rich for use, for earth too dear! So shows a snowy dove trooping with crows, 670As yonder lady o'er her fellows shows. The measure done, I'll watch her place of stand, And, touching hers, make blessed my rude hand. Did my heart love till now? forswear it, sight! For I ne'er saw true beauty till this night. 675 Tybalt. This, by his voice, should be a Montague. Fetch me my rapier, boy. What dares the slave Come hither, cover'd with an antic face, To fleer and scorn at our solemnity? Now, by the stock and honour of my kin, 680To strike him dead, I hold it not a sin. Capulet. Why, how now, kinsman! wherefore storm you so? Tybalt. Uncle, this is a Montague, our foe, A villain that is hither come in spite, To scorn at our solemnity this night. 685 Capulet. Young Romeo is it? Tybalt. 'Tis he, that villain Romeo. Capulet. Content thee, gentle coz, let him alone; He bears him like a portly gentleman; And, to say truth, Verona brags of him 690To be a virtuous and well-govern'd youth: I would not for the wealth of all the town Here in my house do him disparagement: Therefore be patient, take no note of him: It is my will, the which if thou respect, 695Show a fair presence and put off these frowns, And ill-beseeming semblance for a feast. Tybalt. It fits, when such a villain is a guest: I'll not endure him. Capulet. He shall be endured: 700What, goodman boy! 
I say, he shall: go to; Am I the master here, or you? go to. You'll not endure him! God shall mend my soul! You'll make a mutiny among my guests! You will set cock-a-hoop! you'll be the man! 705 Tybalt. Why, uncle, 'tis a shame. Capulet. Go to, go to; You are a saucy boy: is't so, indeed? This trick may chance to scathe you, I know what: You must contrary me! marry, 'tis time. 710Well said, my hearts! You are a princox; go: Be quiet, or—More light, more light! For shame! I'll make you quiet. What, cheerly, my hearts! Tybalt. Patience perforce with wilful choler meeting Makes my flesh tremble in their different greeting. 715I will withdraw: but this intrusion shall Now seeming sweet convert to bitter gall. [Exit] Romeo. [To JULIET] If I profane with my unworthiest hand This holy shrine, the gentle fine is this: 720My lips, two blushing pilgrims, ready stand To smooth that rough touch with a tender kiss. Juliet. Good pilgrim, you do wrong your hand too much, Which mannerly devotion shows in this; For saints have hands that pilgrims' hands do touch, 725And palm to palm is holy palmers' kiss. Romeo. Have not saints lips, and holy palmers too? Juliet. Ay, pilgrim, lips that they must use in prayer. Romeo. O, then, dear saint, let lips do what hands do; They pray, grant thou, lest faith turn to despair. 730 Juliet. Saints do not move, though grant for prayers' sake. Romeo. Then move not, while my prayer's effect I take. Thus from my lips, by yours, my sin is purged. Juliet. Then have my lips the sin that they have took. Romeo. Sin from thy lips? O trespass sweetly urged! 735Give me my sin again. Juliet. You kiss by the book. Nurse. Madam, your mother craves a word with you. Romeo. What is her mother? Nurse. Marry, bachelor, 740Her mother is the lady of the house, And a good lady, and a wise and virtuous I nursed her daughter, that you talk'd withal; I tell you, he that can lay hold of her Shall have the chinks. 745 Romeo. Is she a Capulet? O dear account! 
my life is my foe's debt. Benvolio. Away, begone; the sport is at the best. Romeo. Ay, so I fear; the more is my unrest. Capulet. Nay, gentlemen, prepare not to be gone; 750We have a trifling foolish banquet towards. Is it e'en so? why, then, I thank you all I thank you, honest gentlemen; good night. More torches here! Come on then, let's to bed. Ah, sirrah, by my fay, it waxes late: 755I'll to my rest. [Exeunt all but JULIET and Nurse] Juliet. Come hither, nurse. What is yond gentleman? Nurse. The son and heir of old Tiberio. Juliet. What's he that now is going out of door? 760 Nurse. Marry, that, I think, be young Petrucio. Juliet. What's he that follows there, that would not dance? Nurse. I know not. Juliet. Go ask his name: if he be married. My grave is like to be my wedding bed. 765 Nurse. His name is Romeo, and a Montague; The only son of your great enemy. Juliet. My only love sprung from my only hate! Too early seen unknown, and known too late! Prodigious birth of love it is to me, 770That I must love a loathed enemy. Nurse. What's this? what's this? Juliet. A rhyme I learn'd even now Of one I danced withal. [One calls within 'Juliet.'] Nurse. Anon, anon! Come, let's away; the strangers all are gone. [Exeunt]

      I can see various characterizations, themes and stylistic devices, which I will discuss below.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Using computational simulations, Mazar & Yovel 2025 dissect the inverse problem of how echolocators in groups manage to navigate their surroundings despite intense jamming.

      The authors show that despite the 'noisy' sensory environments that echolocating groups present, agents can still access some amount of echo-related information and use it to navigate their local environment. It is known that echolocating bats have strong small and large-scale spatial memory that plays an important role for individuals. The results from this paper also point to the potential importance of an even lower-level, short-term role of memory in the form of echo 'integration' across multiple calls, despite the unpredictability of echo detection in groups. The paper generates a useful basis to think about the mechanisms in echolocating groups for experimental investigations too.

      Strengths:

      The paper builds on a biologically well-motivated and parametrised 2D acoustic and sensory simulation setup to investigate the key parameters of interest.

      The 'null-model' of echolocators not being able to tell apart objects & conspecifics while echolocating still shows agents successfully emerging from groups, even though the probability of emergence drops severely in comparison to cognitively more 'capable' agents. This is nonetheless an important result, showing that the direction-of-arrival of a sound is itself the 'minimum' set of ingredients needed for echolocators navigating their environment.

      The results generate an important basis for unraveling how agents may navigate in sensorially noisy environments with many irrelevant and very few relevant cues.

      The 2D simulation framework is simple and computationally tractable enough to perform multiple runs to investigate many variables - while also remaining true to the aim of the investigation.

      Weaknesses:

      The authors have not yet provided a convincing justification for the use of different echolocation phases during emergence and in-cave behaviour. In the previous modelling paper cited for the details, the bat-agents are performing a foraging task, and so the switch in echolocation phases is understandable. For bats flying with conspecifics, the lab's previous paper has shown what they call a 'clutter response', but this is not necessarily the same as going into a 'buzz'-type call behaviour. As pointed out by another reviewer, the results of the simulations may hinge on the fact that bats are showing this echolocation phase-switching, and thus improving their echo-detection. This is not necessarily a major flaw, but something for readers to consider in light of the sparse experimental evidence currently at hand.

      The use of echolocation phases—defined as the sequential search, approach, and buzz call patterns—has been documented not only during foraging but also in tasks such as landing, obstacle avoidance, clutter navigation, and drinking. Bat call structure has been shown to vary systematically with object proximity, not exclusively in response to prey. During obstacle avoidance, phase transitions were observed, with approach calls emitted in grouped sequences and with reduced durations (Gustafson & Schnitzler, 1979; Schnitzler et al., 1987). In landing contexts, bats have been reported to emit short-duration calls and decrease inter-pulse intervals—buzz-like patterns also observed during prey capture— suggesting shared acoustic strategies across behaviors (Hagino et al., 2007; Hiryu et al., 2008; Melcón et al., 2007, 2009). Comparable patterns have been reported during drinking maneuvers, where “drinking buzzes” have been proposed to guide a precise approach to the water surface, analogous to landing buzzes (Griffiths, 2013; Russo et al., 2016). In response to environmental complexity, bats were found to shorten calls and increase repetition rates when navigating cluttered spaces compared to open ones (Falk et al., 2014; Kalko & Schnitzler, 1993).

      Moreover, field recordings from our study of Rhinopoma microphyllum (Goldshtein et al., 2025) revealed shortened call durations and inter-pulse intervals during dense group flight outside the cave during emergence. These patterns are consistent with the terminal-approach phase that is typical when coming very close to an object (another bat in this case). Author response image 1 shows an approach sequence recorded from a tagged bat approximately 20 meters from the cave entrance, with self-generated echolocation calls marked. The inter-pulse interval of ca. 20 ms is used by these bats when a reflective object (another bat in this case) is nearby.

      Author response image 1.

      These results provide direct evidence that bats actively employ approach-phase echolocation during swarming, likely to avoid collision with other bats. This supports the view that echolocation phase transitions are a general proximity-based sensing strategy, adapted across a variety of behavioral scenarios and not limited to hunting alone.

      In our simulations, bats predominantly emitted calls in the approach phase, with only rare occurrences of buzz-phase calls.

      See lines 355-363 in the revised manuscript.

      The decision to model direction-of-arrival with such high angular resolution (1-2 degrees) is not entirely justifiable - and the authors may wish to do simulation runs with lower angular resolution. Past experimental paradigms haven't really separated out target-strength as a confounding factor for angular resolution (e.g. see the cited Simmons et al. 1983 paper). Moreover, to this reviewer's reading of the cited paper - it is not entirely clear how this experiment provides source-data to support the DoA-SNR parametrisation in this manuscript. The cited paper has two array-configurations, both of which are measured to have similar received levels upon ensonification. A relationship between angular resolution and signal-to-noise ratio is understandable perhaps - and one can formulate such a relationship, but here the reviewer asks that the origin/justification be made clear. On an independent line, also see the recent contrasting results of Geberl, Kugler, Wiegrebe 2019 (Curr. Biol.) - who suggest even poorer angular resolution in echolocation.

      We thank the reviewer for raising this important point. The acuity of 1.5–3° in horizontal direction-of-arrival (DoA) estimation is based on the classical work of Simmons et al. with Eptesicus fuscus (Simmons et al., 1983). Similar precision was later supported by Erwin et al. (Erwin et al., 2001), who modeled azimuth estimation from measured interaural intensity differences (IIDs), reporting an average error of 0.2° with a standard deviation of ~2.2°, consistent with the behavioral data found by Simmons. The decline in acuity with increasing arrival angle has also been demonstrated in behavioral and physiological studies of binaural IID processing (Erwin et al., 2001; Fay, 1995; Razak, 2012; Wohlgemuth et al., 2016). The error model itself was first introduced in our earlier work (Mazar & Yovel, 2020).
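
      As a rough illustration of the figures quoted from Erwin et al. (2001), the following minimal sketch draws azimuth estimation errors with a mean of 0.2° and a standard deviation of 2.2°. The Gaussian shape, the function name `sample_azimuth_errors`, and the fixed seed are assumptions made for illustration only; this is not the manuscript's error model, and it does not include any dependence on SNR or arrival angle.

```python
import random
import statistics

# Values quoted in the response from Erwin et al. (2001).
ERWIN_MEAN_ERR_DEG = 0.2  # average azimuth error, degrees
ERWIN_SD_ERR_DEG = 2.2    # standard deviation of the error, degrees


def sample_azimuth_errors(n, seed=0):
    """Draw n azimuth errors in degrees.

    Assumption: errors are Gaussian. The source reports only a mean and a
    standard deviation, not the shape of the distribution.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same draws
    return [rng.gauss(ERWIN_MEAN_ERR_DEG, ERWIN_SD_ERR_DEG) for _ in range(n)]


if __name__ == "__main__":
    errors = sample_azimuth_errors(10_000)
    print(f"sample mean = {statistics.mean(errors):.2f} deg")
    print(f"sample sd   = {statistics.stdev(errors):.2f} deg")
```

      With enough draws, the sample mean and standard deviation should sit close to the quoted values, which puts the simulated errors on the same scale as the 1.5–3° behavioral acuity.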

      Importantly, Geberl et al. (Geberl et al., 2019) examined the resolution of weak targets masked by nearby strong flankers and found poor spatial discrimination of ~45 degrees; however, they were studying a detection problem, rather than the horizontal acuity of azimuth estimation. Indeed, our model assumes there is no spatial discrimination at all.

      Overall, while our DoA–SNR parametrization can certainly be critiqued and alternative parameterizations could be tested in future work, we believe it reflects a reasonable and empirically supported assumption. 

      Reviewer #2 (Public review):

      This manuscript describes a detailed model for bats flying together through a fixed geometry. The model considers elements which are faithful to both bat biosonar production and reception and the acoustics governing how sound moves in air and interacts with obstacles. The model also incorporates behavioral patterns observed in bats, like one-dimensional feature following and temporal integration of cognitive maps. From a simulation study of the model and comparison of the results with the literature, the authors gain insight into how often bats may experience destructive interference of their acoustic signals and those of their peers, and how much such interference may actually negatively affect the group's ability to navigate effectively. The authors use generalized linear models to test the significance of the effects they observe.

      The work relies on a thoughtful and detailed model which faithfully incorporates salient features, such as acoustic elements like the filter for a biological receiver and temporal aggregation as a kind of memory in the system. At the same time, the authors abstract features that are complicating without being expected to give additional insights, as can be seen in the choice of a two-dimensional rather than three-dimensional system. I thought that the level of abstraction in the model was perfect, enough to demonstrate their results without needless details. The results are compelling and interesting, and the authors do a great job discussing them in the context of the biological literature.

      With respect to the first version of the manuscript, the authors have remedied all my outstanding questions or concerns in the current version. The new supplementary figure 5 is especially helpful in understanding the geometry.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Data Availability: This reviewer lauds the authors for switching from a private commercial folder requiring login to one that does not. At the cost of being overly pedantic, the GitHub repository is not a long-term archival resource. The ideal solution is to upload the code to an academic repository (Zenodo, OSF, etc.) to periodically create a 'static snapshot' of the code for archival, while also hosting a 'live' version on GitHub.

      We have uploaded the code to a Zenodo repository and updated the link in the paper:

      How bats exit a crowded colony when relying on echolocation only - a modeling approach

      In one of the rebuttals to Reviewer #3, the authors cited the wrong paper (Beleyur & Goerlitz 2019) while discussing broad-bandwidth calls improving detection, and may wish to correct this on the record if possible.

      We have removed the incorrect citation from the revised version of the manuscript.

      Specific comments on the 2nd manuscript:

      Figure 5: Table 1 says 1, 2, 5, 10, 20, 40, 100 bats were simulated (lines 138-139), but the conclusion (line 398) says '1 to 100 bats' per 3 m². However, the X-axis stops at 40 and says 'number of bats', while the legend says bats/3 m². What is actually being plotted? Moreover, throughout the paper there is a constant back-and-forth between density and number of bats. Perhaps it is explained beforehand, but it is a bit unsettling, and more can be done to clarify these two conventions.

      While most parameters were tested across the full range of 1 to 100 bats per 3 m², a subset of conditions—including misidentification, multi-call clustering, wall target strength, and conspecific target strength—were simulated only up to 40 bats due to significantly longer run-times. This is now clarified in both the main text and the Table 1 caption.

      In our simulations, the primary parameter was the number of bats placed within a 3 m² starting area, which directly determined the initial density (bats per 3 m²). Throughout the manuscript, we use “number of bats” to refer to the simulation input, while “density” denotes the equivalent ecological measure. Figure 5 and related captions have been revised accordingly to note these conventions and to indicate when results are shown only up to 40 bats (see lines 120–122, 314-317 in the revised text).
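
      The relationship between the two conventions can be sketched in a few lines. This is a minimal illustration, assuming the starting area is exactly 3 m² and using the group sizes listed in Table 1; the names `density_per_m2` and `counts_for` are hypothetical and do not come from the authors' code.

```python
# Sketch only: maps the simulation input (number of bats) to an initial density.
START_AREA_M2 = 3.0                      # starting area stated in the response
BAT_COUNTS = [1, 2, 5, 10, 20, 40, 100]  # group sizes listed in Table 1
REDUCED_RANGE_MAX = 40                   # slower conditions were run only up to 40 bats


def density_per_m2(n_bats, area_m2=START_AREA_M2):
    """Return the initial density in bats per square metre."""
    if n_bats < 0 or area_m2 <= 0:
        raise ValueError("n_bats must be >= 0 and area_m2 must be > 0")
    return n_bats / area_m2


def counts_for(reduced_range):
    """Return the group sizes simulated for a condition.

    reduced_range=True corresponds to the conditions the response says were
    simulated only up to 40 bats because of longer run-times.
    """
    if reduced_range:
        return [n for n in BAT_COUNTS if n <= REDUCED_RANGE_MAX]
    return list(BAT_COUNTS)


if __name__ == "__main__":
    for n in BAT_COUNTS:
        print(f"{n:>3} bats per 3 m^2 = {density_per_m2(n):.2f} bats per m^2")
```

      Reporting both columns side by side, as the loop does, is one way to keep "number of bats" and "density" from being confused in figure axes and legends.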

      Table 1: This was made considerably difficult to read given the visual clutter - and I hope I've understood these changes correctly.

      What is in the square brackets of the effect size? For example, the first row with values 'Exit prob. (%)' says -0.37/bat [63:100]. What does this 63:100 refer to?

      What is the 'process flag'?

      Values in square brackets indicate the minimum and maximum values of the metric across the tested range (e.g., [63:100] shows the range of exit probabilities observed across different bat densities).

      The term “process flag” has been replaced with “with and without multi-call clustering” for clarity.

      Both the table layout and caption have been revised to reduce visual clutter and to make these conventions clearer to the reader. 

      Lines 562-3: "In our study, due to the dense cave environment, the bats are found to operate in the approach phase nearly all of the time, which is consistent with natural cave emergence behavior" - 'found to' implies that there is some experimental data or that it is an emergent property. See above for the point questioning the implementation of multiple echolocation phases in the model. Also, the bat-agents here are allowed to show different phases and thus they do so: it is a constraint of the implementation and not a result per se, given the size of the cave and the number of bats involved.

      We removed the sentence from the Methods section, since it could be misinterpreted as an experimental finding rather than a model outcome. Instead, we now discuss this in the Discussion, clarifying that the predominance of the approach phase arises from the cluttered cave environment in our simulations, which is consistent with natural emergence behavior (see lines 355-363). In this context, the use of echolocation phases is presented as a biologically plausible modeling choice rather than an empirical result.

      Lines 659-660: The parametrisation between DoA and SNR is supposedly found in 'Equation 10', which this reviewer could not find in the manuscript.

      The equation was accidentally omitted in the previous revision and has now been reinserted into the manuscript. It defines how direction-of-arrival (DoA) error depends on SNR and azimuth angle (see lines 603-605).

    1. Reviewer #2 (Public review):

      Summary:

      This work extends a previous recurrent neural network model of activity-silent working memory to account for well-established findings from psychology and neuroscience suggesting that working memory capacity constraints can be partially overcome when stimuli can be organized into chunks. This is accomplished via the introduction of specialized chunking clusters of neurons to the original model. When these chunking clusters are activated by a cue (such as a longer delay between stimuli), they rapidly suppress recently active stimulus clusters. This makes these stimulus clusters available for later retrieval via a synaptic augmentation mechanism, thereby expanding the network's overall effective capacity. Furthermore, these chunking clusters can be arranged in a hierarchical fashion, where chunking clusters are themselves chunked by higher-level chunking clusters, further expanding the network's overall effective capacity to a new "magic number", 2^{C-1} (where C is the basic capacity without chunking). In addition to illustrating the basic dynamics of the model with detailed simulations (Figures 1 and 2), the paper also utilizes qualitative predictions from the model to (re-)analyze data collected in previous experiments, including single-unit recordings from human medial temporal lobe as well as behavioral findings from a classic study of human memory.
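
      To make the scaling of the "magic number" concrete, here is a minimal sketch that evaluates 2^{C-1} for a few values of C. It only restates the formula quoted above; the function name `chunked_capacity` and the range of C values are illustrative assumptions, not taken from the paper.

```python
def chunked_capacity(basic_capacity):
    """Effective capacity with hierarchical chunking, per the formula 2^(C-1).

    basic_capacity is C, the capacity without chunking (assumed here to be a
    positive integer).
    """
    if basic_capacity < 1:
        raise ValueError("basic_capacity must be a positive integer")
    return 2 ** (basic_capacity - 1)


if __name__ == "__main__":
    # Illustrative values of C only; the paper may consider a different range.
    for c in range(1, 7):
        print(f"C = {c}: capacity with chunking = {chunked_capacity(c)}")
```

      For example, C = 4 gives 2^3 = 8 items, so under this formula the effective capacity doubles with each additional unit of basic capacity.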

      Strengths:

      The writing and figures are very clear, and the general topic is relevant to a broad interdisciplinary audience. The work is strongly theory-driven, but also makes some effort to engage with existing data from two empirical studies. The basic results showcasing how chunking can be achieved in an activity-silent working memory model via suppression and synaptic augmentation dynamics are interesting. Furthermore, we agree with the authors that the derivation of their new "magic number" is relatively general and could apply to other models, so those findings in particular may be of interest even to researchers using different modeling frameworks.

      Weaknesses:

      (1) Very important aspects of the model are assumed / hard-coded, raising the concern that it relies too much on an external controller, and that it would therefore be difficult to implement the same principles in a fully behaving model responsible for producing its own outputs from a sequence of stimuli (i.e., without a priori knowledge of the structure of incoming sequences).

      (i) One such aspect is the use of external chunking cues provided to the model at critical times to activate the chunking clusters. The simulations reported in the paper were conducted in a setting where signals to chunk are conveniently indicated by longer delays between stimuli. In this case, it is not difficult to imagine how an external component could detect the presence of such a delay and activate a chunking cluster in response. However, in order for the model to be more broadly applicable to different memory tasks that elicit chunking-related phenomena, a more general-purpose detector would be required (see further comments below and alternative models).

      (ii) Relatedly, and as the authors acknowledge in the discussion, the network relies on a pretty sophisticated external controller that decides when the individual chunking clusters are activated or deactivated during readout/retrieval. This seems especially complex in the hierarchical case. How might a network decide which chunking/meta-chunking clusters are activated/deactivated in which order? This was hard-coded in their simulations, but we imagine that it would be difficult to implement a general solution to this problem, especially in cases where there is ambiguity about which stimuli should be chunked, or where the structure of the incoming sequence is not known in advance.

      (iii) One of the central mechanisms of the model is the rapid synaptic plasticity in the inhibitory connections responsible for binding chunking clusters to their corresponding stimulus clusters. This mechanism again appears to have been hard-coded in the main simulations. Although we appreciate that the authors worked on one possible way that this could be implemented (Methods section D, Supplementary Figure S2), in the end, their solution seems to rely on precisely fine-tuning the timing with which stimuli are presented - a factor that seems unlikely to matter very much in humans/animals. This stands in contrast with models of working memory that rely on persistent activity, which are more robust to changes in timing. Note that we do not discount the possibility of activity-silent WM, and indeed it should be studied in its own right, but it is then even more important to highlight which of its features are dependent on the time constants, etc.

      (2) Another key shortcoming of this work is its limited direct engagement with empirical evidence and alternative computational accounts of chunking in WM. Although the efforts to re-analyze existing empirical results in light of the new predictions made by the model are commendable, in the end, we think they fall short of being convincing. As noted above, the model doesn't actually perform the same two tasks used in the human experiments, so direct quantitative comparisons between the model and human behavior or neural data are not possible. Instead, the authors rely on isolating two qualitative predictions of the model - the "dip" and "ramp" phenomena observed after a chunking cluster is activated (Figure 3), and the new magic number for effective capacity derived from the model in the case where stimuli are chunkable, which approximately converges with human recall performance in a memory study (Figure 4). Below, we highlight some specific issues related to these two sets of analyses, but the larger point is that if the model is making a commitment about how these neural mechanisms relate to behavioral phenomena, it would be important to test if the model can produce the behavioral patterns of data in experimental paradigms that have been extensively used to characterize those phenomena. For example, modern paradigms characterizing capacity limits have been more careful to isolate the contributions of WM per se (whereas the original magic number 7 is now thought to reflect a combination of episodic and working memory; see Cowan 2010). There are several existing models that more directly engage with this literature (e.g., Edin et al., 2009; Matthey et al., 2015; Nassar et al., 2018; Soni & Frank, 2025; Swan & Wyble, 2014; van den Berg et al., 2014; Wei et al., 2012), some of which also account for chunking-related phenomena (e.g., Wei et al., 2012; Nassar et al., 2018; Panichello et al., 2019; Soni & Frank, 2025). A number of related proposals suggest that WM capacity limits emerge from fundamentally different mechanisms than the one considered here - for example, content-related interference (Bays, 2014; Ma et al., 2014; Schurgin et al., 2020), or limitations in the number of content-independent pointers that can be deployed at a given time (Awh & Vogel, 2025), and/or the inherent difficulty of learning this binding problem (Soni & Frank, 2025). We think it would be worth discussing how these ideas could be considered complementary or alternatives to the ones presented here.

      (i) Single unit recordings. We found it odd that the authors chose to focus on evidence from single-unit recordings in the medial temporal lobe from a study focused on episodic memory. It was unclear how exactly these data are supposed to relate to their proposal. Is the suggestion that a mechanism similar to the boundary neurons might be operative in the case of working memory over shorter timescales in WM-related areas such as the prefrontal cortex, or that their chunking mechanism may relate not only to working memory but also to episodic memory in the medial temporal lobe?

      (ii) N-gram memory experiment. Our main complaint about the analysis of the behavioral data from the human memory study (Figure 4) is that the model clearly does not account for the main effect observed in that study - namely, the better recall observed for higher-order n-gram approximations to English. We acknowledge that this was perhaps not the main point of the analysis (which related more to the prediction about the absolute capacity limit M*), but it relates to a more general criticism that the model cannot account for chunking behavior associated with statistical learning or semantic similarity. Most of the examples used in the introduction and discussion are of this kind (e.g., expressions such as "Oh my God" or "Easier said than done", etc.). However, the chunking mechanism of the model should not have any preference for segmenting based on statistical regularities or semantic similarity - it should work just as well if statistical anomalies or semantic dissimilarity were used as external chunking cues. In our view, these kinds of effects are likely to relate to the brain's use of distributed representations that can capture semantic similarity and learn statistical regularities in the environment. Although these kinds of effects may be beyond the scope of this model, some effort could be made to highlight this in the discussion. But again, more generally, the paper would be more compelling if the model were challenged to simulate more modern experimental paradigms aimed at testing the nature of capacity limits in WM, or chunking, etc.

      (iii) There are a number of other empirical phenomena that we're not sure the model can explain. In particular, one of the hallmarks of WM capacity limits is a recency bias, where people are more likely to remember the most recent items at the expense of items presented prior to that (Oberauer et al., 2012). [There are also studies showing primacy effects in addition to recency effects, but the primacy effects are generally attributed to episodic rather than working memory - for example, introducing a distractor task abolishes the recency but not primacy effect]. But the current model seems to make the opposite prediction: when the stimuli exceed its base capacity, it appears to forget the most recent stimuli rather than the earliest ones (Figure 1d). This seems to result from the number of representations that can be reactivated within a cycle and thus seems inherent to the dynamics of the model, but the authors can clarify if, instead, it depends on the particular values of certain parameters. (In contrast, this recency effect is captured in other models with chunking capabilities based on attractor dynamics and/or gating mechanisms - e.g., Boboeva et al., 2023; Soni & Frank, 2025.) Relatedly, we're not sure if the model could account for the more recent finding that recall is specifically enhanced when chunks occur in early serial positions compared to later ones (Thalmann, Souza, & Oberauer, 2019).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study addresses the encoding of forelimb movement parameters using a reach-to-grasp task in mice. The authors use a modified version of the water-reaching paradigm developed by Galinanes and Huber. Two-photon calcium imaging was then performed with GCaMP6f to measure activity across both the contralateral caudal forelimb area (CFA) and the forelimb portion of primary somatosensory cortex (fS1) as mice perform the reaching behavior. Established methods were used to extract the activity of imaged neurons in layer 2/3, including methods for deconvolving the calcium indicator's response function from fluorescence time series. Video-based limb tracking was performed to track the positions of several sites on the forelimb during reaching and extract numerous low-level (joint angle) and high-level (reach direction) parameters. The authors find substantial encoding of parameters for both the proximal and distal parts of the limb across both CFA and fS1, with individual neurons showing heterogeneous parameter encoding. Limb movement can be decoded similarly well from both CFA and fS1, though CFA activity enables decoding of reach direction earlier and for a more extended duration than fS1 activity. Collectively, these results indicate involvement of a broadly distributed sensorimotor region in mouse cortex in determining low-level features of limb movement during reach-to-grasp.

      Strengths:

      The technical approach is of very high quality. In particular, the decoding methods are well designed and rigorous. The use of partial correlations to distinguish correlation between cortical activity and either proximal or distal limb parameters or either low- or high-level movement parameters was very nice. The limb tracking was also of extremely high quality, and critical here to revealing the richness of distal limb movement during task performance.

      The task itself also reflects an important extension of the original work by Galinanes and Huber. The demonstration of a clear, trackable grasp component in a paradigm where mice will perform hundreds of trials per day expands the experimental opportunities for the field. This is an exciting development.

      The findings here are important and the support for them is solid. The work represents an important step forward toward understanding the cortical origins of limb control signals. One can imagine numerous extensions of this work to address basic questions that have not been reachable in other model systems.

      Collectively, these strengths made this manuscript a pleasure to read and review.

      Thank you!

      Weaknesses:

      In the last section of the results, the authors purport to examine the representation of "higher-level target-related signals," using the decoding of reach direction. While I think the authors are careful in their phrasing here, I think they should be more explicit about what these signals could be reflecting. The "signals" here that are used to decode direction could relate to anything - low-level signals related to limb or postural muscles, or true high-level commands that dictate only what movement downstream motor centers should execute, rather than the muscle commands that dictate how. One could imagine using a partial correlation-type approach again here to extract a signal uncorrelated with all the measured low-level parameters, but there would still be all the unmeasured ones. Again, I think it is still ok to call these "high-level signals," but I think some explicit discussion of what these signals could reflect is necessary.

      Thank you for this excellent suggestion. We have followed both pieces of the reviewer’s advice. First, we performed the suggested analysis, partialing off the kinematics then performing target classification on the residuals. This is now Figure 6S1. The analysis revealed the presence of target-related information in the neural activity after subtracting off all linear correlations with kinematics, supporting our claims that higher-level information is present in both populations. The exact timing of classifier performances varied substantially across mice, potentially due to differences in reach-to-grasp strategy, kinematic tracking fidelity, and exact spatial locations of each recorded FOV. Following the second suggestion, we have made the relevant text more careful. We now conclude simply that higher-level signals, meaning those signals that are largely unrelated to forelimb joint angle kinematics, are present but with variable timing and strengths in each area. That text now reads:

      “Target decoding performance could result from truly higher-level signals that code abstractly for target location, or alternatively could be supported by strong encoding of kinematic variables that differed between targets. To disambiguate these possibilities, we refit the linear classifier to neural data after regressing off variance related to the joint angle kinematics. The strength and exact time course of the resulting target decoding varied somewhat across animals, but the earliest portion of target decoding performance persisted in all animals after the removal of kinematics and performance remained stronger for M1-fl than S1-fl (Fig. 6S1B). We thus conclude that higher-level signals are present in both areas, but differ in their exact timing and strength. However, we note that other possible signals, such as postural changes, could not be controlled for here.”
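      The control analysis described in this passage (regress the kinematics out of the neural data, then classify target from the residuals) might be sketched roughly as below. This is a minimal illustration on synthetic data, not the authors' pipeline: the single kinematic variable, the nearest-centroid classifier standing in for "the linear classifier", and every name and number in the code are assumptions made for the example.

```python
import random

def regress_out(y, x):
    """Residuals of y after removing its least-squares linear fit on x (with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]

def nearest_centroid_accuracy(train, train_labels, test, test_labels):
    """Fit one centroid per class on the training trials; return test accuracy."""
    centroids = {}
    for label in sorted(set(train_labels)):
        rows = [r for r, l in zip(train, train_labels) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(row):
        dists = {label: sum((a - b) ** 2 for a, b in zip(row, c))
                 for label, c in centroids.items()}
        return min(dists, key=dists.get)

    hits = sum(predict(r) == l for r, l in zip(test, test_labels))
    return hits / len(test)

# Synthetic data (all values hypothetical): two targets, one kinematic variable
# that differs between targets, and neurons carrying both kinematic and target signal.
rng = random.Random(0)
n_trials, n_neurons = 400, 10
labels = [i % 2 for i in range(n_trials)]
sign = [2 * l - 1 for l in labels]
kinematics = [0.5 * s + rng.gauss(0, 1) for s in sign]
w_kin = [rng.gauss(0, 1) for _ in range(n_neurons)]
w_target = [rng.gauss(0, 1) for _ in range(n_neurons)]
neural = [[wk * k + wt * s + rng.gauss(0, 0.5) for wk, wt in zip(w_kin, w_target)]
          for k, s in zip(kinematics, sign)]

# Remove the linear kinematic component from each neuron, then re-form trial rows.
residual_cols = [regress_out(list(col), kinematics) for col in zip(*neural)]
residuals = [list(row) for row in zip(*residual_cols)]

# Train on the first half of trials, test on the second half.
half = n_trials // 2
accuracy = nearest_centroid_accuracy(residuals[:half], labels[:half],
                                     residuals[half:], labels[half:])
print(f"target decoding accuracy on residuals: {accuracy:.2f}")
```

      In this toy setting, above-chance accuracy on the residuals indicates target information that is not linearly explained by the measured kinematic variable. As the quoted text notes, it cannot rule out contributions from variables that were never measured.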

      Related to this, I think the manuscript in general does not do an adequate job of explicitly raising the important caveats in interpreting parametric correlations in motor system signals, like those raised by Todorov, 2000. The authors do an expert job of handling the correlations, using PCA to extract uncorrelated components and using the partial correlation approach. However, more clarity about the range of possible signal types the recorded activity could reflect seems necessary.

      This is an important point, and our text could have unintentionally misled readers. We have now attempted to make this point explicit in the Discussion and in the Results for Figure 6. This Discussion text now reads:

      “Moreover, as is widely known (Todorov 2000), the exact role of these kinematically-related signals is challenging to determine from correlative measures alone; thus, determining whether these signals are used for direct movement control or instead indirectly reflect control performed elsewhere is left as a topic for future work.”

      The manuscript could also do a better job of clarifying relevant similarities and differences between the rodent and primate systems, especially given the claims about the rodent being a "first-class" system for examining the cellular and circuit basis of motor control, which I certainly agree with. Interspecies similarities and differences could be better addressed both in the Introduction, where results from both rodents and primates are intermixed (second paragraph), and in the Discussion, where more clarity on how results here agree and disagree with those from primates would be helpful. For example, the ratio of corticospinal projections targeting sensory and motor divisions of the spinal cord differs substantially between rodents and primates. As another example, the relatively high physical proximity between the typical neurons in mouse M1 and S1 compared to primates seems likely to yoke their activity together to a greater extent. There is also the relatively large extent of fS1 from which forelimb movements can be elicited through intracortical microstimulation at current levels similar to those for evoking movement from M1. All of these seem relevant in the context of findings that activity in mouse M1 and S1 are similar.

      We understand two points to address here. The first point is that we needed to be more careful to attribute previous results as being from the rodent vs. monkey. We agree. We have now revised several parts of the paper to make these distinctions clearer. The second point is about the potential benefit of a thorough review of the many ways in which primate and rodent sensorimotor systems differ. We entirely agree that this could be useful for the field. However, this is a sizable endeavor and doing it full justice is beyond what we know how to fit in the space allotted for framing our results here. We therefore sought a compromise, acknowledging how our results correspond to existing results in the primate without exhaustively accounting for how they differ. Future work will be necessary to more carefully disambiguate whether species-specific differences are due to biomechanical, neurological, ethological, or as-of-yet undetermined sources. We have incorporated your final specific points about what could produce similar information in M1 and S1 into the Discussion.

      “This may simply be a consequence of widely distributed representations of movement across mouse cortex (Musall et al. 2019; Steinmetz et al. 2019; Stringer et al. 2019), including forelimb somatosensory areas, or may be a consequence of the close physical proximity of M1-fl and S1-fl hindering development of functionally distinct representations (Tennant et al. 2011).”

      In addition, there are a number of other issues related to the interpretation of findings here that are not adequately addressed. These are described in the Recommendations for improvement.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Grier, Salimian, and Kaufman characterize the relationship between the activity of neurons in sensorimotor cortex and forelimb kinematics in mice performing a reach-to-grasp task. First, they train animals to reach to two cued targets to retrieve water reward, measure limb motion with high resolution, and characterize the stereotyped kinematics of the shoulder, elbow, wrist, and digits. Next, they find that inactivation of the caudal forelimb motor area severely impairs coordination of the limb and prevents successful performance of the task. They then use calcium imaging to measure the activity of neurons in motor and somatosensory cortex, and demonstrate that fine details of limb kinematics can be decoded with high fidelity from this activity. Finally, they show reach direction (left vs right target) can be decoded earlier in the trial from motor than from somatosensory cortex.

      Strengths:

      In my opinion, this manuscript is technically outstanding and really sets a new bar for motor systems neurophysiology in the mouse. The writing and figures are clear, and the claims are supported by the data. This study is timely, as there has been a recent trend towards recording large numbers of neurons across the brain in relatively uncontrolled tasks and inferring a widespread but coarse encoding of high-level task variables. The central finding here, that sensorimotor cortical activity reflects fine details of forelimb movement, argues against the resurgent idea of cortical equipotentiality, and in favor of a high degree of specificity in the responses of individual neurons and of the specialization of cortical areas.

      Thank you!

      Weaknesses:

      It would be helpful for the authors to be more explicit about which models of mouse cortical function their results support or rule out, and how their findings break new conceptual ground.

      We appreciate this feedback and have attempted to make these details clearer through changes to the Introduction and Discussion. One key change is noted below:

      “The presence of detailed kinematic signals in the sensorimotor cortex supports a model of mouse sensorimotor cortex in which M1-fl and S1-fl play a strong role in shaping the fine details of reaching and grasping movements.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      In addition to the weaknesses noted above, I suggest the authors also address the following:

      The last results section is generally lacking in statistical support for claims. Statistical support should be added.

      Thank you for pointing this out. We have added more statistical support to this section.

      The consideration in the Discussion of relevant previous findings and potential explanations for the distal limb signals in mouse sensorimotor cortex is somewhat lacking. There are several specific issues:

      (1) In contrast to the present study, the studies cited with regard to a lack of motor cortical involvement did not involve dexterous movements - in fact, Kawai et al. explicitly engineered a task that did not involve dexterity to distinguish the role of motor cortex in learning from its known role in dexterous movement execution. In Kawai et al., the authors note one rat who adopted a more dexterous approach to the lever pressing task; in this rat, a motor cortical lesion did cause a longer-lasting reduction in task performance. In additional experiments reported in Kawai's PhD thesis, performance of a dexterous task does erode with motor cortex lesion, as seen in other studies, like the early rodent reaching work of Whishaw and colleagues.

      (2) Other possible explanations for the persistence of non-dexterous tasks following motor cortical removal are compensation by, or redundant functionality in, other motor system regions.

      (3) It is also worth noting that stimulation in different regions of mouse M1 and S1 evokes, alternately, digit, wrist, and elbow movements in fairly similar proportions (Tennant et al., 2011), suggesting that descending pathways substantially target spinal circuits that control all forelimb joints.

      (4) It also seems relevant that, although the recovery time course is longer, nonhuman primates also retain substantial hand control after motor cortical removal (e.g. Lashley, 1925; Glees and Cole, 1950; Passingham et al., 1983). Humans, of course, appear to be a different story.

      These are good points. We have tried to make the Discussion better reflect the tension in the literature, including with this new text:

      “However, several other previous results have indirectly suggested that M1 and S1 may be involved in the details of forelimb movement. Performance suffers with inactivation or lesioning of M1 and S1 in skilled, complex manual behaviors (Guo et al 2015, Mizes et al 2024, Whishaw et al 1990) or idiosyncratic use of digits to accomplish non-dexterous tasks (Kawai 2014). The sparing of non-dexterous tasks with these lesions may also reflect redundancy in control as opposed to irrelevance of M1 and S1. Nevertheless, our finding of low-level kinematic information in sensorimotor cortex supports a role for cortex beyond simply providing redundant high-level commands to these subcortical areas.”

      We have avoided mentioning points 3 and 4 in the paper; the stimulation results might follow from activating projections not normally involved in this behavior, and discussing primates in this context would require a long list of caveats. We agree that these points are worth thinking about, but are concerned that they are too circumstantial to include in interpreting the results formally.

      Although similar decoding performance is achieved using neurons from both CFA and fS1, I am left wondering whether you would do substantially better with CFA using activity at additional preceding time points, or when using exclusively time points from the past. The primary model used here appears to use neural signals from corresponding time points to decode limb parameters, but results seemingly could be different when using preceding time points as regressors.

      We appreciate this suggestion and have added the analysis to an additional supplementary panel for Figure 5 (Figure 5S3). Incorporating lags into the decoder via a Wiener filter does indeed improve the decoding performance, but this could simply be due to the increase in the number of predictor variables. This analysis did not, however, further disambiguate M1-fl and S1-fl: the performance improvement was similar across areas for both causal and acausal lag configurations. This could be a consequence of the time resolution of calcium imaging, so further experiments with electrophysiology would be required to rule this possibility out. We now note this new result:

      “Including additional causal (-100 ms preceding) and/or acausal (-100 ms preceding to 100 following) lags improved decoding performance modestly and similarly for both areas (Fig. 5S3E-F).”
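      For readers unfamiliar with lagged (Wiener-filter style) decoding, the difference between the causal and acausal configurations can be illustrated by how the predictor matrix is built. The sketch below is a hypothetical construction, not the authors' code: lags are expressed in samples rather than milliseconds, since the imaging frame rate is not given here, and the function name and toy data are assumptions.

```python
def lagged_design(neural, lags):
    """Stack time-shifted copies of neural activity into one design matrix.

    neural: list of T rows, each a list of N neuron values.
    lags:   sample offsets; -2 means "activity 2 samples before the decoded time".
    Returns (rows, times): rows[i] holds N * len(lags) features for time times[i].
    Only time points where every requested lag is in range are kept.
    """
    n_samples = len(neural)
    first = max(0, -min(lags))            # earliest time with all past lags available
    last = n_samples - max(0, max(lags))  # one past the latest time with all future lags
    rows, times = [], []
    for t in range(first, last):
        rows.append([v for lag in lags for v in neural[t + lag]])
        times.append(t)
    return rows, times

# Toy data (hypothetical): 6 samples from 2 neurons.
neural = [[float(t), 10.0 + t] for t in range(6)]

causal_lags = [-2, -1, 0]          # past and present activity only
acausal_lags = [-2, -1, 0, 1, 2]   # past, present, and future activity

X_causal, t_causal = lagged_design(neural, causal_lags)
X_acausal, t_acausal = lagged_design(neural, acausal_lags)

print(len(X_causal[0]), len(X_acausal[0]))  # features per decoded time point
```

      A linear decoder fit by least squares on such a matrix has one weight per neuron per lag, so the number of predictors grows with the number of lags. This is the caveat the authors raise when interpreting the modest performance gain.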

      Related to this, I am also worried about the bleeding of signals across time here. If you deconvolve and interpolate between time points, the interpolation seemingly will pull information into the past, up to half the sampling period, which here is on the order of how long it takes signals to travel to and from the limb. The authors do not make any inappropriate claims about the neural signals here reflecting causes or consequences of what is happening at the limb, but readers (like me) will still try to draw these sorts of conclusions. Is it possible that, although decoding from instantaneous signals is similar for the two regions, the M1 signals are actually motor signals related to future limb state while the S1 signals are sensory consequences? Even if many of the relevant details related to conduction times are not known, perhaps the authors could clarify what can and can't be said related to causal interpretation here.

      Thank you for suggesting further explanation here. We agree that our interpretation could be made more specific. We have added text in the Discussion section to speak more directly to what can and cannot be concluded from our analyses. In short, it is hard to be certain of lags in calcium imaging data for many reasons, and using recording methods with finer temporal resolution (like electrophysiology) will be necessary for determining the precise temporal relationships between kinematics and neural activity. In the absence of these recordings, we limit our claim to kinematic information being present in M1-fl and S1-fl neural activity and leave determining the causal role of this information to future work.

      New clarifying text in the Discussion:

      “The use of calcium imaging further prevents strong conclusions about whether activity reflects future limb states or sensory consequences. Confirming this limitation, inclusion of lagged data in the decoding models, whether causal or acausal, resulted in similar performance changes in both areas.”

      An alternative reason why lift onset is less decodable in CFA is that CFA activates substantially before lift onset, as has been observed in previous rodent studies (Kargo and Nitz, 2004; Miri et al., 2017; Veuthey et al., 2020), perhaps as some sort of movement preparation. S1, on the other hand, may not have this early activity, and so may show a clearer transient at onset when the hand and limb start to move. This seems more likely than the explanations provided by the authors.

      This is a valid possible alternative explanation and we have updated the Discussion to reflect this. This difference in the structure of M1-fl activity versus S1-fl is apparent in the projections of Figure 6A, which show M1-fl projections more clearly aligned to cue-onset than S1-fl projections.

      “Our lift time decoding results are consistent with this view and align with recent observations characterizing mouse proprioceptive forelimb cortex, (Alonso et al 2023), although an alternative explanation may be simply that M1-fl activates earlier than S1-fl during reaching (Kargo and Nitz 2004; Miri et al 2017; Veuthey et al 2020).”

      To better clarify relevant similarities and differences between the rodent and primate systems, the Introduction could include some of these similarities and differences exposed by the literature currently cited, and the Discussion could include an additional paragraph specifically relating findings here to previous observations in the primate.

      We appreciate the reviewer’s thoughtfulness on possible framings of our results. When writing this paper, framing was a major challenge for us and we drafted quite a few versions of the Introduction including some that focused more on mouse-primate comparison. In the end, we decided the most critical function of the Intro was to set up our central question, of “levels-of-sensorimotor-control”. The rich primate literature was valuable here, but getting into a protracted compare-and-contrast exercise quickly became a distraction from the point. Further, we sought to highlight the relevance and importance of the question answered in our work as the mouse has gained prominence for filling gaps that are challenging to address with primates. This paper serves as one of many early steps towards the ultimate goal of revealing general properties of sensorimotor cortical function with the mouse model. We have made some subtle changes to the Introduction that we hope will more clearly communicate this narrative. 

      We agree that a Discussion paragraph directly relating our results to those in primates would benefit our conclusions and have added one:

      “These results expand our understanding of the rodent sensorimotor system and highlight similarities to nonhuman primates. We show here evidence in mice of detailed joint angle kinematic signals from the full forelimb in M1 and S1, as has been shown in macaque cortex during tasks involving reaching and grasping objects (Vargas-Irwin et al. 2010; Saleh et al. 2010, 2012; Goodman et al. 2019; Okorokova et al. 2020). Additionally, the earlier onset of movement-related activity in M1-fl compared to S1-fl is similar to macaque M1 and S1 (Tanji and Evarts 1976). Taken together these results suggest that the mouse can be employed to address questions traditionally explored in primates about how cortical activity encodes detailed movement commands.”

      Although this is outside the scope of the present study, it would be interesting to image descending projection neurons to see what signals are conveyed downstream, and to what targets. Some signals observed in layer 2/3 may not be strongly reflected in descending projections.

      We agree that recording from descending projection neurons in this task would be of deep interest – and also agree that these experiments are beyond the scope of the present study. We look forward to performing these additional experiments in future work.

      Minor:

      (1) The use of "CFA" and “fS1” is a bit confusing. S1, like M1, is defined primarily based on histological criteria, while CFA is defined by intracortical microstimulation. CFA contains a substantial fraction of fS1, seemingly most of it based on the maps shown in Tennant et al., 2011. This is not really a criticism, as the field has not reached any sort of consensus on this nomenclature yet.

      We are similarly unhappy with the inconsistency of the terminology in the field, and struggled with how not to make it worse. After much debate and consultation with colleagues, we decided to use “M1” and “S1” to evoke the century of literature on these areas; and “-fl” to indicate forelimb because it is more intuitive than “-ul” and avoids using the illegible “-ll” for hindlimb (relevant to our subsequent paper). For what we called M1-fl, we recorded where we did because anecdotally we saw similar responses across that swath; but note that this definition is also consistent with the definition of “MOp-ul” found with multimodal mapping by Munoz-Castaneda (2021), which extends a little anteriorly of MOp as defined by the Allen CCF. As the field continues to mature, we hope future work can converge on a set of shared terms.

      (2) Page 4: "Inactivations and lesions of M1 and S1 have shown that M1 is required for the execution of dexterous reach-to-grasp movements" - to me, earlier work from Whishaw and colleagues deserves to be cited here.

      We appreciate the suggestion and have updated the references in this section to better reflect the prior work from Whishaw and other researchers.

      (3) Page 5: "evoking sufficient trial-to-trial variability to avoid model overfitting." - what I think the authors are referring to here is a particular kind of "overfitting," the consequence of not exploring the full movement space, as opposed to model overfitting from issues with the model-fitting method itself. Rather than just saying overfitting, the authors could be clearer about what they are referring to.

      The reviewer is right; the phenomenon we intended to refer to is not properly termed overfitting. Specifically, we meant that data with restricted range does not necessarily express global structure, and models can therefore incorrectly fit them. For example, fitting a linear model to data including many periods of a sine wave will correctly show a zero-slope linear component, but fitting to only a portion of a single cycle will typically yield a nonzero slope. This is not overfitting, is not exactly underfitting (because the relevant structure is barely present in the data, as opposed to missed by an insufficiently powerful model), is not bias (the data are fit well), and is not even necessarily a problem (the local relationship may be what you are interested in). Yet, it does not reflect the larger structure of the data.
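
      The sine-wave example above can be checked numerically. The following is a minimal sketch, not code from the study: the sampling ranges, the number of samples, and the helper names `ols_slope` and `sample_sine` are all assumptions chosen for illustration. Over many full periods the fitted slope is close to zero (and shrinks further as more periods are included), whereas over a fraction of one cycle it is clearly nonzero.

```python
import math


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def sample_sine(start, stop, n=2001):
    """Evenly sample sin(x) on [start, stop] (hypothetical helper)."""
    xs = [start + (stop - start) * i / (n - 1) for i in range(n)]
    return xs, [math.sin(x) for x in xs]


# Many periods: the data express the global structure, so the slope is near zero.
xs_full, ys_full = sample_sine(0.0, 40 * math.pi)  # 20 full periods
slope_full = ols_slope(xs_full, ys_full)

# Restricted range: a portion of a single cycle looks locally linear.
xs_part, ys_part = sample_sine(-math.pi / 4, math.pi / 4)
slope_part = ols_slope(xs_part, ys_part)

print(f"slope over 20 periods:      {slope_full:+.4f}")
print(f"slope over a partial cycle: {slope_part:+.4f}")
```

      Both fits are "good" in the sense described above; they simply answer different questions about local versus global structure.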

      We do not know of a standard term for this phenomenon, so instead of dragging the reader through this tangential argument, we have tried to offer a simpler motivation for using multiple targets:

      “Assessing the relationship between neural activity and the details of movement requires striking a balance between achieving repeatable behavior and evoking sufficient trial-to-trial variability to broadly sample movement space”.

      (4) Page 5: Caudal Forelimb Area should not be capitalized.

      Obviated with the change in area nomenclature.

      (5) Page 7: "of linearly independent degrees of freedom" - for a neuroscience audience, I think it is better to explicitly mention that the resulting PCs are uncorrelated.

      We agree that this section could benefit from clarification. We have attempted to provide additional nuance to indicate what the analysis was intended to test.

      “Despite the strong coupling between the proximal and distal joint angles, rich variation remained in the action of different joints over time. The presence of strong correlations across joints suggested that the kinematics may be well described by a smaller number of independent degrees of freedom than the total number of recorded angles. To assess the number of linearly independent (uncorrelated) degrees of freedom amongst the 24 joint angles and velocities, we used double-cross-validated PCA (Yu et al. 2009; Methods; Fig. 3D), finding intermediate dimensionalities of 7 (median for joint angles) and 10 (velocities; Fig. 3E). This is consistent with the idea that joint angles across the limb are coordinated instead of controlled independently, and that this coordination is flexible enough over time to enable accurately performing reaching and grasping to different targets.”

      (6) Page 7: In the Results, the authors should mention what indicator is being used, the imaging frame rate, and summarize briefly how cells were defined.

      Thank you for the suggestion, these details have been added to the relevant results section for clarity.

      “To do so, we recorded neural activity from neurons in layer 2/3 M1-fl extending into the immediately adjacent secondary motor cortex (M2), and the forelimb region of S1 (S1-fl) using two-photon calcium imaging of GCaMP6f-expressing neurons in layer 2/3 (185-230 μm deep, imaged at 31 Hz, cells extracted with Suite2p (Pachitariu et al 2017)).”

      (7) Page 7: "corrected at n=2" - n doesn't typically refer to the number of tests, so for clarity I would say "corrected for dual tests."

      Thank you for pointing this out, we have corrected the text and added additional explanation in the methods for our approach to determining statistical significance across the targets and locking events.

      “P-values obtained through the ZETA were then Bonferroni corrected for dual tests when measuring the number of cells modulated to a given event and corrected for six tests (2 targets and 3 events) when measuring the overall number of modulated cells.”
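
      The arithmetic of the two correction levels quoted above can be sketched as follows. This is an illustration only: the p-values are hypothetical, the 0.05 threshold is an assumption, and the ZETA test itself is not implemented here.

```python
def bonferroni(p_values, n_tests):
    """Bonferroni-adjust p-values for n_tests comparisons, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


ALPHA = 0.05  # assumed significance threshold

# Hypothetical raw p-values for three cells (not data from the study).
p_raw = [0.004, 0.02, 0.3]

# Per-event modulation: corrected for two tests.
per_event = bonferroni(p_raw, n_tests=2)
# Overall modulation: corrected for six tests (2 targets x 3 events).
overall = bonferroni(p_raw, n_tests=6)

sig_event = [p < ALPHA for p in per_event]
sig_overall = [p < ALPHA for p in overall]
print(sig_event, sig_overall)
```

      In this toy case the second cell passes the two-test correction but not the six-test correction, which shows how the stricter criterion reduces the count of modulated cells.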

      (8) Page 7: In the Results, when the decoding is introduced, it would be helpful to have a few details without having to hunt through the Methods. For example, were things regularized, how was cross-validation handled, etc?

      Thank you for the suggestion, these details have been added to the relevant results section for clarity.

      A simple linear regression model related the single-trial joint angles at all time points to single-trial neural activity at the corresponding moments. The model was fit with ridge regression, the ridge penalty was determined via a heuristic (Karabatsos 2018), and performance was measured on held-out trials (80/20 train/test split, 50 folds).
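
      A minimal sketch of this style of decoder is given below, under several stated assumptions: the data are synthetic, the ridge penalty is a fixed value rather than the heuristic cited above, "50 folds" is interpreted as 50 random 80/20 splits of trials, and all function names are hypothetical. It is not the analysis code used in the study.

```python
import random


def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def ridge_fit(X, y, lam):
    """Ridge weights: solve (X^T X + lam * I) w = X^T y."""
    d = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * t for x, t in zip(X, y)) for i in range(d)]
    return solve_linear(A, b)


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data: 3 "neurons" linearly driving one "joint angle".
rng = random.Random(0)
TRUE_W = [0.8, -0.5, 0.3]
N_TRIALS, N_TIMEPOINTS = 40, 10
trials = []
for _ in range(N_TRIALS):
    X = [[rng.gauss(0, 1) for _ in TRUE_W] for _ in range(N_TIMEPOINTS)]
    y = [sum(w * v for w, v in zip(TRUE_W, x)) + rng.gauss(0, 0.1) for x in X]
    trials.append((X, y))


def flatten(trial_ids):
    """Concatenate all time points from the selected trials."""
    X = [x for i in trial_ids for x in trials[i][0]]
    y = [t for i in trial_ids for t in trials[i][1]]
    return X, y


scores = []
for _ in range(50):  # 50 random splits, each 80% train / 20% test trials
    ids = list(range(N_TRIALS))
    rng.shuffle(ids)
    n_train = int(0.8 * N_TRIALS)
    X_tr, y_tr = flatten(ids[:n_train])
    X_te, y_te = flatten(ids[n_train:])
    w = ridge_fit(X_tr, y_tr, lam=1.0)  # fixed penalty (assumption)
    y_hat = [sum(wi * xi for wi, xi in zip(w, x)) for x in X_te]
    scores.append(r_squared(y_te, y_hat))

mean_r2 = sum(scores) / len(scores)
print(f"mean held-out R^2 over {len(scores)} splits: {mean_r2:.3f}")
```

      Splitting by trial rather than by time point keeps temporally adjacent samples from the same trial out of both the training and test sets, which is the reason the sketch shuffles trial indices.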

      (9) Page 8: I think it is worth noting how much mouse reaching involves shoulder rotation as opposed to movement in other joints, as this seems very different from primates.

      Thank you for pointing this out. We think this is mostly a task difference: our mice were in a quadrupedal stance, whereas monkeys are typically asked to reach from a sitting position. We now mention this in the Results. 

      “Reaching evoked particularly large rotation of the shoulder, likely because the mice reached from a quadrupedal position to targets on either side of the snout.”

      (10) Page 8: Should provide quantification to clarify what is meant by "closely tracked."

      We have updated the text to indicate that this claim was meant to be qualitative, and to more clearly highlight that the interest here is the first demonstration of the ability to reconstruct valid forelimb postures from decoded joint angles in the mouse. Quantifying the reconstruction properly would require substantially more manual data labeling, and the successful decoding itself demonstrates indirectly that the reconstructions are good enough to obtain the results of interest.

      Additionally, we reconstructed the skeletal representation of the forelimb from the decoded joint angles and found that, as intended, the reconstructed postures had strong qualitative resemblance to the true postures, even of “minor” angles like cylindrical paw deformation or digit splay (Fig. 5C,G).

      (11) Page 8: "Overall, these results suggest that instantaneous movement-related signals are similarly distributed across CFA and fS1." - I know we are being succinct here, but this sentence sounds like a non sequitur in the context of this paragraph - perhaps include a conclusion from the results in this paragraph first, then summarize the whole section.

      Thank you for the suggestion, we have updated this text to more clearly conclude the results of this section.

      Overall, these results reveal that neural activity in M1-fl and S1-fl is closely related to the kinematic details of reach-to-grasp movements. The ability to decode substantial variance in proximal and distal joints suggests that this relationship extends to the entire forelimb and the similar performance obtained from each area suggests that this information is similarly distributed across M1-fl and S1-fl. 

      (12) Page 10: Mention of projections from fS1 does not explicitly specify their preferential targeting of the dorsal horn, which seems relevant.

      We appreciate the suggestion and have added this detail to the text.

      Rodent S1-fl is known to influence interneuron populations in the spinal cord through direct and indirect projections that predominantly target the dorsal horn (Ueno et al. 2018); thus, these signals may also reflect S1-fl’s important role in modulating reflex circuits to coordinate sensory feedback with movement generation (Moreno-López et al. 2016; Moreno-López et al. 2021; Seki et al. 2003).

      (13) Page 31: Labels on the figure indicating what blue and red stand for would be helpful.

      Thank you for the suggestion, labels have been added to indicate left and right trials for Figure 5 C/F and Figure 6A.

      (14) Page 32: Legend does not include panel D.

      Thank you for catching this, the corresponding caption has been added.

      Reviewer #2 (Recommendations for the authors):

      (1) The Introduction could perhaps set the central question in starker relief. What specifically do the authors mean by high- vs low-level control? As suggested by the cited studies, this has been a fraught issue in primate work for decades, and I think a finer-grained framing of alternative hypotheses would help set up the results. For example, would better performance at decoding joint angles than paw position be evidence for lower-level control? The clarity of the Introduction might also be improved if the facts and unknowns were broken down by species throughout.

      We have tried to further improve the focus of the Introduction on the central question, clarify what we mean, and make clearer in the review of the literature which species a finding comes from.

      The clarifying text from the introduction is quoted below:

      Extensive motor mapping experiments in rodents have revealed that activating different parts of the sensorimotor cortex evokes movements of different body parts or different kinds of movements of the same body part, as it does in primates (for review, see (Harrison and Murphy 2014)). Yet it is unclear how the topography of stimulation-evoked movements relates to the roles of these areas during volitional actions. Perturbations during behavioral tasks in mice involving forelimb lever or reaching movements have provided a coarse-level understanding of how these areas contribute during behavior. Inactivations and lesions of M1 and S1 have shown that M1 is required for the execution of dexterous reach-to-grasp movements (Guo et al. 2015; Sauerbrei et al. 2020; Galiñanes et al. 2018; Wang et al. 2017; Whishaw et al. 1991; Whishaw 2000) and that S1 is essential for adapting learned movements to external perturbations of a joystick (Mathis et al. 2017). However, spinal cord projections from mouse M1 and S1 primarily target spinal interneurons rather than directly synapsing onto motor neurons (Gu et al. 2017; Ueno et al. 2018; Wang et al. 2017), suggesting cortical activity might play a more modulatory role. Further, stimulation of brainstem nuclei alone can evoke naturalistic forelimb actions, including realistic reaching movements involving coordinated flexion and extension of the proximal and distal limb (Esposito et al. 2014; Ruder et al. 2021; Yang et al. 2023). Taken together, these results have raised the question of what role mouse M1 and S1 play in the control of goal-directed forelimb movements. 

      One route to answering this question involves characterizing the signals present in mouse M1 and S1 during movement. If mouse M1 and S1 were to control only high-level aspects of forelimb movements, activity should be dominated by ‘abstract’ signals like target location and reflect little trial-to-trial variability in reach kinematics. If instead M1 and S1 control low-level movement features then activity should correlate strongly with forelimb joint angle kinematics and their trial-to-trial variation when reaching to different targets. While the presence of high- or low-level signals in a cortical area does not necessarily imply that they are causally responsible for these aspects of movement, characterizing what signals are present serves as a first step toward determining how these areas relate to movement.

      (2) The kinematics and calcium traces appear to be highly stereotyped across trials. If the population encodes joint angles, would one expect to find correlations between the neural and kinematic residuals after subtraction of the time-varying means? Some additional analysis and/or discussion on this point would be helpful, especially as there are only two targets.

      This is a great idea. As suggested, we implemented regression models on the residuals for each target in the new Figure 5S3. Figure 5S3 A and B show the performance when decoding the residuals for right trials and C and D show performance for left trials. Decoding performance remained well above chance, although it decreased because the models were now predicting only the relatively small within-target variation. This analysis supports our claims from the main regression models in Figure 5 and 5S1-2, and also suggests that movements ipsilateral to the reaching limb (contralateral to the recording hemisphere) may be better encoded than movements contralateral to the reaching limb. We have added a reference to this additional residual analysis in the final paragraph of the decoding section of the Results section:

      “Finally, we tested whether the ability to decode these many joint angles was a direct consequence of inter-joint correlations, and might not be indicative of the presence of “real” information about some of these joints. To do so, we fit partial correlation models that removed correlations between proximal and distal joints, or removed correlations of the joint angles with a high-level parameter – the overall distance of the paw centroid to the spout. Despite substantially lowering the behavioral variance, in each case the residuals could still be decoded from neural activity (Fig 5S2A-D). Similar decoding performance for M1-fl and S1-fl was obtained from models fit to decode single-trial residuals separately for left and right trials (Fig 5S3A-D), indicating that trial-to-trial variations on each basic movement were decodable from these populations.”
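
      The residual step in this analysis (removing the time-varying mean for one target before decoding) can be sketched as below. The numbers are toy values and the function name is hypothetical; the same subtraction would be applied to both the neural and the kinematic traces, separately for left and right trials.

```python
def subtract_time_varying_mean(trials):
    """Remove the across-trial mean at each time point.

    trials: list of equal-length traces (one per trial) for a single target.
    Returns (residuals, mean_trace).
    """
    n_trials = len(trials)
    n_time = len(trials[0])
    mean_trace = [sum(tr[t] for tr in trials) / n_trials for t in range(n_time)]
    residuals = [[tr[t] - mean_trace[t] for t in range(n_time)] for tr in trials]
    return residuals, mean_trace


# Toy joint-angle traces (degrees) for three right-target trials.
right_trials = [
    [10.0, 20.0, 35.0],
    [12.0, 19.0, 33.0],
    [11.0, 21.0, 37.0],
]
residuals, mean_trace = subtract_time_varying_mean(right_trials)
print(mean_trace)  # stereotyped component shared by all trials
print(residuals)   # trial-to-trial variation left for decoding
```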

      Along similar lines, binary classification is used to characterize cue-, lift-, and contact-responsive neurons. Is it possible to exploit trial-to-trial variation in the cue-lift and lift-contact latencies to extract the time-varying marginal effects of each event (e.g., using a GLM)?

      For the detection of single-cell modulations by different events, we have elected to retain our simple statistical test to determine modulation; in our experience, encoding models typically involve a surprising number of steps to get them to do what you actually intend. We leave more extensive encoding model-style analysis to future work, currently in progress.

      (3) The authors mention prior studies suggesting that the control of some forelimb tasks can be gradually transferred from the cortex to the subcortical centers. Have they performed the inactivation at different time points across learning, and if so, do they have evidence for a diminishing effect over time (e.g., blocking of both initiation and coordination early in training)? In addition, the effects of motor cortex inactivation are similar to, but slightly different from, effects shown in reaching tasks in prior studies. Some additional discussion on this point would be useful.

      Our inactivation experiments in this study were intended to coarsely demonstrate the involvement of mouse forelimb sensorimotor cortex in our task. We have not performed the inactivations over learning and leave such experiments to future work. 

      We agree that a little more clarity relating our results to previous ones was warranted. Previous studies (Guo et al. 2015 and Galinanes et al. 2018) have demonstrated inactivation impacts on similar tasks, but for thoroughness we sought to show the same for our task as it varied from the pellet and motorized water spout tasks in both training time and target configurations. Our results are strongly in line with those of Galinanes et al. 2018 which used a fairly similar water spout target configuration. In the inactivation experiments of that paper, 3 out of 13 animals with initiation-triggered inactivations were able to initiate reaching within a time window similar to control trials. Additionally, a proportion of trials across multiple mice proceeded with little perturbation from the inactivations. This is consistent with our observation that M1-fl inactivations may either abolish movement initiation or allow movement initiation but impair task completion on a trial-by-trial and animal-to-animal basis. Further work is required to determine what factors influence these differential responses to inactivation and to determine how these effects differ across task variations (i.e., pellet vs water spout). We have added a brief description of these nuances to the text for clarity. 

      “These inactivations blocked the execution of the reach to grasp sequence, preventing the animal from making contact with the spout during the 3-second laser stimulation period (Fig. 1F; 86.5% control trials with contact within 3 seconds of cue, 5.1% inactivation trials with contact, P < 10<sup>-191</sup>, Mann-Whitney U test, 2 mice, 495 stimulation trials). Interestingly, inactivation at the time of cue often did not prevent reach initiation (mouse 1: 54.7%, mouse 2: 34.2% of inactivation trials with lift within 3 seconds; 93.5%, 86.2% control trials). Yet the movement stalled once the paw and digits extended towards the spout, producing uncoordinated and unsuccessful reaching trajectories (Fig. 1I, two representative datasets). Taken together, these results support the involvement of M1-fl in the water-reaching task and suggest that the strength of inactivation effects may depend on specific task details like training time or target configuration (c.f. Galinanes et al. 2018).”

      Minor points

      (1) The rationale for the multiple comparisons procedure in identifying event-locked responses should be explained in more detail. If I understand correctly, the authors are not correcting for comparisons across ROIs, but instead control the family-wise error rate across brain regions and event types (dividing alpha by two or six). Why not instead control the false discovery rate across ROIs? 

      Thank you for pointing this out, it was confusing as written and we received a similar comment from Reviewer 1. We have fixed the wording now to make it clearer why we did this. We simply aimed to describe how many of the recorded neurons in each area were modulated by the task as a proxy for the engagement of these areas during the behavior, and to use this measure of modulation as a criterion for including the neuron in subsequent analysis. In other words, if the question had been “are any neurons in this area modulated by the task?” then correcting for the number of ROIs would be the correct method; but if the question is, “is this neuron probably modulated and therefore worth including in my decoder?” correcting for the number of ROIs will typically be much too conservative. Thus, we only sought to correct for the false discovery rate across events and targets for each ROI. We have added additional text in the methods to clarify these choices, below. Please also see response to (7) from Reviewer 1 above.

      “Note that we did not correct for the number of ROIs tested for two reasons. First, the goal of this testing was to serve as a criterion for inclusion in subsequent decoding analyses, not to determine whether any neurons in the area at all were modulated; and second, correcting for the number of ROIs would bias comparison between areas if different numbers of ROIs were recorded in one area vs. the other.”

      (2) It appears joint angles are treated as linear variables in the decoding analysis; is this correct? This seems reasonable as long as the range of motion is not too large, but the authors might briefly comment on the issue in the Methods. 

      Yes, all joint angles are treated as linear variables in the linear regression model. We observed empirically (as can be seen in Figure 3B and Figure 5B/F) that the joint angle variables were relatively constrained to specific ranges during the task, with no angles displaying substantial wrap-around during the reaching and grasping movements. It is true that use of nonlinear decoding would almost surely improve performance further. Future work could also compare decoding of joint angles with muscle forces, which correlate and which we made no effort to distinguish here. In this work, though, the demonstration of a substantial relationship between neural activity and kinematics already tells us that fine details of movement are present in the M1 and S1-fl populations, which is a critical fact to understand these areas and was not previously known. We now comment explicitly on this, as suggested.

      “Joint angle or velocity kinematics were linearly interpolated from their original 6.66 ms to 10 ms and smoothed with a Gaussian (15 ms s.d.). These angular variables were then treated linearly in decoding analyses as their ranges were relatively constrained during the reaching and grasping movements; although the true relationships are likely nonlinear, this serves as a sufficient approximation to demonstrate the presence of a relationship between neural activity and kinematics.”
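
      The preprocessing quoted above (resampling from 6.66 ms to 10 ms, then Gaussian smoothing with a 15 ms s.d.) can be sketched as follows. This is an illustrative implementation, not the study's code: the example trace is synthetic, the kernel is truncated at 4 s.d., and the kernel is renormalized at the edges; the last two choices are assumptions.

```python
import math


def lin_interp(t_src, y_src, t_new):
    """Linearly interpolate (t_src, y_src) at sorted times t_new in range."""
    out, j = [], 0
    for t in t_new:
        while j < len(t_src) - 2 and t_src[j + 1] < t:
            j += 1
        t0, t1 = t_src[j], t_src[j + 1]
        frac = (t - t0) / (t1 - t0)
        out.append(y_src[j] + frac * (y_src[j + 1] - y_src[j]))
    return out


def gaussian_smooth(y, sd_samples, truncate=4.0):
    """Smooth y with a Gaussian kernel, renormalizing weights at the edges."""
    radius = int(truncate * sd_samples + 0.5)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sd_samples) ** 2) for k in offsets]
    out = []
    for i in range(len(y)):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            idx = i + k
            if 0 <= idx < len(y):
                num += w * y[idx]
                den += w
        out.append(num / den)
    return out


SRC_DT_MS, NEW_DT_MS, SD_MS = 6.66, 10.0, 15.0

# Synthetic joint-angle trace sampled every 6.66 ms (not real data).
t_src = [i * SRC_DT_MS for i in range(151)]
angle_src = [30.0 + 20.0 * math.sin(2 * math.pi * t / 400.0) for t in t_src]

# Resample onto a 10 ms grid, then smooth with s.d. = 15 ms = 1.5 samples.
t_new = [i * NEW_DT_MS for i in range(int(t_src[-1] // NEW_DT_MS) + 1)]
angle_10ms = lin_interp(t_src, angle_src, t_new)
angle_smooth = gaussian_smooth(angle_10ms, SD_MS / NEW_DT_MS)
```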

      (3) Are the limb pose estimates mirrored along the mediolateral axis? Figures 1C and 2D appear to show reaches to the left spout on the animal's right.

      Thank you for pointing out the ambiguity in the display of these data. The reach trajectories were not mirrored along the mediolateral axis, but they are displayed from the perspective of the behavioral imaging cameras as shown in Figure 1A. Thus the right target reaches (ipsilateral to the animal’s reaching arm) are on the left side of the camera image and the left target reaches (contralateral to the animal’s reaching arm) are on the right side of the image. We have clarified this in the figure captions.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      General recommendations (from the Reviewing Editor):

      The reviewers agreed that addressing some specific concerns would improve the clarity of the paper and the strength of the conclusions. These points are listed below, and described in more detail in the reviewer-specific 'Recommendations for Authors':

      We thank the editor and reviewers for the encouraging feedback and constructive comments. We provide our point-by-point response below.

      (1) The details of the new experiment including number of subjects and a description of the analysis should be provided in the main text.

      We now provide a detailed description of the methods (including the number of subjects; N = 30) and analyses for the new experiment. See our response to Reviewer 2 for more details.

      (2) It would be informative to see how the observed amplitude biases agree with those found by Gordon et al. 1994.

      Addressed. Please see our response to Reviewer 1, comment 1.

      (3) Each of the models leads to a different bias pattern. It would be very helpful to hear the authors' interpretation, ideally with a mathematical explanation, of what leads to these distinct patterns.

      Addressed. Please see our response to Reviewer 1, comment 2.

      Reviewer #1 (Recommendations for the authors):

      (1) Most of my points have been addressed convincingly in this revision. The new experiment in which also biases in movement amplitude were determined is a welcome addition to the paper. However, I could not see the results of this study, as the authors did not include Fig. 4 in the manuscript, but repeated Fig. 3. That's unfortunate as I would have liked to see the similarity between the biases in direction and amplitude. Moreover, I would have liked to see how the amplitude biases agree with those found by Gordon et al. EBR (1994) 99:112-130, and to which extent Gordon et al.'s explanation can explain the pattern.

      We apologize for including the incorrect figure in the previous version of our manuscript. We did make a correction and submitted a corrected version, but it appears that it didn’t make its way to you. The correct Figure 4 is now in the manuscript.

      The motor biases in amplitude (extent) observed in Experiment 4 (Author response image 1) are qualitatively similar to the pattern reported by Gordon et al. 1994. While the exact peaks do not match perfectly, both datasets show a two-peaked pattern.

      Gordon et al. (1994) attributed the bias in amplitude to direction-dependent variation in movement speed which, in their view, arises from anisotropies in limb inertia. Specifically, moving the upper arm along its quasiorthogonal direction (i.e., rotation about the elbow) involves lower effective inertia than moving parallel to the upper-arm axis. Given the arm posture in both datasets, the upper limb points toward ~135°/315°, with the orthogonal direction corresponding to ~45°/225°. The two-peaked speed profiles in both our data (Author response image 1) and those of Gordon et al. are consistent with this prediction.

      Author response image 1.

      Gordon et al. (1994) noted that, while the extent bias function should mirror the speed bias function, the motor planning system might proactively compensate for the speed bias. Indeed, while the extent and speed bias functions are roughly aligned in their study, the two are misaligned in our Experiment 4. For example, the speed function peaks around 45° which corresponds to a valley in the extent bias function. The difference between their data and ours could be due to a difference in the starting point configuration. However, their model predicts alignment of the speed and extent functions independent of starting point configuration. In contrast, the TR+TG model does predict our observed extent bias function and yields predictions about how this should change with different start point configurations. As such, while heterogeneity in movement speed may contribute to extent bias to some degree, we think the transformation bias and visual-target bias likely play a larger role in determining the extent bias observed at movement endpoint.

      We have added a discussion section about the bias function reported by Gordon et al. (1994) and their account in the manuscript (lines 482-493). We do not repeat it here, as the content largely overlaps with the response above.

      (2) One of the most important new insights from this study is that the three single-source models lead to different bias patterns, with 1, 2 or 4 peaks. However, what I miss in the paper is an intuitive explanation why they do so. Now, the models are described and their predictions are shown, but it remains unclear where these distinct patterns come from. As scientists, we want to understand things, so I would very much appreciate if the authors can provide such an intuitive explanation, for instance using a mathematical proof. That could also identify how general these patterns are, or if there are certain requirements for them to occur (such as a certain shape of the transformation bias).

      Note that the closed-form mathematical expression for the motor bias function is not straightforward. As such, the intuition comes primarily from inspection, that is, from the model simulations themselves, which we show in Figure 1 of the paper. Importantly, the model predictions are insensitive to the parameter values over a reasonable range. Thus, the number of peaks predicted by each model is a core distinguishing feature. We present in the Supplementary Results a formalized mathematical analysis to illustrate how different models produce different numbers of peaks in the movement-bias function.

      (3) I think it's a good idea to change the previous "Visual Bias" into a "Target Bias". This raises the question whether the "Proprioceptive Bias" should not be changed into a "Hand Bias" or "Start Bias"?

      While we appreciate the reviewer’s point here, we prefer the term “Proprioceptive Bias” given that this term has been used in the literature and provides a contrast with sources of bias arising from vision. “Hand Bias” and "Start Bias” seem more ambiguous.

      L51: I think "would fall short" should be replaced by "would overshoot".

      L127: I think "biased toward the vertical axis" should be replaced by "biased away from the vertical axis".

      Figure 3 still contains the old terminology like T+V. Please replace by the new terminology.

      L255: Replace "Exp 1a" by "Exp 1b".

      L376: Replace 60 by 6.

      L831-2: I hope the summed LL was maximized, not minimized.

      Thanks for catching the typos. We have corrected all of them.

      Reviewer #2 (Recommendations for the authors):

      I think that Experiment 4 does not mention how many participants performed the study. (Only in the response to the reviewers I found this)

      We have added the number of participants to Fig. 4 (N = 30).

      I am very happy that the authors added the biomechanical simulation into the paper. I am not convinced that this addressed my concerns exactly but it is an excellent addition and the authors have now adjusted the text appropriately.

      We appreciate the positive response to our additional assessment of biomechanical factors. We welcome any additional information on how we might fully address this issue.

      line 826: extend -> extent

      Corrected.

      Figure 4. I think that the authors have put the wrong figure here. I cannot see any data for extent. I would need to see this figure (or please correct me), but the caption doesn't match the figure and I don't see the results clearly. (I think the review might have the correct figure.)

      We apologize for this mistake. We have now provided the correct Figure 4 in the paper (also included in the first page of the response letter).

      I am missing the detailed description on when the direction error and distance error were calculated for exp 4 - and what exactly was used? How did the authors examine the values without correction? What time point was used? Did I miss the analysis section for this?

      Participants were instructed to make fast, straight movements without any corrections and were given up to 1 s to complete the movement. Hand position was recorded once the movement speed dropped below 1 cm/s. On 99.8% of trials, movement speed did not increase once this threshold was passed, indicating that the participants adhered to the instructions. On the remaining trials, we detected a secondary corrective movement (increase in speed >5 cm/s). On these trials, we used the position recorded when the movement speed initially dropped below 1 cm/s as the endpoint position. The pattern of results would be the same were we to exclude these trials.
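
      The endpoint rule described above can be sketched as a short function. This is a hedged illustration rather than the authors' code: the speed traces are invented, movement onset is assumed to be the first sample at or above 1 cm/s, and a corrective movement is assumed to mean speed rising above 5 cm/s after the endpoint. The function name `find_endpoint` is hypothetical.

```python
def find_endpoint(speeds, stop_thresh=1.0, correction_thresh=5.0):
    """Return (endpoint_index, had_correction) for a speed trace in cm/s.

    endpoint_index: first sample after movement onset with speed below
    stop_thresh, or None if the movement never starts or never stops.
    had_correction: True if speed later exceeds correction_thresh.
    """
    # Assumed onset rule: first sample at or above the stop threshold.
    onset = next((i for i, s in enumerate(speeds) if s >= stop_thresh), None)
    if onset is None:
        return None, False
    end = next((i for i in range(onset + 1, len(speeds))
                if speeds[i] < stop_thresh), None)
    if end is None:
        return None, False
    had_correction = any(s > correction_thresh for s in speeds[end:])
    return end, had_correction


# Invented speed traces (cm/s), one without and one with a correction.
clean_trial = [0.0, 2.0, 30.0, 40.0, 20.0, 3.0, 0.5, 0.2, 0.1]
corrected_trial = [0.0, 2.0, 30.0, 10.0, 0.8, 2.0, 7.0, 3.0, 0.5]

print(find_endpoint(clean_trial))      # endpoint at the first sub-threshold sample
print(find_endpoint(corrected_trial))  # same rule; correction flagged separately
```

      In both cases the endpoint is taken at the initial drop below threshold, so flagged trials can be kept or excluded without changing how the endpoint itself is defined.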

      This information has been added to the Methods section (lines 661-666).
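
      The endpoint rule described above can be sketched in a few lines. This is only an illustrative reading of the response, not the authors' analysis code: the function name `find_endpoint`, the per-sample speed list, and the interpretation of a corrective movement as any later sample exceeding 5 cm/s are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def find_endpoint(
    speeds: Sequence[float],
    stop_thresh: float = 1.0,        # cm/s, threshold quoted in the response
    correction_thresh: float = 5.0,  # cm/s, assumed reading of ">5 cm/s"
) -> Tuple[Optional[int], bool]:
    """Return (endpoint_index, had_correction) for one trial.

    endpoint_index is the first sample at which speed falls below
    stop_thresh after the hand has started moving. had_correction is True
    if speed later exceeds correction_thresh (hypothetical criterion for a
    secondary corrective movement).
    """
    moving = False
    endpoint: Optional[int] = None
    for i, speed in enumerate(speeds):
        if endpoint is None:
            if speed >= stop_thresh:
                moving = True      # movement has started
            elif moving:
                endpoint = i       # first drop below the stop threshold
        elif speed > correction_thresh:
            # The endpoint stays at the initial stop, as in the response.
            return endpoint, True
    return endpoint, False


clean_trial = [0.2, 3.0, 20.0, 8.0, 0.5, 0.3]
corrected_trial = [0.2, 10.0, 0.5, 6.0, 0.4]
print(find_endpoint(clean_trial))
print(find_endpoint(corrected_trial))
```

      In both cases the endpoint is taken at the initial stop, so flagged trials can be kept or excluded without changing how the endpoint itself is defined.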

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      SOM+ interneurons such as Martinotti cells target the apical tufts of pyramidals in the cortex. Since interneurons in general are strongly implicated in mediating rhythmic population activity over a range of timescales, it is quite appropriate to study the consequence of rhythmic inhibition provided by SOM+ interneurons for synaptic integration, including the phenomenon of dendritic spikes. However, using conclusions from a singular study (ref 22) to identify the beta band as the rhythm mediated by SOM+ is not very accurate. SOM+ interneurons have been implicated in regulating rhythms centered just below 30 Hz (refs 22, 21). It is a range that lies in the grey zone of the traditional definition of beta and gamma. However, it is significantly higher than the 16 Hz rhythms explored in this study. It thus remains unknown how a 25-30 Hz rhythmic inhibition (that has an experimentally suggested role for dendrite targeting SOM+ INs) in apical tufts regulates dendritic spikes.

      We agree with the reviewer that the rhythms arising from SOM+ interneurons can extend their frequencies higher than the 16 Hz analyzed in this study. To address this, we have conducted a new set of simulations where we delivered distal dendritic inhibition across a range of frequencies, from 0.5 to 80 Hz (see new Results section “Frequency specific effects of rhythmic inhibition on neuronal integration”). These results revealed, surprisingly, that at 30 Hz their ability to entrain Ca<sup>2+</sup> and NMDA spikes degrades (but not Na<sup>+</sup> spikes). This suggests that beta rhythms in the 20-30 Hz range are operating at the highest frequency for which dendritically targeting inhibition will be effective. The implications are covered in the Discussion section “Interaction with microcircuitry”. They are:

      “Particularly in the visual cortex, SOM interneurons can generate a rhythm in the 25-30 Hz range [22]. We found this to be at the upper end of the frequency range for dendritic inhibitory rhythms to be effective in modulating NMDA and Ca<sup>2+</sup> spikes. If this rhythm solely recruited SOM interneurons, its effectiveness would be marginal. Potentially compensating for this, recent work has found that PV interneurons also participate in beta/low-gamma [23, 24] (but see [21, 22]). In our model, on its own when beta rhythmic inhibition was delivered perisomatically we found that it was less able to entrain spiking and had an overall hyperpolarizing effect. However, if delivered in conjunction with the distal dendritic inhibition arising from SOM interneurons, this may strengthen entrainment.”

      Distal dendritic inhibition has been previously shown to be more effective in controlling dendritic spikes. However, given the slow timescale of dendritic spikes, it can be hypothesized that high-frequency rhythmic inhibition would be ineffective in entraining the dendritic spikes either in distal or proximal location, as demonstrated by 4H and 5F, and vice versa. A computational study can take this further by exploring the robustness of this hypothesis. By sticking to a single-frequency definition of what constitutes Gamma (64 Hz) and Beta (16 Hz) inhibition, the current exploration does support the core hypothesis. However, given the temporal dynamics of dendritic spikes, it is valuable to learn, for example, the upper bound of "Beta" range (13-30Hz) inhibition that fails to phasically modulate them. In addition to the reason stated in the earlier paragraph, Alpha band activity (8-12 Hz), has been implicated (e.g. van Kerkoerle, 2014) in signaling of inter-areal feedback to the superficial layer in the cortex, potentially targeting apical tufts of pyramidals from multiple layers and resulting in alpha-range rhythmic inhibition. To make the findings significant, it might therefore be more pertinent to understand the consequences of ~10Hz rhythmic inhibition (in addition to the ~25-30 Hz Beta/Gamma) in the apical tufts for phasic modulation of dendritic spikes.

      We added an additional set of simulations that address this in the Results section ‘Frequency specific effects of rhythmic inhibition on neuronal integration’. In general, we found that dendritic and perisomatic inhibitory rhythms at lower frequencies could entrain AP generation, but with less functional specialization. This is explored in our Discussion section ‘Interneuron specializations and rhythm timescales’.

      The differential effect of Gamma and Beta range inhibition on basal and apical excitatory clusters is not convincing from the information provided. The basal cluster appears to overlap with perisomatic inhibitory synapses. The description in the methods does not have enough information to negate the visual perception (ln 979-81). With this understanding, it is not surprising that the correlation between excitation and APs is high (during the trough of gamma) for basal and not apical excitation. A more comparable scenario would be a more distal location of the basal excitatory cluster.

      While we stated in the original manuscript that we were contrasting ‘basal’ vs. ‘apical’ clustered inputs, this terminology did not reflect our intent with these analyses. We meant to contrast proximal vs. distal dendritic clustered synaptic inputs, which the reviewer correctly noted is confounded in the apical vs. basal comparison. We have rewritten these results, their discussion, and corresponding figure, to clearly state that we are contrasting proximal vs. distal synaptic input.

      Reviewer #2:

      The weaknesses are probably in some of the parameterizations of inhibitory synaptic dynamics. A unitary peak conductance of 1nS is very high for inhibitory synapses. This high value could invariably skew some of the network-level predictions. The authors could obtain specific parameters from the Neocortical Collaboration Portal (https://bbp.epfl.ch/nmcportal/microcircuit.html), which is an incredible resource for cortical neurons and synapses.

      We appreciate the valuable resource mentioned by the reviewer and will consult it when constructing future models. Regarding the present one, our choice of peak conductance was based on previous studies, namely:

      Egger R, Narayanan RT, Guest JM, Bast A, Udvary D, Messore LF, Das S, de Kock CPJ, Oberlaender M (2020) Cortical output is gated by horizontally projecting neurons in the deep layers. Neuron 105, 122-137.e128.

      and

      Xiang Z, Huguenard JR, Prince DA (2002) Synaptic inhibition of pyramidal cells evoked by different interneuronal subtypes in layer v of rat visual cortex. J Neurophysiol 88, 740-750.

      The study by Egger et al. used an inhibitory peak conductance of 1 nS and was simulating circuitry very similar to ours. We validated these synapses in pilot simulations that sought to characterize the resulting IPSPs and IPSCs, and whose results can be seen in Table 1 of our methods. These synapses exhibited IPSCs whose peak amplitudes ranged over values (~24162 pA) that agreed with the experimental literature, such as Xiang et al.

      Given this, we feel our parameterization of inhibitory synapses does not warrant any changes.

      Reviewer #3:

      What disappointed me a bit was the lack of a concise summary of what we learned beyond the fact that beta and gamma act differently on dendritic integration. The individual paragraphs of the discussion often are 80% summary of existing theories and only a single vague statement about how the results in this study relate. I think a summarizing schematic or similar would help immensely.

      We agree with the reviewer that a summary schematic would help the reader. This has been added to the manuscript as Figure 11. It demonstrates the principal findings of the paper and is referenced in the opening paragraph of the discussion section.

      Orthogonal to that, there were some points where the authors could have offered more depth on specific features. For example, the authors summarized that their "results suggest that the timescales of these rhythms align with the specialized impacts of SOM and PV interneurons on neuronal integration". Here they could go deeper and try to explain why SOM impact is specialized at slower time scales. (I think their results provide enough for a speculative outlook.)

      This discussion has been expanded under the section “Interneuron specializations and rhythm timescales”. The added text is:

      “So, while our results suggest that spatial targeting of SOM and PV interneurons aligns with the timescales of their network-level rhythms, it could also be that their timing and subcellular localization interact to produce specialized neuron-level functions [85]. For instance, NMDA and Ca<sup>2+</sup> spikes in the distal dendrites last for ~50 ms, making the slower beta rhythm more appropriate for bidirectionally controlling them. Both can be described as dynamical systems with distinct phases with differing sensitivity to inhibition. Ca<sup>2+</sup> spikes are dynamical events comprised of an initiation, plateau, and termination phase. Inhibition delivered during the plateau phase shortens their duration [86]. If the beta rhythm is comprised of cycling between periods of elevated excitation (increased NMDA spike generation) followed by elevated inhibition, then Ca<sup>2+</sup> spike initiation will tend to occur during the excitatory phase, and its plateau during the subsequent inhibitory phase. A plateau during the inhibitory phase will more quickly enter termination. This is bidirectional control. On the other hand, slower rhythms (e.g. 1 Hz) initiate Ca<sup>2+</sup> spikes during the excitatory phase that plateau and enter termination autonomously, before the inhibitory phase is reached. The same principle holds for NMDA spikes [87]. As a result, rhythms in the range from 15-30 Hz are optimal for synchronizing the onsets and offsets of dendritic spikes across a population of neurons.

      The integrative effects of gamma (>40 Hz) are also specialized. Low frequency inhibitory rhythms delivered to the soma tended to shift the membrane potential higher or lower with the rhythm’s phase, effectively bringing it closer or farther from AP generation but not changing the neuron’s sensitivity to fast synaptic inputs. In the gamma frequency range, this is reversed, with the mean membrane potential not varying with rhythm phase but with a shifting bias to positive or negative membrane potential fluctuations. In addition, the trough phase of gamma lowers the threshold for AP generation, while slower rhythms like beta only raise the threshold. Consequently, the timing of gamma is ideal for increasing the sensitivity of the neuron to rapid excitation. This agrees with the observation that gamma oscillations accompany rapid excitation-inhibition balancing [88].”

      We also extended our discussion section ‘Relevance to coding’ to explore how beta and gamma rhythms can support sparse vs. dense population coding, respectively. It reads:

      “One interpretation of rhythms arising from local inhibitory feedback is that they maintain the balance between excitation and inhibition. This can be thought of as a normalization operation that maintains activity within a set range. Normalization can be achieved either through a subtractive effect that raises the threshold for initiating an action potential, or a multiplicative effect that lowers the slope of the relationship between excitation and action potential firing rate. When considered at the population level, these normalization effects impact coding in different ways. Subtractive normalization increases sparsity by dropping out neurons whose excitation is below the raised threshold. Multiplicative normalization, however, encourages dense codes by scaling down firing rates and compressing the range of firing rates. This study found that while both perisomatic and distal dendritic inhibition produced subtractive effects, only perisomatic had a multiplicative effect. Tying this to beta and gamma, beta rhythms may encourage sparse population codes while gamma allows for dense.”
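
      The distinction between subtractive and multiplicative normalization drawn in the quoted passage can be illustrated with a toy threshold-linear rate model. This sketch is not the biophysical model used in the study; the function names, the threshold and gain values, and the example population are hypothetical.

```python
from typing import List

BASE_THRESHOLD = 1.0  # hypothetical excitation needed to fire
BASE_GAIN = 10.0      # hypothetical slope of the rate vs. excitation curve


def firing_rate(excitation: float, threshold: float, gain: float) -> float:
    """Threshold-linear rate: zero below threshold, linear above it."""
    return max(0.0, gain * (excitation - threshold))


def subtractive(population: List[float], shift: float) -> List[float]:
    """Subtractive normalization: raise the firing threshold by `shift`."""
    return [firing_rate(e, BASE_THRESHOLD + shift, BASE_GAIN) for e in population]


def multiplicative(population: List[float], scale: float) -> List[float]:
    """Multiplicative normalization: scale the slope by `scale` (< 1)."""
    return [firing_rate(e, BASE_THRESHOLD, BASE_GAIN * scale) for e in population]


def count_active(rates: List[float]) -> int:
    """Number of neurons with a non-zero firing rate."""
    return sum(1 for r in rates if r > 0.0)


population = [0.5, 1.2, 1.8, 3.0]  # excitation received by four toy neurons
baseline = subtractive(population, shift=0.0)
sub = subtractive(population, shift=1.0)
mul = multiplicative(population, scale=0.5)

print("active neurons:", count_active(baseline), count_active(sub), count_active(mul))
print("peak rates:", max(baseline), max(sub), max(mul))
```

      In this toy population the raised threshold silences weakly driven neurons (a sparser code), whereas the reduced slope keeps the same neurons active and compresses their rates (a denser code).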

      Beyond that, the authors invite the community to reappraise the role of gamma and beta in coding. This idea seems to be hindered by the fact that I cannot find a mention of a release of the model used in this work. The base pyramidal cell model is of course available from the original study, but it would be helpful for follow-up work to release the complete setup including excitatory and inhibitory synapses and their activation in the different simulation paradigms used. As well as code related to that.

      We have added a Code and Data Availability section that addresses this. It reads: “Simulation code is deposited at ModelDB at https://modeldb.science/2019883. The raw simulation data are available from DBH upon request. Analysis code is posted as a GitHub repo at https://github.com/dbheadley/InhibOnDendComp.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      The Drosophila wing disc is an epithelial tissue, the study of which has provided many insights into the genetic regulation of organ patterning and growth. One fundamental aspect of wing development is the positioning of the wing primordia, which occurs at the confluence of two developmental boundaries, the anterior-posterior and the dorsal-ventral. The dorsal-ventral boundary is determined by the domain of expression of the gene apterous, which is set early in the development of the wing disc. For this reason, the regulation of apterous expression is a fundamental aspect of wing formation.

      In this manuscript, the authors used state-of-the-art genomic engineering and a bottom-up approach to analyze the contribution of a 463 base pair fragment of apterous regulatory DNA. They find compelling evidence about the inner structure of this regulatory DNA and the upstream transcription factors that likely bind to this DNA to regulate apterous early expression in the Drosophila wing disc.

      Strengths:

      This manuscript has several strengths concerning both the experimental techniques used to address the problem of gene regulation and the relevance of the subject. To identify the mode of operation of the 463 bp enhancer, the authors use a balanced combination of different experimental approaches. First, they use bioinformatic analysis (sequence conservation and identification of transcription factors binding sites) to identify individual modules within the 463 bp enhancer. Second, they identify the functional modules through genetic analysis by generating Drosophila strains with individual deletions. Each deletion is characterized by looking at the resulting adult phenotype and also by monitoring apterous expression in the mutant wing discs. They then use a clever method to interfere in a more dynamic manner with the function of the enhancer, by directing the expression of catalytically inactive Cas9 to specific regions of this DNA. Finally, they recur to a more classical genetic approach to uncover the relevance of candidate transcription factors, some of them previously known and others suggested by the bioinformatic analysis of the 463 bp sequence. This workflow is clearly reflected in the manuscript, and constitutes a great example of how to proceed experimentally in the analysis of regulatory DNA.

      We thank the reviewer for these positive comments on the manuscript.

      Weaknesses:

      There are several caveats with the data that might be construed as weaknesses. Some of them are intrinsic to this detailed analysis or to the experimental difficulties of dealing with the wing disc in its earliest stages, and others are more conceptual and are offered here in case the authors may wish to consider them.

      (1) The primordium of the wing region of the wing imaginal disc is defined by the expression of the gene vestigial, which is regulated by inputs coming from the dorsal-ventral boundary (Notch and wg) and from the anterior-posterior boundary (Dpp). Having such a principal role in wing primordium specification and expansion, I am surprised that this manuscript does not mention this gene in the main text and only contains indirect references to it. I consider that the manuscript would have benefited a lot by including vestigial in the analysis, at least as a marker of early wing primordium. This might allow us to visualize directly the positioning of the primordium in the apterous mutants generated in this study, adding more verisimilitude to the interpretations that place this domain based on indirect evidence.

      Vg does indeed play a critical role in the formation of the wing disc, and it is an ideal marker for the identification of the wing pouch. In the updated version of the article, we have now followed the expression of vg in some of the OR463 mutants via immunostaining of the Vg protein (Supplementary Figure 6). Cells within posterior wing outgrowths in Δm1 flies were invariably positive for Vg. This result further supports our previous identification of these cells as pouch cells. In those mutants in which no cross-over between DV and AP was observed, vg expression was severely reduced or absent, indicating that the wing pouch had not been specified. We thank the reviewer for this experimental idea, which we believe strengthens the final manuscript.

      We have added to the text:

      “To identify the nature of the posterior outgrowths, we performed anti-Vestigial (Vg) antibody staining of Δm1 mutants (Supplementary Figure 6). Vg is a key regulator of wing specification and also participates in wing growth and patterning (Baena-Lopez & García-Bellido, 2006; Kim et al., 1996; Zecca & Struhl, 2007a). In those discs in which the stripe was extended and the P compartment was enlarged, Vg was detected throughout the outgrowth, supporting the wing pouch identity of this region (Supplementary Figure 6B). Hemizygous Δm3 mutants presented a highly reduced anti-Vg signal, which suggests that no wing pouch is specified in these mutants (Supplementary Figure 6C).”

      (2) The authors place some emphasis on the idea that their work addresses possible coordination between setting the D/V boundary and the A/P boundary:

      Abstract: "Thus, the correct establishment of ap expression pattern with respect to en must be tightly controlled", "...challenging the mechanism by which apE miss-regulation leads to AP defects." "Detailed mutational analyses using CRISPR/Cas revealed a role of apE in positioning the DV boundary with respect to the AP boundary"

      Introduction: "However, little is known about how the expression pattern of ap is set up with respect to that of en. In other words, how is the DV boundary positioned with respect to the AP boundary?"

      "How such interaction between ap and the AP specification program arises is unknown."

      Results: "Some of these phenotypes are reminiscent of those reported for apBlot (Whittle, 1979) and point towards a yet undescribed crosstalk between ap early expression and the AP specification program."

      At the same time, they express the notion, with which this reviewer agrees, that all defects observed in A/P patterning arising as a result of apterous misregulation are due to the fact that in their mutants, apterous expression is lost mainly in the posterior dorsal compartment, bringing novel confrontations between the A/P and the D/V boundaries.

      To me, the key point is why the expression of apterous in different mutants of the OR463 enhancer affects only the posterior compartment. This should be discussed because it is far from obvious that apterous expression has different regulatory requirements in the anterior and posterior compartments.

      We agree with the reviewer that the differential effect of the mutations on the expression of ap in the A and P compartment is a key factor underlying our explanation of how the phenotypes arise. To clarify this point, we have now extended our first discussion point. Moreover, we have included some other references of differential enhancer regulation in different wing disc compartments. In addition, we have discussed whether this effect has to do with the different regulation of the enhancer in the A and P compartment or due to regulation of downstream effectors.

      Added paragraph:

      “Although apE is active throughout the dorsal compartment, its disruption leads to a preferential loss of ap expression in posterior cells. The asymmetric effect of apE perturbation on the anterior and posterior compartments suggests that apE transcriptional control is not equivalent across the A/P axis. Compartment-dependent differences in enhancer regulation have also been documented in other developmental contexts; for example, the Distal-less DMX-R element is interpreted through distinct cofactor combinations (Sloppy paired anteriorly and Engrailed posteriorly) (Gebelein et al., 2004), and specific mutations within DMX-R preferentially disrupt enhancer function in anterior versus posterior cells. It is possible that apE is more sensitive to misregulation due to differential transcriptional regulation across compartments. Nevertheless, we cannot exclude the possibility that the posterior bias we observe arises not from enhancer logic per se, but from intrinsic differences in tissue architecture or the dynamics of boundary positioning during wing disc development.”

      (3) The description of gene expression in the wing disc of novel apterous mutants is only carried out in late third instar discs (Figs. 2, 3, 5, and 7). This is understandable given the technical difficulties of dealing with early discs, as those shown in the analysis of candidate apterous regulatory transcription factors (Fig. 4F, Fig. 6 C-D). However, because the effects of the mutants on apterous expression are expected to occur much earlier than the time of expression analysis, this fact should be discussed.

      We agree with the reviewer regarding the limitations of our analysis whenever we analyzed third instar larvae to assess the expression of the OR463 enhancer. We have included a statement in which this is mentioned in the discussion:

      “It is important to acknowledge that all expression analyses were conducted in third-instar discs, a stage that follows the initial establishment of ap expression. Earlier effects are therefore inferred rather than directly observed, as imaging and staging of early discs present significant technical challenges due to their small size and fragility. A direct observation of the early wing disc across mutant conditions would likely help to clarify the role of the discovered factors during early ap expression.”

      Reviewer #2 (Public Review):

      In their manuscript, "Transcriptional control of compartmental boundary positioning during Drosophila wing development," Aguilar and colleagues do an exceptional job of exploring how tissue axes are established across Drosophila development. The authors perform a series of functional perturbations using mutational analyses at the native locus of apterous (ap), and perform tissue-specific enhancer disruption via dCas9 expression. This innovative approach allowed them to explore the spatio-temporal requirements of an apterous enhancer. Combining these techniques allowed the authors to explore the molecular basis of apterous expression, connecting the genotypes to the phenotypical effects of enhancer perturbations. To me, this paper was a beautiful example of what can be done using modern drosophila genetics to understand classic questions in developmental biology and transcriptional regulation.

      In sum, this was a rigorous paper bridging scales from the molecular to phenotypes, with new insight into how enhancers control compartmental boundary positioning during Drosophila wing development.

      We would like to thank the reviewer for their positive and encouraging comments, as well as for the careful review of the manuscript and figures. We have adopted most of the suggestions in the new manuscript.

      Reviewer #3 (Public Review):

      In this manuscript, authors use the Drosophila wing as a model system and combine state-of-the-art genetic engineering to identify and validate the molecular players mediating the activity of one of the cis-regulatory enhancers of the apterous gene involved in the regulation of its expression domain in the dorsal compartment of the wing primordium during larval development.

      (1) The authors raise two very important questions in the Introduction: (1) who is locating the relative position of the AP and DV boundaries in the developing wing, and (2) who is responsible for the maintenance of the apterous expression domain late in larval development. None of these two questions have been responded to and, indeed, the summary of the work (as stated in the conclusions of the last paragraph of the Introduction) does not resolve any of these questions.

      We believe the results presented, together with those added during the revision, shed some light on the positioning of the boundary. We proposed that the combined integration of four TFs by the OR463 enhancer is fundamental for the correct positioning. Additionally, we proposed a model of how these positioning problems result in the phenotypes observed (Supplementary Figure 7, now also shown in Figure 2D). Our results indicate that ap expression in the PD quadrant is particularly sensitive to mutations in the enhancer, which we have now further elaborated on in the first part of the discussion. Together, we believe that our results do tackle the first problem posed in the introduction, while not completely solving it. As for the second question, we have tried to remove any suggestion that this article tries to explain the later regulation of apterous. This misunderstanding probably arises from a sentence in the introduction which has now been deleted. The means of maintaining ap expression in later stages has been partially explored previously (see Bieli et al 2015) and is the subject of our current studies.

      (2) The authors have identified two different regions whose deletions give very interesting phenotypes in the adult wing (AP identity change & outgrowths, and loss of wing), and have bioinformatically identified and functionally verified 4 TFs that mediate the activity of these regions by their capacity to phenocopy the wing phenotype. While identification of the 2 TFs acting on the m1 is incremental with respect to previous work on the identification of the enhancer responsible for the early expression of Ap, identification of Antp and Grn does not explain the loss of function phenotype of the m3 enhancer. Do any of these results shed any light on the first two Qs? Do these results explain the compartment boundary position in the wing as stated in the title? Expression of lacZ reporter assays is fundamental to demonstrate their model of Figure 8. The reduction of the PD compartment is difficult to understand by the sole reduction in ap expression in this region (which has not been demonstrated).

      We agree that the identification of Antp and Grn does not by itself explain the loss-of-function phenotype of the m3 enhancer. However, these transcription factors represent the best current candidates for direct regulators for this enhancer. We have clarified in the text that Antp and Grn may not act as instructive inputs but rather play a permissive role in enabling ap expression through m3. Importantly, the dCas9-mediated perturbation experiments directly demonstrate that targeted manipulation of apE in this region is sufficient to produce the characteristic duplications, providing functional evidence that apE activity underlies the observed phenotypes. In addition, lacZ reporter assays confirm that apE expression is indeed affected in all cases where the experimental setup permitted detection. Together, these results validate that the observed morphological phenotypes stem from perturbation of apE activity and support the proposed model for enhancer regulation and its role in compartment boundary maintenance.

      (3) The authors state in one of the sections "Spatio-temporal analysis of apE via dCas9". No temporal manipulation of gene activity is shown. The authors should combine GAL4/UAS with the Gal80ts to demonstrate the temporal requirements of Antp/Grn and Pnt/Hth as depicted in their model of Figure 8.

      We agree with the reviewer that the temporal dimension was not explored in the first version of the manuscript (aside from the temporal constraints of the en-Gal4 driver). As suggested by the reviewer, we have now used a tub-Gal80ts allele to temporally control the enhancer perturbation and delimit its window of activity. The results are included in two new panels in Figure 3 (H and H’). The new data agree with the notion that the apE enhancer is important up to L2 stages but dispensable later in development. We have added the following paragraph to the text:

      “To define the developmental time window during which the apE enhancer remains sensitive to repression, we combined the temperature-sensitive tub-Gal80<sup>ts</sup> system with temporally controlled expression of dCas9. Animals carrying the en-Gal4, tub-Gal80<sup>ts</sup>, UAS-dCas9 and U6-OR463gRNA(4x) transgenes were maintained at 18 °C to suppress dCas9 expression. Independent sets of embryos were then shifted to 29 °C at successive developmental intervals ranging from 0 to 168 h after egg laying (AEL), so that dCas9 induction occurred at distinct time points in development (Figure 3H). Under these conditions, dCas9 transcription was induced only after the temperature shift, while the gRNAs were expressed constitutively. Wing phenotypes were quantified in adult progeny as a readout of apE enhancer perturbation. When dCas9 was expressed from embryonic or early larval stages (0–48 h AEL), nearly all wings (70–90%) displayed severe ap-like phenotypes, including posterior compartment duplication and loss of anterior–posterior boundary integrity. Shifting animals later (48–72 h AEL) still produced a majority (~66%) of abnormal wings, whereas induction after 72 h AEL resulted in progressively weaker effects and complete loss of phenotypes by 96 h AEL (Figure 3H’).

      These results delineate the developmental period during which apE activity is required for proper wing patterning. Perturbation during the first half of the second larval instar (≤ 96 h at 18 °C) was sufficient to elicit strong ap-like transformations, consistent with the enhancer being functionally required during early larval stages and becoming dispensable thereafter. The temporal decline in phenotype penetrance thus reflects the progressive loss of apE sensitivity to dCas9-mediated repression, providing a precise estimate of when its activity is no longer required for wing morphogenesis.”

      (4) The authors have not managed to explain the AP phenotype. Thus, this work opens many unresolved questions and does not resolve the title, which is a big overstatement. Thus, strengths (technically excellent), weakness (there is not much to learn about wing development and apterous regulation from these results besides the incremental identification of 4 additional TFs mediating the regulation of ap expression by their ability to phenocopy regulatory mutations of the apterous gene).

      As mentioned in response to reviewer 1, we have indeed no concrete explanation for why the P compartment seems more sensitive to mutations. We have now further discussed this point (see the paragraph below, now included in the discussion). As for how the adult phenotypes arise from the mutant wing discs, we have a good idea (see Supplementary Figure 7 and Figure 2).

      We are pleased to hear that the reviewer considers our article technically valuable. Therefore, we have reformulated the title so that the technical merits play a bigger role in it:

      “in situ mutational screening and CRISPR interference demonstrate that the apterous Early enhancer is required for developmental boundary positioning”

      Paragraph added to the discussion:

      “Although apE is active throughout the dorsal compartment, its disruption leads to a preferential loss of ap expression in posterior cells. The asymmetric effect of apE perturbation on the anterior and posterior compartments suggests that apE transcriptional control is not equivalent across the A/P axis. Compartment-dependent differences in enhancer regulation have also been documented in other developmental contexts; for example, the Distal-less DMX-R element is interpreted through distinct cofactor combinations (Sloppy paired anteriorly and Engrailed posteriorly) (Gebelein et al., 2004), and specific mutations within DMX-R preferentially disrupt enhancer function in anterior versus posterior cells. It is possible that apE is more sensitive to misregulation due to differential transcriptional regulation across compartments. Nevertheless, we cannot exclude the possibility that the posterior bias we observe arises not from enhancer logic per se, but from intrinsic differences in tissue architecture or the dynamics of boundary positioning during wing disc development.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      Formatting of references should be checked throughout the manuscript

      Reviewer #2 (Recommendations For The Authors):

      Here, I note a few points that would help clarify the manuscript and connect it with a broader community.

      Figure 1: it could help the reader to add the landing site genetic scheme to the main figure.

      In a first draft, that was exactly the configuration we used, but after comparing both versions we concluded that including the landing site takes some of the focus away from the phenotypes.

      Figure 1: what species were used for the conservation alignment? Further details would be nice to add here.

      We have now added a section on bioinformatic analysis, which was missing from the original manuscript:

      Sequence conservation of the OR463 fragment within the ap upstream intergenic region was analysed across different dipteran species using the “Cons 124 Insects” multiple-alignment track of the D. melanogaster dm6 genome on the UCSC Genome Browser (Kent et al., 2002, https://genome.ucsc.edu). Conservation scores were obtained from phastCons (Siepel et al., 2005) and used to delineate conserved and less conserved blocks within OR463. Conserved transcription factor binding sites were predicted with MotEvo (Arnold et al., 2011), which defined four conserved modules (m1–m4) and six inter-modules (N1–N6). Additional motif analysis was performed using the JASPAR CORE Insecta database and the Target Explorer tool to cross-validate conserved binding-site predictions and refine motif assignments within the enhancer.

      From Figure 2: I would consider moving the model or portions of it to a main figure. These models, while descriptive, really help make the manuscript more approachable. Note that eLife does not have forced figure requirements.

      We have adopted the reviewer’s suggestion and are very grateful for it. We think the figure has greatly improved. The final figure now highlights a small part of the model, which is still included in the Supplementary Figure.

      Figure 5: This figure is fantastic, and the results are particularly important. I would recommend increasing the weight of the arrows from D to E, making it more obvious. Did the authors consider any temperature or other perturbations to look at robustness? They mention "robustness" a few times, and this could be an excellent system to explore a bit further. For panels F and G, it would be nice to have a bit of biochemistry here to test the spacing requirements' effects on the distances (but it's great phenotypical data, regardless).

      We have chosen a darker grey to highlight the lines. 

      We appreciate the reviewer’s suggestions. With respect to robustness assays, such as temperature perturbations, we agree that the apE enhancer would be a suitable system for such experiments. However, these analyses would move the study beyond its current scope, which is focused on defining the regulatory logic of boundary positioning through mutational dissection and CRISPRi. We therefore prefer not to expand the work in this direction here, but we note that this would be an interesting avenue for future investigation.

      Similarly, biochemical assays probing spacing requirements would provide additional mechanistic insight but would represent a separate line of work. In this manuscript, we aimed to establish the functional consequences of motif spacing using in vivo genetic and phenotypic analyses, which we believe sufficiently support our conclusions.

      Thank you for the insight.

      Discussion: To the point "most point mutations or short deletions in enhancer regions have little effect on gene expression" I would push the authors to discuss their work in relation to Fuqua et al., (Nature 2020) and Kvon et al., (Cell 2020). Their work is consistent with enhancers being sensitive to mutations, and this warrants further discussion because it could be important for the transcription field.

      Hox genes as pioneer factors, I would recommend citing Loker et al., (Curr Biol 2021), as an example of Hox genes functioning as a pioneer factor.

      We thank the reviewer for this suggestion. We have now added a short paragraph in the Discussion noting how our observations may relate to the mutational patterns described in Fuqua et al. (2020) and Kvon et al. (2020), while keeping the interpretation tentative. The text now says:

      “Recent large-scale enhancer mutagenesis studies have shown that the mutational consequences within enhancers can vary widely. In some cases, many nucleotide positions appear tolerant to single-base changes and only a small subset of mutations produce clear functional effects (Kvon et al., 2020). In other enhancers, regulatory information is distributed more densely, and mutations at multiple positions can alter output (Fuqua et al., 2020). Together, these studies illustrate that enhancer sensitivity is not uniform but depends on enhancer-specific features such as motif organization, cooperativity, and redundancy. Within this broader landscape, the apE enhancer appears to represent a particularly sensitive case.”

      We also included a citation to Loker et al. (2021) in connection with the possible pioneer-like contribution of HOX input to apE.

      We would like to thank all reviewers for their effort.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      I read the paper by Parrotta et al with great interest. The authors are asking an interesting and important question regarding pain perception, which is derived from predictive processing accounts of brain function. They ask: If the brain indeed integrates information coming from within the body (interoceptive information) to comprise predictions about the expected incoming input and how to respond to it, could we provide false interoceptive information to modulate its predictions, and subsequently alter the perception of such input? To test this question, they use pain as the input and the sounds of heartbeats (falsified or accurate) as the interoceptive signal.

      Strengths:

      I found the question well-established, interesting, and important, with important implications and contributions for several fields, including neuroscience of prediction-perception, pain research, placebo research, and health psychology. The paper is well-written, the methods are adequate, and the findings largely support the hypothesis of the authors. The authors carried out a control experiment to rule out an alternative explanation of their finding, which was important.

      Weaknesses:

      I will list here one theoretical weakness or concern I had, and several methodological weaknesses.

      The theoretical concern regards what I see as a misalignment between a hypothesis and a result, which could influence our understanding of the manipulation of heartbeats, and its meaning: The authors indicate from prior literature and find in their own findings, that when preparing for an aversive incoming stimulus, heartbeats *decrease*. However, in their findings, manipulating the heartbeats that participants hear to be slower than their own prior to receiving a painful stimulus had *no effect* on participants' actual heartbeats, nor on their pain perceptions. What authors did find is that when listening to heartbeats that are *increased* in frequency - that was when their own heartbeats decreased (meaning they expected an aversive stimulus) and their pain perceptions increased.

      This is quite complex - but here is my concern: If the assumption is that the brain is collecting evidence from both outside and inside the body to prepare for an upcoming stimulus, and we know that *slowing down* of heartbeats predicts an aversive stimulus, why is it that participants responded with a change in pain perception and physiological response when they listened to *increased heartbeats* and not decreased? My interpretation is that the manipulation did not fool the interoceptive signals that the brain collects, but rather the more conscious experience of participants, which may then have been translated to fear/preparation for the incoming stimulus. As the authors indicate in the discussion (lines 704-705), participants do not *know* that decreased heartbeats indicate an upcoming aversive stimulus, and I would even argue the opposite - the common knowledge or intuitive response is to increase alertness when we hear increased heartbeats, like in horror films or similar scenarios. Therefore, the unfortunate conclusion is that what the authors assume is a manipulation of interoception - to me seems like a manipulation of participants' alertness or conscious experience of possible danger. I hope the (important) distinction between the two is clear enough because I find this issue of utmost importance for the point the paper is trying to make. To summarize in one sentence - if it is decreased heartbeats that lead the brain to predict an approaching aversive input, and we assume the manipulation is altering the brain's interoceptive data collection, why isn't it responding to the decreased signal? My conclusion is that this is not in fact a manipulation of interoception, unfortunately.

      We thank the reviewer for their comment, which gives us the opportunity to clarify what we believe is a theoretical misunderstanding, arising from a point that we did not make sufficiently clear in the previous version of the manuscript. The reviewer suggests that a decreased heart rate itself might act as an internal cue for a forthcoming aversive stimulus, and questions why our manipulation of slower heartbeats then did not produce measurable effects.

      The central point is this: decreased heart rate is not a signal the brain uses to predict a threat, but is a consequence of the brain having already predicted the threat. This distinction is crucial. The well-known anticipatory decrease of heartrate serves an allostatic function: preparing the body in advance so that physiological responses to the actual stressor (such as an increase in sympathetic activation) do not overshoot. In other words, the deceleration is an output of the predictive model, not an input from which predictions are inferred. It would be maladaptive for the brain to predict threat through a decrease in heartrate, as this would then call for a further decrease, creating a potential runaway cycle.

      Instead, increased heart rate is a salient and evolutionarily conserved cue for arousal, threat, and pain. This association is reinforced both culturally - for example, through the use of accelerating heartbeats in films and media to signal urgency, as R1 mentions - and physiologically, as elevated heart rates reliably occur in response to actual (not anticipated) stressors. Decreased heartrates, in contrast, are reliably associated with the absence of stressors, for example during relaxation and before (and during) sleep. Thus, across various everyday experiences, increased (instead of decreased) heartrates are robustly associated with actual stressors, and there is no a priori reason to assume that the brain would treat decelerating heartrates as a cue for threat. As we argued in previous work, “the relationship between the increase in cardiac activity and the anticipation of a threat may have emerged from participants’ first-hand experience of increased heart rates to actual, not anticipated, pain” (Parrotta et al., 2024). The changes in heart rate and pain perception that we hypothesize (and observe) are therefore fully in line with the prior literature on the anticipatory compensatory heartrate response (Bradley et al., 2008, 2005; Colloca et al., 2006; Lykken et al., 1972; Taggart et al., 1976; Tracy et al., 2017; Skora et al., 2022), as well as with Embodied Predictive Coding models (Barrett & Simmons, 2015; Pezzulo, 2014; Seth, 2013; Seth et al., 2012), which assume that our body is regulated through embodied simulations that anticipate likely bodily responses to upcoming events, thereby enabling anticipatory or allostatic regulation of physiological states (Barrett, 2017).

      We now add further explanation to this point to the Discussion (lines 740-758) and Introduction (lines 145-148; 154-156) of our manuscript to make this important point clearer.

      Barrett, L. F., & Simmons, W. K. (2015). Interoceptive predictions in the brain. Nature reviews neuroscience, 16(7), 419-429.

      Barrett, L. F. (2017). The theory of constructed emotion: An active inference account of interoception and categorization. Social cognitive and affective neuroscience, 12(1), 1-23.

      Bradley, M. M., Moulder, B., & Lang, P. J. (2005). When good things go bad: The reflex physiology of defense. Psychological science, 16(6), 468-473.

      Bradley, M. M., Silakowski, T., & Lang, P. J. (2008). Fear of pain and defensive activation. PAIN®, 137(1), 156-163.

      Colloca, L., Petrovic, P., Wager, T. D., Ingvar, M., & Benedetti, F. (2010). How the number of learning trials affects placebo and nocebo responses. Pain®, 151(2), 430-439.

      Lykken, D., Macindoe, I., & Tellegen, A. (1972). Preception: Autonomic response to shock as a function of predictability in time and locus. Psychophysiology, 9(3), 318-333.

      Taggart, P., Hedworth-Whitty, R., Carruthers, M., & Gordon, P. D. (1976). Observations on electrocardiogram and plasma catecholamines during dental procedures: The forgotten vagus. British Medical Journal, 2(6039), 787-789.

      Tracy, L. M., Gibson, S. J., Georgiou-Karistianis, N., & Giummarra, M. J. (2017). Effects of explicit cueing and ambiguity on the anticipation and experience of a painful thermal stimulus. PloS One, 12(8), e0183650.

      Parrotta, E., Bach, P., Perrucci, M. G., Costantini, M., & Ferri, F. (2024). Heart is deceitful above all things: Threat expectancy induces the illusory perception of increased heartrate. Cognition, 245, 105719.

      Pezzulo, G. (2014). Why do you fear the bogeyman? An embodied predictive coding model of perceptual inference. Cognitive, Affective & Behavioral Neuroscience, 14(3), 902-911.

      Seth, A., Suzuki, K., & Critchley, H. (2012). An Interoceptive Predictive Coding Model of Conscious Presence. Frontiers in Psychology, 2. https://www.frontiersin.org/articles/10.3389/fpsyg.2011.00395

      Seth, A. K. (2013). Interoceptive inference, emotion, and the embodied self. Trends in Cognitive Sciences, 17(11), 565-573.

      Skora, L. I., Livermore, J. J. A., & Roelofs, K. (2022). The functional role of cardiac activity in perception and action. Neuroscience & Biobehavioral Reviews, 104655.

      I will add that the control experiment - with an exteroceptive signal (knocking of wood) manipulated in a similar manner - could be seen as evidence of the fact that heartbeats are regarded as an interoceptive signal, and it is an important control experiment, however, to me it seems that what it is showing is the importance of human-relevant signals to pain prediction/perception, and not directly proves that it is considered interoceptive. For example, it could be experienced as a social cue of human anxiety/fear etc, and induce alertness.

      The reviewer asks us to consider whether our measured changes in pain response happen not because the brain treats the heartrate feedback in Experiment 1 as an interoceptive stimulus, but because heartbeat sounds could have signalled threat on a more abstract, perhaps metacognitive or affective, level, in contrast to the less visceral control sounds in Experiment 2. We deem this highly unlikely for several reasons.

      First, as we point out in our response to Reviewer 3 (Point 3), if this were the case, the different sounds in the two experiments should have induced overall (between-experiment) differences in pain perception and heart rate, driven by the (supposedly) generally more threatening heartbeat sounds. However, when we added such comparisons, no such between-experiment differences were obtained (see Results Experiment 2, and Supplementary Materials, Cross-experiment analysis between-subjects model). Instead, we only find a significant interaction between experiment and feedback (faster, slower). Thus, it is not the heartbeat sounds per se that induce the measured changes to pain perception, but the modulation of their rate; identical changes to the rate of non-heartbeat sounds produce no such effects. In other words, pain perception is sensitive to a change in heart rate feedback, as we predicted, rather than to the mere presence of heartbeat sounds (as one would need to predict if heartbeat sounds had more generally induced threat or stress).

      Second, one may suspect that it is precisely the acceleration of heartrate feedback that could act as a cue to arousal, while accelerated exteroceptive feedback would not. However, if this were the case, one would need to predict a general heart rate increase with accelerated feedback, as this is the general physiological marker of increasing alertness and arousal (e.g. Tousignant-Laflamme et al., 2005; Terkelsen et al., 2005; for a review, see Forte et al., 2022). However, the data show the opposite, with real heartrates decreasing when the heartrate feedback increases. This result is again fully in line with the predicted interoceptive consequences of accelerated heartrate feedback, which mandates an immediate autonomic regulation, especially when preparing for an anticipated stressor.

      Third, our view is further supported by neurophysiological evidence showing that heartbeat sounds, particularly under the belief they reflect one’s own body, are not processed merely as generic aversive or “human-relevant” signals. For instance, Vicentin et al. (2024) showed that simulated faster heartbeat sounds elicited stronger EEG alpha-band suppression, indicative of increased cortical activation over frontocentral and right frontal areas, compatible with the localization of brain regions contributing to interoceptive processes (Kleint et al., 2015). Importantly, Kleint et al. also demonstrated via fMRI that heartbeat sounds, compared to acoustically matched tones, selectively activate bilateral anterior insula and frontal operculum, key hubs of the interoceptive network. This suggests that the semantic identity of the sound as a heartbeat is sufficient to elicit internal body representations, despite its exteroceptive nature. Further evidence comes from van Elk et al. (2014), who found that heartbeat sounds suppress the auditory N1 component, a neural marker of sensory attenuation typically associated with self-generated or predicted stimuli. The authors interpret this as evidence that the brain treats heartbeat sounds as internally predicted bodily signals, supporting interoceptive predictive coding accounts in which exteroceptive cues (i.e., auditory cardiac feedback) are integrated with visceral information to generate coherent internal body representations.

      Finally, it is worth noting that the manipulation of heartrate feedback in our study elicited measurable compensatory changes in participants’ actual heart rate. This is striking compared to our previous work (Parrotta et al., 2024), in which we used a design highly similar to the present one, combined with a very strong threat manipulation. Specifically, we presented participants with highly salient threat cues (knives directed at an anatomical depiction of a heart), which predicted forthcoming pain with 100% validity (compared to flowers that predicted the absence of pain with 100% validity). In other words, these cues perfectly predicted actual pain, through highly visceral stimuli. Nevertheless, we found no measurable decrease in actual heartrate. From an abstract threat perspective, it is therefore striking that the much weaker manipulation of slightly increased or decreased heartrates we used here would induce such a change. The difference therefore suggests that the response here was caused not by an abstract feeling of threat, but by the brain indeed treating the increased heartrate feedback as an interoceptive signal for (stressor-induced) sympathetic activation, which would then be immediately down-regulated.

      Together, we hope you agree that these considerations make a strong case against a non-specific, arousal- or alertness-related explanation of our data. We now make this point clearer in the new paragraph of the Discussion (“Accounting for general unspecific contributions”, lines 796-830), and have added the relevant between-experiment comparisons to the Results of Experiment 2.

      Forte, G., Troisi, G., Pazzaglia, M., Pascalis, V. D., & Casagrande, M. (2022). Heart rate variability and pain: a systematic review. Brain sciences, 12(2), 153.

      Vicentin, S., Guglielmi, S., Stramucci, G., Bisiacchi, P., & Cainelli, E. (2024). Listen to the beat: behavioral and neurophysiological correlates of slow and fast heartbeat sounds. International Journal of Psychophysiology, 206, 112447.

      Kleint, N. I., Wittchen, H. U., & Lueken, U. (2015). Probing the interoceptive network by listening to heartbeats: an fMRI study. PloS one, 10(7), e0133164.

      Parrotta, E., Bach, P., Perrucci, M. G., Costantini, M., & Ferri, F. (2024). Heart is deceitful above all things: Threat expectancy induces the illusory perception of increased heartrate. Cognition, 245, 105719.

      Terkelsen, A. J., Mølgaard, H., Hansen, J., Andersen, O. K., & Jensen, T. S. (2005). Acute pain increases heart rate: differential mechanisms during rest and mental stress. Autonomic Neuroscience, 121(1-2), 101-109.

      Tousignant-Laflamme, Y., Rainville, P., & Marchand, S. (2005). Establishing a link between heart rate and pain in healthy subjects: a gender effect. The journal of pain, 6(6), 341-347.

      van Elk, M., Lenggenhager, B., Heydrich, L., & Blanke, O. (2014). Suppression of the auditory N1-component for heartbeat-related sounds reflects interoceptive predictive coding. Biological psychology, 99, 172-182.

      Several additional, more methodological weaknesses include the very small number of trials per condition - the methods mention 18 test trials per participant for the 3 conditions, with varying pain intensities, which are later averaged (and whether this is appropriate is a different issue). This means 6 trials per condition, and only 2 trials per condition and pain intensity. I thought that this number could be increased, though it is not a huge concern of the paper. It is, however, needed to show some statistics about the distribution of responses, given the very small trial number (see recommendations for authors). The sample size is also rather small, on the verge of "just right" to meet the required sample size according to the authors' calculations.

      We provide detailed responses to these points in the “Recommendations for The Authors” section, where each of these issues is addressed point by point in response to the specific questions raised.

      Finally, and just as important, the data exists to analyze participants' physiological responses (ECG) after receiving the painful stimulus - this could support the authors' claims about the change in both subjective and objective responses to pain. It could also strengthen the physiological evidence, which is rather weak in terms of its effect. Nevertheless, this is missing from the paper.

      This is indeed an interesting point, and we agree that analyzing physiological responses such as ECG following the painful stimulus could offer additional insights into the objective correlates of pain. However, it is important to clarify that the experiment was not designed to investigate post-stimulus physiological responses. Our primary focus was on the anticipatory processes leading up to the pain event. Notably, in the time window immediately following the stimulus - when one might typically expect to observe physiological changes such as an increase in heart rate - participants were asked to provide subjective ratings of their nociceptive experience. It is therefore not a “clean” interval that would lend itself to measurement, especially as a substantial body of evidence indicates that one’s heart rate is strongly modulated by higher-order cognitive processes, including attentional control, executive functioning, decision-making and action itself (e.g., Forte et al., 2021a; Forte et al., 2021b; Luque-Casado et al., 2016).

      This limitation is particularly important as the induced change in pain ratings by our heart rate manipulation is substantially smaller than the changes in heart rate induced by actual pain (e.g., Loggia et al., 2011). To confirm this for our study, we simply estimated how much change in heart rate is produced by a change in actual stimulus intensity in the initial no feedback phase of our experiment. There, we find that a change between stimulus intensities 2 and 4 induces an NPS change of 32.95 and a heart rate acceleration response of 1.19 (difference in heart rate response relative to baseline, Colloca et al., 2006), d = .52, p < .001. The change of NPS induced by our implicit heart rate manipulation, however, is only a seventh of this (4.81 on the NPS). This means that the expected effect size of heart rate acceleration produced by our manipulation would only be d = .17. A power analysis, using G*Power, reveals that a sample size of n = 266 would be required to detect such an effect, if it exists. Thus, while we agree that this is an exciting hypothesis to be tested, it requires a specifically designed study, and a much larger sample than was possible here.
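
      The sample-size argument above can be approximated in a few lines. The sketch below is an illustration only, not the G*Power computation reported in the response: it assumes a within-subject (paired or one-sample) comparison, a two-tailed alpha of .05 and a target power of .80, none of which are stated here, and it uses the normal approximation rather than the noncentral t distribution. The function name `required_n` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size for a two-tailed paired/one-sample test.

    Normal approximation: n = ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    The alpha and power defaults are assumptions, not values from the response.
    """
    if d <= 0:
        raise ValueError("effect size d must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(((z_alpha + z_power) / d) ** 2)


# Values quoted in the response.
nps_change_stimulus = 32.95  # NPS change between stimulus intensities 2 and 4
nps_change_feedback = 4.81   # NPS change induced by the feedback manipulation
ratio = nps_change_stimulus / nps_change_feedback  # roughly 7, i.e. "a seventh"

n_small = required_n(0.17)   # expected effect size stated in the response
n_large = required_n(0.52)   # effect size reported for the stimulus change

print(f"ratio = {ratio:.2f}, n(d=0.17) = {n_small}, n(d=0.52) = {n_large}")
```

      Under these assumptions the estimate for d = .17 lands in the same range of a few hundred participants as the n = 266 reported above; small differences from G*Power are expected because of the approximation and the unstated test settings.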

      Colloca, L., Benedetti, F., & Pollo, A. (2006). Repeatability of autonomic responses to pain anticipation and pain stimulation. European Journal of Pain, 10(7), 659-665.

      Forte, G., Morelli, M., & Casagrande, M. (2021a). Heart rate variability and decision-making: Autonomic responses in making decisions. Brain sciences, 11(2), 243.

      Forte, G., Favieri, F., Oliha, E. O., Marotta, A., & Casagrande, M. (2021b). Anxiety and attentional processes: the role of resting heart rate variability. Brain sciences, 11(4), 480.

      Loggia, M. L., Juneau, M., & Bushnell, M. C. (2011). Autonomic responses to heat pain: Heart rate, skin conductance, and their relation to verbal ratings and stimulus intensity. PAIN®, 152(3), 592-598.

      Luque-Casado, A., Perales, J. C., Cárdenas, D., & Sanabria, D. (2016). Heart rate variability and cognitive processing: The autonomic response to task demands. Biological psychology, 113, 83-90

      I have several additional recommendations regarding data analysis (using an ANOVA rather than multiple t-tests, using raw normalized data rather than change scores, questioning the averaging across 3 pain intensities) - which I will detail in the "recommendations for authors" section.

      We provide detailed responses to these points in the “Recommendations for The Authors” section, where each of these issues is addressed point by point in response to the specific questions raised.

      Conclusion:

      To conclude, the authors have shown in their findings that predictions about an upcoming aversive (pain) stimulus - and its subsequent subjective perception - can be altered not only by external expectations, or manipulating the pain cue, as was done in studies so far, but also by manipulating a cue that has fundamental importance to human physiological status, namely heartbeats. Whether this is a manipulation of actual interoception as sensed by the brain is - in my view - left to be proven.

      Still, the paper has important implications in several fields of science ranging from neuroscience prediction-perception research, to pain and placebo research, and may have implications for clinical disorders, as the authors propose. Furthermore, it may lead - either the authors or someone else - to further test this interesting question of manipulation of interoception in a different or more controlled manner.

      I salute the authors for coming up with this interesting question and encourage them to continue and explore ways to study it and related follow-up questions.

      We sincerely thank the reviewer for the thoughtful and encouraging feedback. We hope our responses to your points below convince you a bit more that what we are measuring does indeed capture interoceptive processes, but we of course fully acknowledge that additional measures - for example from brain imaging (or computational modelling, see Reviewer 3) - could further support our interpretation, and we highlight this in the Limitations and Future directions section.

      Reviewer #2 (Public Review):

      In this manuscript, Parrotta et al. tested whether it is possible to modulate pain perception and heart rate by providing false HR acoustic feedback before administering electrical cutaneous shocks. To this end, they performed two experiments. The first experiment tested whether false HR acoustic feedback alters pain perception and the cardiac anticipatory response. The second experiment tested whether the same perceptual and physiological changes are observed when participants are exposed to a non-interoceptive feedback. The main results of the first experiment showed a modulatory effect for faster HR acoustic feedback on pain intensity, unpleasantness, and cardiac anticipatory response compared to a control (acoustic feedback congruent to the participant's actual HR). However, the results of the second experiment also showed an increase in pain ratings for the faster non-interoceptive acoustic feedback compared to the control condition, with no differences in pain unpleasantness or cardiac response.

      The main strengths of the manuscript are the clarity with which it was written, and its solid theoretical and conceptual framework. The researchers make an in-depth review of predictive processing models to account for the complex experience of pain, and how these models are updated by perceptual and active inference. They follow with an account of how pain expectations modulate physiological responses and draw attention to the fact that most previous studies focus on exteroceptive cues. At this point, they make the link between pain experience and heart rate changes, and introduce their own previous work showing that people may illusorily perceive a higher cardiac frequency when expecting painful stimulation, even though anticipating pain typically goes along with a decrease in HR. From here, they hypothesize that false HR acoustic feedback evokes more intense and unpleasant pain perception, although the actual HR actually decreases due to the orienting cardiac response. Furthermore, they also test the hypothesis that an exteroceptive cue will lead to no (or less) changes in those variables. The discussion of their results is also well-rooted in the existing bibliography, and for the most part, provides a credible account of the findings.

      Thank you for the clear and thoughtful review. We appreciate your positive comments on the manuscript’s clarity, theoretical framework, and interpretation of results.

      The main weaknesses of the manuscript lie in a few choices in methodology and data analysis that hinder the interpretation of the results and the conclusions as they stand.

      The first peculiar choice is the convoluted definition of the outcomes. Specifically, pain intensity and unpleasantness are first normalized and then transformed into variation rates (sic) or deltas, which makes the interpretation of the results unnecessarily complicated. This is also linked to the definitions of the smallest effect of interest (SESOI) in terms of these outcomes, which is crucial to determining the sample size and gauging the differences between conditions. However, the choice of SESOI is not properly justified, and strangely, it changes from the first experiment to the second.

      We thank the reviewer for this important observation. In the revised manuscript, we have made substantial changes and clarifications to address both aspects of this concern: (1) the definition of outcome variables and their normalization, and (2) the definition of the SESOI.

      First, as explained in our response to Reviewer #1, we have revised the analyses and removed the difference-based change scores from the main results, addressing concerns about interpretability. However, we retained the normalization procedure: all variables (heart rate, pain intensity, unpleasantness) are normalized relative to the no-feedback baseline using a standard proportional change formula, (X − bX)/bX, where X is the feedback-phase mean and bX is the no-feedback baseline. This is a widely used normalization procedure (e.g., Bartolo et al., 2013; Cecchini et al., 2020) that controls for interindividual variability by expressing responses relative to each participant’s own baseline. The resulting normalized values are then used directly in all analyses and are not further transformed into deltas.
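
The proportional change described above, (X − bX)/bX, can be illustrated with a minimal sketch. The function name and the heart-rate values are hypothetical and chosen only for illustration; they are not taken from the study, and the baseline term is assumed here to be the mean of the no-feedback trials.

```python
from statistics import mean


def proportional_change(feedback_values, baseline_values):
    """Return (X - bX) / bX, where X is the mean of the feedback-phase values
    and bX is the mean of the no-feedback baseline values (assumed here)."""
    x = mean(feedback_values)
    bx = mean(baseline_values)
    if bx == 0:
        raise ValueError("baseline mean must be non-zero")
    return (x - bx) / bx


# Hypothetical heart rates (beats per minute) for a single participant.
baseline_hr = [70.0, 72.0, 68.0]   # no-feedback trials
feedback_hr = [66.0, 67.0, 65.5]   # trials from one feedback condition

# A negative value means the response fell below the participant's own baseline.
print(proportional_change(feedback_hr, baseline_hr))
```

Because each value is expressed relative to the participant's own baseline, participants with different resting heart rates or different overall rating levels are placed on a common scale.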

      To address potential concerns about this baseline correction approach and its interpretability, we also conducted a new set of supplementary analyses that include the no-feedback condition explicitly in the models, rather than treating it as a baseline for normalization. These models confirm that our main effects are not driven by the choice of normalization and hold even when no-feedback is analyzed as an independent condition. The new analyses and results are now reported in the Supplementary Materials.

      Second, concerning the SESOI values and their justification: the difference in SESOI values between Experiment 1 and Experiment 2 reflects the outcome of sensitivity analyses conducted for each dataset separately, rather than a post-hoc reinterpretation of our results. Specifically, we followed current methodological recommendations (Anderson, Kelley & Maxwell, 2017; Albers & Lakens, 2018; Lakens, 2022), which advise against estimating statistical power based on previously published effect sizes, especially when working with novel paradigms or when effect sizes in the literature may be inflated or imprecise. Instead, we used the sensitivity analysis function in G*Power (Version 3.1) to determine the smallest effect size our design was capable of detecting with high statistical power (90%), given the actual sample size, test type, and alpha level used in each experiment. This is a prospective, design-based estimation rather than a post-hoc analysis of observed effects. The slight difference between experiments arises because more participants were excluded under our exclusion criteria in Experiment 2, which slightly raises the smallest effect size that can be detected (d = 0.62 vs d = 0.57). Importantly, both experiments remain adequately powered to detect effects of a size commonly reported in the literature on top-down pain modulation. For instance, Iodice et al. (2019) reported effects of approximately d = 0.7, which is well above the minimum detectable thresholds of our designs.

      We have now clarified the logic in the Participants section of Experiment 1 (lines 193-218).

      Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychological Science, 28(11), 1547-1562.

      Bartolo, M., Serrao, M., Gamgebeli, Z., Alpaidze, M., Perrotta, A., Padua, L., Pierelli, F., Nappi, G., & Sandrini, G. (2013). Modulation of the human nociceptive flexion reflex by pleasant and unpleasant odors. PAIN®, 154(10), 2054-2059.

      Cecchini, M. P., Riello, M., Sandri, A., Zanini, A., Fiorio, M., & Tinazzi, M. (2020). Smell and taste dissociations in the modulation of tonic pain perception induced by a capsaicin cream application. European Journal of Pain, 24(10), 1946-1955.

      Lakens, D. (2022). Sample size justification. Collabra: psychology, 8(1), 33267.

      Albers, C., & Lakens, D. (2018). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of experimental social psychology, 74, 187-195.

      Furthermore, the researchers propose the comparison of faster vs. slower delta HR acoustic feedback throughout the manuscript when the natural comparison is the incongruent vs. the congruent feedback.

      We very much disagree that the natural comparison is congruent vs incongruent feedback. First, please note that congruency simply refers to whether the heartrate feedback was congruent with (i.e., matched) the participant’s heartrate measurements in the no feedback trials, or whether it was incongruent, and was therefore either faster or slower than this baseline frequency. As such, simply comparing congruent with incongruent feedback could only indicate that pain ratings change when the feedback does not match the real heart rate, irrespective of whether it is faster or slower. Such a test can therefore only reveal potential general effects of surprise or salience, when the feedback heartrate does not match the real one.

      We therefore assume that the reviewer specifically refers to the comparison of congruent vs incongruent faster feedback. However, this is not a good test either, as this comparison is, by necessity, confounded with the factor of surprise described above. In other words, if a difference were found, it would not be clear whether it emerges because, as we assume, faster feedback is represented as an interoceptive signal for threat, or simply because participants are surprised about heart rate feedback that diverges from their real heart rate. Note that even a non-significant result in the analogous comparison of congruent vs incongruent slower feedback would not be able to resolve this confound, as in null hypothesis testing the absence of a significant effect does not, by definition, indicate that there is no effect, only that it could not be detected here.

      Instead, the only possible test of our hypothesis is the one we have designed our experiment around and focussed on with our central t-test: the comparison of incongruent faster with incongruent slower feedback. This keeps any possible effects of surprise/salience from generally altered feedback constant and allows us to test our specific hypothesis: that real heart rates will decrease and pain ratings will increase when receiving false interoceptive feedback about increased compared to decreased heart rates. Note that this test of faster vs slower feedback is also statistically the most appropriate, as it collapses our prediction onto a single and highest-powered hypothesis test: as faster and slower heart rate feedback are assumed to induce effects in opposite directions, the effect size of their difference is, by definition, double the averaged effect size for the two separate tests of faster vs congruent feedback and slower vs congruent feedback.

      That being said, we also included comparisons with the congruent condition in our revised analysis, in line with the reviewer’s suggestion and previous studies. These analyses help explore potential asymmetries in the effect of false feedback. While faster feedback (both interoceptive and exteroceptive) significantly modulated pain relative to congruent feedback, the slower feedback did not, consistent with previous literature showing stronger effects for arousal-increasing cues (e.g., Valins, 1966; Iodice et al., 2019). To address this point, in the revised manuscript we have added a paragraph to the Data Analysis section of Experiment 1 (lines 405-437) to make this logic clearer.

      Valins, S. (1966). Cognitive effects of false heart-rate feedback. Journal of personality and social psychology, 4(4), 400.

      Iodice, P., Porciello, G., Bufalari, I., Barca, L., & Pezzulo, G. (2019). An interoceptive illusion of effort induced by false heart-rate feedback. Proceedings of the National Academy of Sciences, 116(28), 13897-13902.

      This could be influenced by the fact that the faster HR exteroceptive cue in experiment 2 also shows a significant modulatory effect on pain intensity compared to congruent HR feedback, which puts into question the hypothesized differences between interoceptive vs. exteroceptive cues. These results could also be influenced by the specific choice of exteroceptive cue: the researchers imply that the main driver of the effect is the nature of the cue (interoceptive vs. exteroceptive) and not its frequency. However, they attempt to generalize their findings using knocking wood sounds to all possible sounds, but it is possible that some features of these sounds (e.g., auditory roughness or loomingness) could be the drivers behind the observed effects.

      We appreciate this thoughtful comment. We agree that low-level auditory features can potentially introduce confounds in the experimental design, and we acknowledge the importance of distinguishing these factors from the higher-order distinction that is central to our study: whether the sound is perceived as interoceptive (originating from within the body) or exteroceptive (perceived as external). To this end, the knocking sound was chosen not for its specific acoustic profile, but because it lacked bodily relevance, thus allowing us to test whether the same temporal manipulations (faster, congruent, slower) would have different effects depending on whether the cue was interpreted as reflecting an internal bodily state or not. In this context, the exteroceptive cue served as a conceptual contrast rather than an exhaustive control for all auditory dimensions.

      Several aspects of our data make it unlikely that the observed effects are driven by unspecific acoustic characteristics of the sounds used in the exteroceptive and interoceptive experiments (see also our responses to Reviewer 1 and Reviewer 3 who raised similar points).

      First, if the knocking sound had inherent acoustic features that strongly influenced perception or physiological responses, we would expect it to have produced consistent effects across all feedback conditions (Faster, Slower, Congruent), regardless of the interpretive context. This would have manifested as an overall difference between experiments in the between-subjects analyses and in the supplementary mixed-effects models that included Experiment as a fixed factor. Yet, we observed no such main effects in any of our variables. Instead, significant differences emerged only in specific theoretically predicted comparisons (e.g., Faster vs. Slower), and critically, these effects depended on the cue type (interoceptive vs. exteroceptive), suggesting that perceived bodily relevance, rather than a specific acoustic property, was the critical modulator. In other words, any alternative explanation based on acoustic features would need to be able to explain why these acoustic properties would induce not an overall change in heart rate and pain perception (i.e., similarly across slower, faster, and congruent feedback), but the brain’s response to changes in the rate of this feedback – increasing pain ratings and decreasing heartrates for faster relative to slower feedback. We hope you agree that a simple effect of acoustic features would not predict such a sensitivity to the rate with which the sound was played.

      Please refer to our responses to Reviewers 1 and 2 for further aspects of the data that argue strongly against the possibility that other features associated with the sounds (e.g., alertness, arousal) could be responsible for the results, as the data pattern again goes in the opposite direction to that predicted by such accounts (e.g., faster heart rate feedback decreased real heart rates, instead of increasing them, as would be expected if accelerated heart rate feedback increased arousal).

      Finally, to further support this interpretation, we refer to neurophysiological evidence showing that heartbeat sounds are not processed as generic auditory signals, but as internal, bodily relevant cues, especially when believed to reflect one’s own physiological state. For instance, fMRI research (Kleint et al., 2015) shows that heartbeat sounds engage key interoceptive regions such as the anterior insula and frontal operculum more than acoustically matched control tones. EEG data (Vicentin et al., 2024) showed that faster heartbeat sounds produce stronger alpha suppression over frontocentral areas, suggesting enhanced processing in networks associated with interoceptive attention. Moreover, van Elk et al. (2014) found that heartbeat sounds attenuate the auditory N1 response, a neural signature typically linked to self-generated or predicted bodily signals. These findings consistently demonstrate that heartbeat sounds are processed as interoceptive and self-generated signals, which is in line with our rationale that the critical factor at play is whether the sound is semantically perceived as reflecting one’s own bodily state, rather than its physical properties.

      We now explicitly discuss these issues in the revised Discussion section (lines 740-758).

      Kleint, N. I., Wittchen, H. U., & Lueken, U. (2015). Probing the interoceptive network by listening to heartbeats: an fMRI study. PloS one, 10(7), e0133164.

      van Elk, M., Lenggenhager, B., Heydrich, L., & Blanke, O. (2014). Suppression of the auditory N1-component for heartbeat-related sounds reflects interoceptive predictive coding. Biological psychology, 99, 172-182.

      Vicentin, S., Guglielmi, S., Stramucci, G., Bisiacchi, P., & Cainelli, E. (2024). Listen to the beat: behavioral and neurophysiological correlates of slow and fast heartbeat sounds. International Journal of Psychophysiology, 206, 112447.

      Finally, it is noteworthy that the researchers divided the study into two experiments when it would have been optimal to test all the conditions with the same subjects in a randomized order in a single cross-over experiment to reduce between-subject variability. Taking this into consideration, I believe that the conclusions are only partially supported by the evidence. Despite the outcome transformations, a clear effect of faster HR acoustic feedback can be observed in the first experiment, which is larger than the proposed exteroceptive counterpart. This work could be of broad interest to pain researchers, particularly those working on predictive coding of pain.

      We appreciate the reviewer’s suggestion regarding a within-subject crossover design. While such a design indeed offers increased statistical power by reducing interindividual variability (Charness, Gneezy, & Kuhn, 2012), we intentionally opted for a between-subjects design due to theoretical and methodological considerations specific to studies involving deceptive feedback. Most importantly, carryover effects are a major concern in deception paradigms. Participants exposed to one type of feedback initially (e.g., interoceptive), and then the other (exteroceptive) would be more likely to develop suspicion or adaptive strategies that would alter their responses. Such expectancy effects could contaminate results in a crossover design, particularly when participants realize that feedback is manipulated. In line with this idea, past studies on false cardiac feedback (e.g., Valins, 1966; Pennebaker & Lightner, 1980) often employed between-subjects or blocked designs to mitigate this risk.

      Pennebaker, J. W., & Lightner, J. M. (1980). Competition of internal and external information in an exercise setting. Journal of personality and social psychology, 39(1), 165.

      Valins, S. (1966). Cognitive effects of false heart-rate feedback. Journal of personality and social psychology, 4(4), 400.

      Reviewer #3 (Public Review):

      In their manuscript titled "Exposure to false cardiac feedback alters pain perception and anticipatory cardiac frequency", Parrotta and colleagues describe an experimental study on the interplay between false heart rate feedback and pain experience in healthy, adult humans. The experimental design is derived from Bayesian perspectives on interoceptive inference. In Experiment 1 (N=34), participants rated the intensity and unpleasantness of an electrical pulse presented to their middle fingers. Participants received auditory cardiac feedback prior to the electrical pulse. This feedback was congruent with the participant's heart rate or manipulated to have a higher or lower frequency than the participant's true heart rate (incongruent high/ low feedback). The authors find heightened ratings of pain intensity and unpleasantness as well as a decreased heart rate in participants who were exposed to the incongruent-high cardiac feedback. Experiment 2 (N=29) is equivalent to Experiment 1 with the exception that non-interoceptive auditory feedback was presented. Here, mean pain intensity and unpleasantness ratings were unaffected by feedback frequency.

      Strengths:

      The authors present interesting experimental data that was derived from modern theoretical accounts of interoceptive inference and pain processing.

      (1) The motivation for the study is well-explained and rooted within the current literature, whereby pain is the result of a multimodal, inferential process. The separation of nociceptive stimulation and pain experience is explained clearly and stringently throughout the text.

      (2) The idea of manipulating pain-related expectations via an internal, instead of an external cue, is very innovative.

      (3) An appropriate control experiment was implemented, where an external (non-physiological) auditory cue with parallel frequency to the cardiac cue was presented.

      (4) The chosen statistical methods are appropriate, albeit averaging may limit the opportunity for mechanistic insight, see weaknesses section.

      (5) The behavioral data, showing increased unpleasantness and intensity ratings after exposure to incongruent-high cardiac feedback, but not exteroceptive high-frequency auditory feedback, is backed up by ECG data. Here, the decrease in heart rate during the incongruent-high condition speaks towards a specific, expectation-induced physiological effect that can be seen as resulting from interoceptive inference.

      We thank the reviewer for their positive feedback. We are glad that the study’s theoretical foundation, innovative design, appropriate control conditions, and convergence of behavioral and physiological data were well received.

      Weaknesses:

      Additional analyses and/ or more extensive discussion are needed to address these limitations:

      (1) I would like to know more about potential learning effects during the study. Is there a significant change in ∆ intensity and ∆ unpleasantness over time; e.g. in early trials compared to later trials? It would be helpful to exclude the alternative explanation that over time, participants learned to interpret the exteroceptive cue more in line with the cardiac cue, and the effect is driven by a lack of learning about the slightly less familiar cue (the exteroceptive cue) in early trials. In other words, the heartbeat-like auditory feedback might be "overlearned", compared to the less naturalistic tone, and more exposure to the less naturalistic cue might rule out any differences between them w.r.t. pain unpleasantness ratings.

      We thank the reviewer for raising this important point. Please note that the repetitions in our task were relatively limited (6 trials per condition), which limits the potential influence of such differential learning effects between experiments. To address this concern, we performed an additional analysis, reported in the Supplementary Materials, using a Linear Mixed-Effects Model approach. This method allowed us to include "Trial" (the rank order of each trial) as a variable to account for potential time-on-task effects such as learning, adaptation, or fatigue (e.g., Möckel et al., 2015). All feedback conditions (no-feedback, congruent, faster, slower) and all stimulus intensity levels were included.

      Specifically, we tested the following models:

      Likert Pain Unpleasantness Ratings ~ Experiment × Feedback × StimInt × Trial + (StimInt + Trial | Subject)

      Numeric Pain Scale of Intensity Ratings ~ Experiment × Feedback × StimInt × Trial + (StimInt + Trial | Subject)
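
As a reading aid for the two model formulas above, the following minimal sketch enumerates the fixed-effect terms that a fully crossed specification of four predictors implies. It assumes that × denotes full crossing (all main effects plus all interactions, as the `*` operator does in R-style model formulas); the colon notation for interaction terms and the function name are illustrative choices, not taken from the manuscript.

```python
from itertools import combinations

# Fixed-effect predictors named in the model formulas above.
factors = ["Experiment", "Feedback", "StimInt", "Trial"]


def expand_full_factorial(names):
    """List every main effect and interaction implied by fully crossing `names`."""
    terms = []
    for order in range(1, len(names) + 1):
        for combo in combinations(names, order):
            terms.append(":".join(combo))
    return terms


terms = expand_full_factorial(factors)

# Interactions that involve both Trial and Experiment, i.e. the terms that
# would indicate time-on-task effects differing between the two experiments.
trial_by_experiment = [
    t for t in terms
    if "Trial" in t.split(":") and "Experiment" in t.split(":")
]

print(len(terms))
for term in trial_by_experiment:
    print(term)
```

Under this reading, a differential learning effect between the interoceptive and exteroceptive experiments would surface in one of the terms that combine Trial with Experiment, which are the terms examined in the response that follows.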

      In both models, no significant interactions involving Trial × Experiment or Trial × Feedback × Experiment were found. Instead, we just find generally larger effects in early trials compared to later ones (Main effect of Trial within each Experiment), similar to other cognitive illusions where repeated exposure diminishes effects. Thus, although some unspecific changes over time may have occurred (e.g., due to general task exposure), these changes did not differ systematically across experimental conditions (interoceptive vs. exteroceptive) or feedback types. However, we are fully aware that the absence of significant higher-order interactions does not conclusively rule out the possibility of learning-related effects. It is possible that our models lacked the statistical power to detect more subtle or complex time-dependent modulations, particularly if such effects differ in magnitude or direction across feedback conditions.

      We report the full description of these analyses and results in the Supplementary materials 1. Cross-experiment analysis (between-subjects model).

      (2) The origin of the difference in Cohen's d (Exp. 1: .57, Exp. 2: .62) and subsequently sample size in the sensitivity analyses remains unclear, it would be helpful to clarify where these values are coming from (are they related to the effects reported in the results? If so, they should be marked as post-hoc analyses).

      Following recommendations (Anderson, Kelley & Maxwell, 2017; Albers & Lakens, 2018), we do not report theoretical power based on previously reported effect sizes, as this neglects uncertainty around effect size measurements, especially for new effects for which no reliable expected effect size estimates can be derived across the literature. Instead, the power analysis is based on a sensitivity analysis, conducted in G*Power (Version 3.1). Importantly, these are not post-hoc analyses, as they are not based on observed effect sizes in our study, but derived a priori. Sensitivity analyses estimate effect sizes that our design is well-powered (90%) to detect (i.e., given target power, sample size, and type of test), for the crucial comparison between faster and slower feedback in both experiments (Lakens, 2022). Following recommendations, we also report the smallest effect size this test can in principle detect in our study (SESOI, Lakens, 2022). This yields effect sizes of d = .57 in Experiment 1 and d = .62 in Experiment 2 at 90% power and SESOIs of d = .34 and .37, respectively. Note that values are slightly higher in Experiment 2, as more participants were excluded based on our exclusion criteria. Importantly, detectable effect sizes in both experiments are smaller than reported effect sizes for comparable top-down effects on pain measurements of d = .7 (Iodice et al., 2019). We have now added more information to the power analysis sections to make this clearer (lines 208-217).
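
The logic of such a sensitivity analysis can be sketched as follows. This is a rough illustration under stated assumptions, not a reproduction of the G*Power computation: it assumes a two-sided paired-samples test at alpha = .05, uses the sample sizes given in the public review (N = 34 and N = 29), and replaces the noncentral t distribution that G*Power uses with a normal approximation, so the values it prints will differ slightly from those reported above. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_d(n, alpha=0.05, power=0.90):
    """Approximate Cohen's d detectable with the given power in a two-sided
    paired-samples test on n participants (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    z_power = z.inv_cdf(power)           # quantile for the target power
    return (z_alpha + z_power) / sqrt(n)


def critical_d(n, alpha=0.05):
    """Approximate smallest observed d that would reach significance
    (normal approximation to the critical t value)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return z_alpha / sqrt(n)


# Sample sizes as stated in the public review; treated as assumptions here.
for label, n in (("Experiment 1", 34), ("Experiment 2", 29)):
    print(f"{label}: d at 90% power ~ {detectable_d(n):.2f}, "
          f"critical d ~ {critical_d(n):.2f}")
```

Both quantities depend only on the sample size, alpha level, and target power, not on any observed effect, which is why a smaller final sample in Experiment 2 yields slightly larger detectable effect sizes.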

      Albers, C., & Lakens, D. (2018). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of experimental social psychology, 74, 187-195.

      Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychological Science, 28(11), 1547-1562.

      Lakens, D. (2022). Sample size justification. Collabra: psychology, 8(1), 33267.

      (3) As an alternative explanation, it is conceivable that the cardiac cue may have just increased unspecific arousal or attention to a larger extent than the exteroceptive cue. It would be helpful to discuss the role of these rather unspecific mechanisms, and how it may have differed between experiments.

      We thank the reviewer for raising this important point. We agree that, in principle, unspecific mechanisms such as increased arousal or attention driven by cardiac feedback could be an alternative explanation for the observed effects. However, several aspects of our data indicate that this is unlikely:

      (1) No main effect of Experiment on pain ratings:

      If the cardiac feedback had simply increased arousal or attention in a general (non-specific) way, we would expect a main effect of Experiment (i.e., interoceptive vs exteroceptive condition) on pain intensity or unpleasantness ratings, regardless of feedback frequency. However, such a main effect was never observed when we compared between experiments (see between-experiment t-tests in results, and in supplementary analyses). Instead, effects were specific to the manipulation of feedback frequency.

      (2) Heart rate as an arousal measure:

      Heart rate (HR) is a classical physiological index of arousal. If there had been an unspecific increase in arousal in the interoceptive condition, we would expect a main effect of Experiment on HR. However, no such main effect was found. Instead, our HR analyses revealed a significant interaction between feedback and experiment, suggesting that HR changes depended specifically on the feedback manipulation rather than reflecting a general arousal increase.

      (3) Arousal predicts faster, not slower, heart rates

      In Experiment 1, faster interoceptive cardiac feedback led to a slowdown in heart rates both when compared to slower feedback and to congruent cardiac feedback. This is in line with the predicted compensatory response to faster heart rates. In contrast, if faster feedback had only generally increased arousal, heart rates should have increased instead of decreased, as indicated by several prior studies (Tousignant-Laflamme et al., 2005; Terkelsen et al., 2005; for a review, see Forte et al., 2022). Such an account therefore predicts the opposite pattern of responses to the one found in Experiment 1.

      Taken together, these findings indicate that the effects observed are unlikely to be driven by unspecific arousal or attention mechanisms, but rather are consistent with feedback-specific modulations, in line with our interoceptive inference framework.

      We have now integrated these considerations in the revised discussion (lines 796-830), and added the relevant between-experiment comparisons to the Results of Experiment 2 and the supplementary analysis.

      Terkelsen, A. J., Mølgaard, H., Hansen, J., Andersen, O. K., & Jensen, T. S. (2005). Acute pain increases heart rate: differential mechanisms during rest and mental stress. Autonomic Neuroscience, 121(1-2), 101-109.

      Tousignant-Laflamme, Y., Rainville, P., & Marchand, S. (2005). Establishing a link between heart rate and pain in healthy subjects: a gender effect. The journal of pain, 6(6), 341-347.

      Forte, G., Troisi, G., Pazzaglia, M., Pascalis, V. D., & Casagrande, M. (2022). Heart rate variability and pain: a systematic review. Brain sciences, 12(2), 153.

      (4) The hypothesis (increased pain intensity with incongruent-high cardiac feedback) should be motivated by some additional literature.

      We thank the reviewer for this helpful suggestion. Please note that the current phenomenon was tested in this experiment for the first time. Therefore, there is no specific prior study that motivated our hypotheses; they were driven theoretically, and derived from our model of interoceptive integration of pain and cardiac perception. The idea that accelerated cardiac feedback (relative to decelerated feedback) will increase pain perception and reduce heart rates is grounded in embodied predictive coding frameworks. According to these frameworks, expectations and signals from different sensory modalities (sensory, proprioceptive, interoceptive) are integrated both to efficiently infer crucial homeostatic and physiological variables, such as hunger, thirst, and, in this case, pain, and to regulate the body’s own autonomic responses based on these inferences.

      Within this framework, the concept of an interoceptive schema (Tschantz et al., 2022; Iodice et al., 2019; Parrotta et al., 2024; Schoeller et al., 2022) offers the basis for understanding interoceptive illusions, wherein inferred levels of interoceptive states (i.e., pain) deviate from the actual physiological state. Cardiac signals conveyed by the feedback manipulation act as a misleading prior, shaping the internal generative model of pain. Specifically, an increased heart rate may signal a state of threat, establishing a prior expectation of heightened pain. Building on predictive models of interoception, we predict that this cardiac prior is integrated with interoceptive (i.e., actual nociceptive signal) and exteroceptive inputs (i.e., auditory feedback input), leading to a subjective experience of increased pain even when there is no corresponding increase in the nociceptive input.

      This idea is not completely new; it builds on our previous findings of an interoceptive cardiac illusion driven by misleading priors about anticipated threat (i.e., pain). Specifically, in Parrotta et al. (2024), we tested whether a common false belief that heart rate increases in response to threat leads to an illusory perception of accelerated cardiac activity when anticipating pain. In two experiments, we asked participants to monitor and report their heartbeat while their ECG was recorded. Participants performed these tasks while visual cues reliably predicted a forthcoming harmless (low-intensity) vs. threatening (high-intensity) cutaneous electrical stimulus. We showed that anticipating a painful vs. harmless stimulus causes participants to report an increased cardiac frequency, which does not reflect their real cardiac response, but the common (false) belief that heart rates would accelerate under threat, reflecting the hypothesised integration of prior expectations and interoceptive inputs when estimating cardiac activity.

      Here we tested the counterpart of such a cardiac illusion. We reasoned that if cardiac interoception is shaped by expectations about pain, then the inverse should also be true: manipulating beliefs about cardiac activity (via cardiac feedback) in the context of pain anticipation should influence the perception of pain. Specifically, we hypothesized that presenting accelerated cardiac feedback would act as a misleading prior, leading to an illusory increase in pain experience, even in the absence of an actual change in nociceptive input.

      Moreover, in addition to the references already provided in the last version of the manuscript, there is ample prior research that provides more general support for such relationships. Specifically, studies have shown that providing mismatched cardiac feedback in contexts where cardiovascular changes are typically expected (e.g., sexual arousal, Rupp & Wallen, 2008; Valins, 1966; physical exercise, Iodice et al., 2019) can enhance the perception of interoceptive states associated with those experiences. Furthermore, findings that false cardiac feedback can influence emotional experience suggest that it is the conscious perception of physiological arousal, combined with the cognitive interpretation of the stimulus, that plays a key role in shaping emotional responses (Crucian et al., 2000).

      This point is now addressed in the revised Introduction, wherein additional references have been integrated (lines 157-170).

      Crucian, G. P., Hughes, J. D., Barrett, A. M., Williamson, D. J. G., Bauer, R. M., Bowers, D., & Heilman, K. M. (2000). Emotional and physiological responses to false feedback. Cortex, 36(5), 623-647.

      Iodice, P., Porciello, G., Bufalari, I., Barca, L., & Pezzulo, G. (2019). An interoceptive illusion of effort induced by false heart-rate feedback. Proceedings of the National Academy of Sciences, 116(28), 13897-13902.

      Parrotta, E., Bach, P., Perrucci, M. G., Costantini, M., & Ferri, F. (2024). Heart is deceitful above all things: Threat expectancy induces the illusory perception of increased heartrate. Cognition, 245, 105719.

      Rupp, H. A., & Wallen, K. (2008). Sex differences in response to visual sexual stimuli: A review. Archives of sexual behavior, 37(2), 206-218.

      Schoeller, F., Horowitz, A., Maes, P., Jain, A., Reggente, N., Moore, L. C., Trousselard, M., Klein, A., Barca, L., & Pezzulo, G. (2022). Interoceptive technologies for clinical neuroscience.

      Tschantz, A., Barca, L., Maisto, D., Buckley, C. L., Seth, A. K., & Pezzulo, G. (2022). Simulating homeostatic, allostatic and goal-directed forms of interoceptive control using active inference. Biological Psychology, 169, 108266.

      Valins, S. (1966). Cognitive effects of false heart-rate feedback. Journal of personality and social psychology, 4(4), 400.

      (5) The discussion section does not address the study's limitations in a sufficient manner. For example, I would expect a more thorough discussion on the lack of correlation between participant ratings and self-reported bodily awareness and reactivity, as assessed with the BPQ.

      We thank the reviewer for this valuable observation. In response, we have revised the Discussion section to explicitly acknowledge and elaborate on the lack of significant correlations between participants’ pain ratings and their self-reported bodily awareness and reactivity as assessed with the BPQ.

      We now clarify that the inclusion of this questionnaire was exploratory. While it would be theoretically interesting to observe a relationship between subjective pain modulation and individual differences in interoceptive awareness, detecting robust correlations between within-subject experimental effects and between-subjects trait measures such as the BPQ typically requires much larger sample sizes (often exceeding N = 200) due to the inherently low reliability of such cross-level associations (see Hedge, Powell & Sumner, 2018; the “reliability paradox”). As such, the absence of a significant correlation in our study does not undermine the conclusions we draw from our main findings. Future studies with larger samples will be needed to systematically address this question. We now acknowledge this point explicitly in the revised manuscript (lines 501-504; 832-851).

      Hedge, C., Powell, G., & Sumner, P. (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50(3), 1166-1186. https://doi.org/10.3758/s13428-017-0935-1

      (a) Some short, additional information on why the authors chose to focus on body awareness and supradiaphragmatic reactivity subscales would be helpful.

      We chose to focus on the body awareness and supradiaphragmatic reactivity subscales because these aspects are closely tied to emotional and physiological processing, particularly in the context of interoception. Body awareness plays a critical role in how individuals perceive and interpret bodily signals, which in turn affects emotional regulation and self-awareness. Supradiaphragmatic reactivity refers specifically to the reactivity of organs located above the diaphragm (i.e., the muscle that separates the chest cavity from the abdomen), including the heart, in contrast to the subdiaphragmatic reactivity subscale, which concerns organs further down. Our decision to include these subscales is further motivated by recent research, including the work by Petzschner et al. (2021), which demonstrates that the focus of attention can modulate the heartbeat-evoked potential (HEP), and that this modulation is predicted by participants’ responses on the supradiaphragmatic reactivity subscale. Thus, this subscale and the more general body awareness scale allow us to explore the interplay between bodily awareness, physiological reactivity, and emotional processing in our study. We now clarify this point in the revised version of the Methods - Body Perception Questionnaire (lines 384-393).

      (6) The analyses presented in this version of the manuscript allow only limited mechanistic conclusions - a computational model of participants' behavior would be a very strong addition to the paper. While this may be out of the scope of the article, it would be helpful for the reader to discuss the limitations of the presented analyses and outline avenues towards a more mechanistic understanding and analysis of the data. The computational model in [7] might contain some starting ideas.

      Thank you for your valuable feedback. We agree that a computational model would enhance the mechanistic understanding of our findings. While this is beyond the current scope, we now discuss the limitations of our analysis in the Limitations and Future directions section (lines 852-863). Specifically, we acknowledge that future studies could use computational models to better understand the interactions between physiological, cognitive, and perceptual factors.

      Some additional topics were not considered in the first version of the manuscript:

      (1) The possible advantages of a computational model of task behavior should be discussed.

      We agree that a computational model of task behavior could provide several advantages. By formalizing principles of predictive processing and active inference, such a model could generate quantitative predictions about how heart rate (HR) and feedback interact, providing a more precise understanding of their respective contributions to pain modulation. However, this is a first demonstration of a theoretically predicted phenomenon, and computationally modelling it is currently outside the scope of the article. We would be excited to explore this in the future. We have added a brief discussion of these potential advantages in the revised manuscript and suggest that future work could integrate computational modelling to further deepen our understanding of these processes (lines 852-890).

      (2) Across both experiments, there was a slightly larger number of female participants. Research suggests significant sex-related differences in pain processing [1,2]. It would be interesting to see what role this may have played in this data.

      Thank you for your insightful comment. While we acknowledge that sex-related differences in pain processing are well-documented in the literature, we do not have enough participants in our sample to test this in a well-powered way. As such, exploring the role of sex differences in pain perception will need to be addressed in future studies with more balanced samples. It would be interesting if more sensitive individuals, with a more precise representation of pain, also show smaller effects on pain perception. We have noted this point in the revised manuscript (lines 845-851) and suggest that future research could specifically investigate how sex differences might influence the modulation of pain and physiological responses in similar experimental contexts.

      (3) There are a few very relevant papers that come to mind which may be of interest. These sources might be particularly useful when discussing the roadmap towards a mechanistic understanding of the inferential processes underlying the task responses [3,4] and their clinical implications.

      Thank you for highlighting these relevant papers. We appreciate your suggestion and have now cited them in the Limitations and Future directions paragraph (lines 852-863).

      (4) In this version of the paper, we only see plots that illustrate ∆ scores, averaged across pain intensities - to better understand participant responses and the relationship with stimulus intensity, it would be helpful to see a more descriptive plot of task behavior (e.g. stimulus intensity and raw pain ratings)

      To directly address the reviewer’s request, we now provide additional descriptive plots in the supplementary material of the revised manuscript, showing raw pain ratings across different stimulus intensities and feedback conditions. These plots offer a clearer view of participant behavior without averaging across pain levels, helping to better illustrate the relationship between stimulus intensity and reported pain.

      Mogil, J. S. (2020). Qualitative sex differences in pain processing: emerging evidence of a biased literature. Nature Reviews Neuroscience, 21(7), 353-365. https://www.nature.com/articles/s41583-020-0310-6

      Sorge, R. E., & Strath, L. J. (2018). Sex differences in pain responses. Current Opinion in Physiology, 6, 75-81. https://www.sciencedirect.com/science/article/abs/pii/S2468867318300786?via%3Dihub

      Unal, O., Eren, O. C., Alkan, G., Petzschner, F. H., Yao, Y., & Stephan, K. E. (2021). Inference on homeostatic belief precision. Biological Psychology, 165, 108190.

      Allen, M., Levy, A., Parr, T., & Friston, K. J. (2022). In the body's eye: the computational anatomy of interoceptive inference. PLoS Computational Biology, 18(9), e1010490.

      Stephan, K. E., Manjaly, Z. M., Mathys, C. D., Weber, L. A., Paliwal, S., Gard, T., ... & Petzschner, F. H. (2016). Allostatic self-efficacy: A metacognitive theory of dyshomeostasis-induced fatigue and depression. Frontiers in human neuroscience, 10, 550.

      Friston, K. J., Stephan, K. E., Montague, R., & Dolan, R. J. (2014). Computational psychiatry: the brain as a phantastic organ. The Lancet Psychiatry, 1(2), 148-158.

      Eckert, A. L., Pabst, K., & Endres, D. M. (2022). A Bayesian model for chronic pain. Frontiers in Pain Research, 3, 966034.

      We thank the reviewer for highlighting these relevant references which have now been integrated in the revised version of the manuscript.

      Recommendations For The Authors: 

      Reviewer #1 (Recommendations For The Authors):

      At the time I was reviewing this paper, I could not think of a detailed experiment that would answer my biggest concern: Is this a manipulation of the brain's interoceptive data integration, or rather a manipulation of participants' alertness which indirectly influences their pain prediction?

      One incomplete idea that came to mind was delivering this signal in a more "covert" manner (though I am not sure it will suffice), or perhaps correlating the effect size of a participant with their interoceptive abilities, as measured in a different task or through a questionnaire.... Another potential idea is to tell participants that this is someone else's HR that they hear and see if that changes the results (though requires further thought). I leave it to the authors to think further, and perhaps this is to be answered in a different paper - but if so, I am sorry to say that I do not think the claims can remain as they are now, and the paper will need a revision of its arguments, unfortunately. I urge the authors to ask further questions if my point about the concern was not made clear enough for them to address or contemplate it.

      We thank the reviewer for raising this important point. As detailed in our previous response, this point invites an important clarification regarding the role of cardiac deceleration in threat processing. Rather than serving as an interoceptive input from which the brain infers the likelihood of a forthcoming aversive event, heart rate deceleration is better described as an output of an already ongoing predictive process, as it reflects an allostatic adjustment of the bodily state aimed at minimizing the impact of the predicted perturbation (e.g., pain) and preventing sympathetic overshoot. It would be maladaptive for the brain to use a decelerating heart rate as evidence of impending threat, since this would paradoxically trigger further parasympathetic activation, initiating a potentially destabilizing feedback loop. Conversely, increased heart rate represents an evolutionarily conserved cue for arousal, threat, and pain. Our results therefore align with the idea that the brain treats externally manipulated increases in cardiac signals as congruent with anticipated sympathetic activation, prompting a compensatory autonomic and perceptual response consistent with embodied predictive processing frameworks (e.g., Barrett & Simmons, 2015; Seth, 2013).

      We would also like to reiterate that our results cannot be explained by general differences induced by the heart rate sounds relative to the exteroceptive sounds (see also our detailed comments to your point above, and our response to a similar point from Reviewer 3), for three main reasons.

      (1) No main effect of Experiment on pain ratings:

      If the cardiac feedback had simply increased arousal or attention in a general (non-specific) way, we would expect a main effect of Experiment (i.e., interoceptive vs exteroceptive condition) on pain intensity or unpleasantness ratings, regardless of feedback frequency. However, such a main effect was never observed. Instead, effects were specific to the manipulation of feedback frequency.

      (2) Heart rate as an arousal measure:

      Heart rate (HR) is a classical physiological index of arousal. If there had been an unspecific increase in arousal in the interoceptive condition, we would expect a main effect of Experiment on HR. However, no such main effect was found. Instead, our HR analyses revealed a significant interaction between feedback and experiment, suggesting that HR changes depended specifically on the feedback manipulation rather than reflecting a general arousal increase.

      (3) Arousal predicts faster, not slower, heart rates:

      In Experiment 1, faster interoceptive cardiac feedback led to a slowdown in heart rate both when compared to slower feedback and to congruent cardiac feedback. This is in line with the predicted compensatory response to faster heart rates. In contrast, if faster feedback had merely increased general arousal, heart rates should have increased rather than decreased, as indicated by several prior studies (for a review, see Forte et al., 2022). This account therefore predicts the opposite of the pattern found in Experiment 1.

      Taken together, these findings indicate that the effects observed are unlikely to be driven by unspecific arousal or attention mechanisms, but rather are consistent with feedback-specific modulations, in line with our interoceptive inference framework. We now integrate these considerations in the general discussion (lines 796-830).

      Barrett, L. F., & Simmons, W. K. (2015). Interoceptive predictions in the brain. Nature reviews neuroscience, 16(7), 419-429.

      Forte, G., Troisi, G., Pazzaglia, M., De Pascalis, V., & Casagrande, M. (2022). Heart rate variability and pain: a systematic review. Brain sciences, 12(2), 153.

      Seth, A. K. (2013). Interoceptive inference, emotion, and the embodied self. Trends in Cognitive Sciences, 17(11), 565-573.

      Additional recommendations:

      Major (in order of importance):

      (1) Number of trials per participant, per condition: as I mentioned, having only 6 trials for each condition is very little. The minimum requirement to accept so few trials would be to show data about the distribution of participants' responses to these trials, both per pain intensity (which was later averaged across - another issue discussed later), and across pain intensities, and see that it allows averaging across and that it is not incredibly variable such that the mean is unreliable.

      We appreciate the reviewer’s concern regarding the limited number of trials per condition. This choice was driven by both theoretical and methodological considerations.

      First, as is common in body illusion paradigms (e.g., the Rubber Hand Illusion, Botvinick & Cohen, 1998; the Full Body Illusion, Ehrsson, 2007; the Cardio-visual full body illusion, Pratviel et al., 2022) only a few trials are typically employed due to the immediate effects these manipulations elicit. Repetition can reduce the strength of the illusion through habituation, increased awareness, or loss of believability.

      Second, the experiment was already quite long (1.5h to 2h per participant) and cognitively demanding. It would not have been feasible to expand it further without compromising data quality due to fatigue, attentional decline, or participant disengagement.

      Third, the need for a large number of trials is more relevant when using implicit measures such as response times or physiological indices, which are typically indirectly related to the psychological constructs of interest. In contrast, explicit ratings are often more sensitive and less noisy, and thus require fewer repetitions to yield reliable effects (e.g., Corneille & Gawronski, 2024).

      Importantly, we also addressed your concern analytically. We therefore ran linear mixed-effects model analyses across all dependent variables (see Supplementary materials), with Trial (i.e., the rank order of each trial) included as a predictor to account for potential time-on-task effects such as learning, adaptation, or fatigue (e.g., Möckel et al., 2015). These models captured trial-by-trial variability and allowed us to test for systematic changes in heart rate (HR) and pain ratings, including interactions with feedback conditions (e.g., Kliegl et al., 2011; Baayen & Milin, 2010; Ambrosini et al., 2023). The consistent effects of Trial suggest that repetition dampens the illusion, reinforcing our decision to limit the number of exposures.

      In the interoceptive experiment, these analyses revealed a significant Feedback × Trial interaction (F(3, 711.19) = 6.16, p < .001), indicating that the effect of feedback on HR was not constant over time. As we suspected, and in line with other illusion-like effects, the difference between Faster and Slower feedback, which was significant early on (estimate = 1.68 bpm, p = .0007), decreased by mid-session (estimate = 0.69 bpm, p = .0048), and was no longer significant in later trials (estimate = 0.30 bpm, p = .4775). At the end of the session, HR values in the Faster and Slower conditions even numerically converged (Faster: M = 74.4, Slower: M = 74.1), and the non-significant contrast confirms that the difference had effectively vanished (for further details about slope estimation, see Supplementary material).

      The same pattern emerged for pain-unpleasantness ratings. A significant Feedback × Trial interaction (F(3, 675.33) = 3.44, p = .0165) revealed that the difference between Faster and Slower feedback was strongest at the beginning of the session and progressively weakened. Specifically, Faster feedback produced higher unpleasantness than Slower in early trials (estimate = -0.28, p = .0058) and mid-session (estimate = -0.19, p = .0001), but this contrast was no longer significant in the final trials, where all the differences between active feedback conditions vanished (all ps > .55).

      Finally, pain intensity ratings yielded similar results. A significant Feedback × Trial interaction (F(3, 669.15) = 9.86, p < .001) showed that the Faster vs Slower difference was greatest at the start of the session and progressively vanished over trials. In early trials Faster feedback exceeded Slower (estimate = -8.33, p = .0001); by mid-session this gap had shrunk to 4.48 points (p < .0001); and in the final trials it was no longer significant (all ps > .94).

      Taken together, our results show that the illusion induced by Faster relative to Slower feedback fades with repetition; adding further trials would likely have masked this key effect, which supports the methodological choice to restrict each condition to fewer exposures. To conclude, given that this is the first study to investigate an illusion of pain using heartbeat-based manipulation, we intentionally limited repeated exposures to preserve the integrity of the illusion. The use of mixed models as complementary analyses strengthens the reliability of our conclusions within these necessary design constraints. We now clarify this point in the Procedure paragraph (lines 328-335).
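
      The fading of the Faster vs Slower difference over trials can be illustrated with a deliberately simplified sketch. This is not the linear mixed-effects model reported in the Supplementary materials: it is a two-stage summary in which an ordinary least-squares slope of the Faster minus Slower difference on trial rank is computed for each participant and then averaged. The participant labels, the ratings, and the helper name `ols_slope` are hypothetical and serve only to show the logic.

```python
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical data: Faster minus Slower pain rating at each trial rank (1..6).
# These numbers are invented for illustration, not taken from the study.
trial_rank = [1, 2, 3, 4, 5, 6]
diff_by_participant = {
    "p01": [9.0, 7.5, 6.0, 3.5, 2.0, 0.5],
    "p02": [6.0, 6.5, 4.0, 3.0, 1.5, 1.0],
    "p03": [8.0, 5.0, 5.5, 2.0, 1.0, -0.5],
}

# Stage 1: one slope per participant (change in the difference per trial).
slopes = {pid: ols_slope(trial_rank, d) for pid, d in diff_by_participant.items()}

# Stage 2: average slope across participants; a negative value means the
# Faster vs Slower difference shrinks as trials accumulate.
mean_slope = mean(slopes.values())
print(slopes)
print(mean_slope)
```

      A negative average slope corresponds to the pattern described above, in which the contrast is largest early in the session and shrinks with repetition. A mixed model additionally pools information across participants and estimates the interaction with its uncertainty, which this sketch does not attempt.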

      Ambrosini, E., Peressotti, F., Gennari, M., Benavides-Varela, S., & Montefinese, M. (2023). Aging-related effects on the controlled retrieval of semantic information. Psychology and Aging, 38(3), 219.

      Baayen, R. H., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3(2), 12-28.

      Botvinick, M., & Cohen, J. (1998). Rubber hands ‘feel’ touch that eyes see. Nature, 391(6669), 756-756.

      Corneille, O., & Gawronski, B. (2024). Self-reports are better measurement instruments than implicit measures. Nature Reviews Psychology, 3(12), 835–846.

      Ehrsson, H. H. (2007). The experimental induction of out-of-body experiences. Science, 317(5841), 1048-1048.

      Kliegl, R., Wei, P., Dambacher, M., Yan, M., & Zhou, X. (2011). Experimental effects and individual differences in linear mixed models: Estimating the relation of spatial, object, and attraction effects in visual attention. Frontiers in Psychology, 1, 238. https://doi.org/10.3389/fpsyg.2010.00238

      Möckel, T., Beste, C., & Wascher, E. (2015). The effects of time on task in response selection-an ERP study of mental fatigue. Scientific reports, 5(1), 10113.

      Pratviel, Y., Bouni, A., Deschodt-Arsac, V., Larrue, F., & Arsac, L. M. (2022). Avatar embodiment in VR: Are there individual susceptibilities to visuo-tactile or cardio-visual stimulations?. Frontiers in Virtual Reality, 3, 954808.

      (2) Using different pain intensities: what was the purpose of training participants on correctly identifying pain intensities? You state that the aim of having 5 intensities is to cause ambiguity. What is the purpose of making sure participants accurately identify the intensities? Also, why then only 3 intensities were used in the test phase? The rationale for these is lacking.

      We thank the reviewer for raising these important points regarding the use of different pain intensities. The purpose of using five levels during the calibration and training phases was to introduce variability and increase ambiguity in the participants’ sensory experience. This variability aimed to reduce predictability and prevent participants from forming fixed expectations about stimulus intensity, thereby enhancing the plausibility of the illusion. It also helped prevent habituation to a single intensity and made the manipulation subtler and more credible. We had no specific theoretical hypotheses about this manipulation. Regarding the accuracy training, although the paradigm introduced ambiguity, it was important to ensure that participants developed a stable and consistent internal representation of the pain scale. This step was essential to control for individual differences in sensory discrimination and to ensure that illusion effects were not confounded by participants’ inability to reliably distinguish between intensities.

      As for the use of only three pain intensities in the test phase, the rationale was to focus on a manageable subset that still covered a meaningful range of the stimulus spectrum. This approach followed the same logic as Iodice et al. (2019, PNAS), who used five (rather than all seven) intensity levels during their experimental session. Specifically, they excluded the extreme levels (45 W and 125 W) used during baseline, to avoid floor and ceiling effects and to ensure that each test intensity could be paired with both a “slower” and a “faster” feedback from an adjacent level. This would not have been possible at the extremes of the intensity range, where no adjacent level exists in one direction. We adopted the same strategy to preserve the internal consistency and plausibility of our feedback manipulation.

      We further clarified these points in the revised manuscript (lines 336-342).

      Iodice, P., Porciello, G., Bufalari, I., Barca, L., & Pezzulo, G. (2019). An interoceptive illusion of effort induced by false heart-rate feedback. Proceedings of the National Academy of Sciences, 116(28), 13897-13902.

      (3) Averaging across pain intensities: this is, in my opinion, not the best approach as by matching a participant's specific responses to a pain stimulus before and after the manipulation, you can more closely identify changes resulting from the manipulation. Nevertheless, the minimal requirement to do so is to show data of distributions of pain intensities so we know they did not differ between conditions per participant, and in general - as you indicate they were randomly distributed.

      We thank the reviewer for this thoughtful comment. The decision to average across pain intensities in our main analyses was driven by the specific aim of the study: we did not intend to determine at which exact intensity level the illusion was most effective, and the limited number of trials makes such an analysis difficult. Rather, we introduced variability in nociceptive input to increase ambiguity and reduce predictability in the participants’ sensory experience. This variability was critical for enhancing the plausibility of the illusion by preventing participants from forming fixed expectations about stimulus strength. Additionally, using a range of intensities helped to minimize habituation effects and made the feedback manipulation subtler and more credible.

      That said, we appreciate the reviewer’s point that matching specific responses before and after the manipulation at each intensity level could provide further insights into how the illusion operates across varying levels of nociceptive input. We therefore conducted supplementary analyses using linear mixed-effects models in which all three stimulus intensities were included as a continuous fixed factor. This allowed us to examine whether the effects of feedback were intensity-specific or generalized across different levels of stimulation.

      These analyses revealed that, in both the interoceptive and exteroceptive experiments, the effect of feedback on pain ratings was significantly modulated by stimulus intensity, as indicated by a Feedback × Stimulus Intensity interaction (Interoceptive: unpleasantness F(3, 672.32)=3.90, p=.0088; intensity ratings F(3, 667.07)=3.46, p=.016. Exteroceptive: unpleasantness F(3, 569.16)=8.21, p<.0001; intensity ratings F(3, 570.65)=3.00, p=.0301). The interaction term confirmed that the impact of feedback varied with stimulus strength, yet the pattern that emerged in each study diverged markedly.

      In the interoceptive experiment, the accelerated-heartbeat feedback (Faster) systematically heightened pain relative to the decelerated version (Slower) at every level of noxious input: for low-intensity trials Faster exceeded Slower by 0.22 ± 0.08 points on the unpleasantness scale (t = 2.84, p = .0094) and by 3.87 ± 1.69 units on the numeric intensity scale (t = 2.29, p = .0448); at the medium intensity the corresponding differences were 0.19 ± 0.05 (t = -4.02, p = .0001) and 4.52 ± 1.06 (t = 4.28, p < .0001); and even at the highest intensity, Faster still surpassed Slower by 0.17 ± 0.08 on unpleasantness (t = 2.21, p = .0326) and by 5.16 ± 1.67 on intensity (t = 3.09, p = .0032). This uniform Faster > Slower pattern indicates that the interoceptive manipulation amplifies perceived pain in a stimulus-independent fashion.

      The exteroceptive control experiment told a different story: the Faster-Slower contrast reached significance only at the most noxious setting (unpleasantness: estimate = 0.24 ± 0.07, t = -3.24, p = .0019; intensity: estimate = -5.14 ± 1.82, t = 2.83, p = .0072) and was absent at the medium level (intensity, p = .29; unpleasantness, p = .45), while at the lowest level Slower actually produced numerically higher unpleasantness (2.56 versus 2.40) and intensity ratings (44.7 versus 42.2).

      Thus, although both studies show that feedback effects depend on the actual nociceptive level of the stimulus, the results suggest that the faster vs. slower interoceptive feedback manipulation delivers a robust and intensity-invariant enhancement of pain, whereas the exteroceptive cue exerts a sporadic influence that surfaces solely under maximal stimulation.

      These new results are now included in the Supplementary Materials, where we report the detailed analyses for both the Interoceptive and Exteroceptive experiments on the Likert unpleasantness ratings and the numeric pain intensity ratings.

      (4) Sample size: It seems that the sample size was determined after the experiment was conducted, as the required N is identical to the actual N. I would be transparent about that, and say that retrospective sample size analyses support the ability of your sample size to support your claims. In general, a larger sample size than is required is always recommended, and if you were to run another study, I suggest you increase the sample size.

      As also addressed in our responses to your later comments (see our detailed reply regarding the justification of SESOI and power analyses), the power analyses reported here were not post-hoc power analyses based on obtained results. In line with current recommendations (Anderson, Kelley & Maxwell, 2017; Albers & Lakens, 2018), we did not base our analyses on previously reported effect sizes, as these can carry considerable uncertainty, particularly for novel effects where robust estimates are lacking. Instead, we used sensitivity analyses, conducted using the sensitivity analysis function in G*Power (Version 3.1). Sensitivity analyses allow us to report effect sizes that our design was adequately powered (90%) to detect, given the actual sample size, desired power level, and the statistical test used in each experiment (Lakens, 2022). Following further guidance (Lakens, 2022), we also report the smallest effect size of interest (SESOI) that these tests could reliably detect.

      This approach indicated that our design was powered to detect effect sizes of d = 0.57 in Experiment 1 and d = 0.62 in Experiment 2, with corresponding SESOIs of d = 0.34 and d = 0.37, respectively. The slightly higher value in Experiment 2 reflects the greater number of participants excluded (from an equal number originally tested) based on pre-specified criteria. Importantly, both experiments were well-powered to detect effects smaller than those typically reported in similar top-down pain modulation studies, where effect sizes around d = 0.7 have been observed (Iodice et al., 2019).
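
      The logic of a sensitivity analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of the G*Power computation: it uses a normal approximation for a two-tailed paired (one-sample) test, whereas G*Power works with the exact noncentral t distribution, so the values below will be slightly smaller than the exact ones. The function name `min_detectable_d` and the sample sizes in the loop are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n, alpha=0.05, power=0.90):
    """Approximate smallest standardized effect (Cohen's d_z) detectable
    with n paired observations, two-tailed alpha, and the given power.

    Normal approximation: d = (z_{1 - alpha/2} + z_{power}) / sqrt(n).
    """
    z = NormalDist()  # standard normal distribution
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n)


# Hypothetical sample sizes, chosen only to show how sensitivity changes with n.
for n in (20, 30, 40, 60):
    print(n, round(min_detectable_d(n), 3))
```

      The sketch makes explicit that, for a fixed alpha and power, the detectable effect size shrinks with the square root of the sample size, which is why excluding more participants in one experiment yields a slightly larger detectable effect there.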

      We have now clarified this rationale in the revised manuscript, Experiment 1 - Methods - Participants (lines 208-217).

      Albers, C., & Lakens, D. (2018). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of experimental social psychology, 74, 187-195.

      Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychological Science, 28(11), 1547-1562. https://doi.org/10.1177/0956797617723724

      Lakens, D. (2022). Sample size justification. Collabra: psychology, 8(1), 33267.

      (5) Analysis: the use of change scores instead of the actual scores is not recommended, as it is a loss of data, but could have been ignored if it didn't have a significant effect on the analyses conducted. Instead of conducting an RM-ANOVA of conditions (faster, slower, normal heartbeats) across participants, finding significant interaction, and then moving on to specific post-hoc paired comparisons between conditions, the authors begin with the change score but then move on to conduct the said paired comparisons without ever anchoring these analyses in an appropriate larger ANOVA. I strongly recommend the use of an ANOVA but if not, the authors would have to correct for multiple comparisons at the minimum.

      We thank the reviewer for their comment regarding the use of change scores. These were originally derived from the difference between the slower and faster feedback conditions relative to the congruent condition. In line with the reviewer’s recommendation, we have now removed these difference-based change scores from the main analysis. The results remain identical. Please note that we have retained the normalization procedure, relative to each participant’s initial baseline in the no feedback trials, as it is widely used in the interoceptive and pain literature (e.g., Bartolo et al., 2013; Cecchini et al., 2020; Riello et al., 2019). This approach helps to control for interindividual variability and baseline differences by expressing each participant’s response relative to their no-feedback baseline. As before, normalization was applied across all dependent variables (heart rate, pain intensity, and pain unpleasantness).

      To address the reviewer’s concern about statistical validity, we now first report a 1-factor repeated-measures ANOVA (Greenhouse-Geisser corrected) for each dependent variable, with feedback condition (slower, congruent, faster) as the within-subject factor.

      These show in each case a significant main effect, which we then follow with planned paired-sample t-tests comparing:

      Faster vs. slower feedback (our main hypothesis, as this contrast is expected to produce the largest difference and therefore the most powerful test of our hypothesis; see response to Reviewer 3),

      Faster vs. congruent and slower vs. congruent (to test for potential asymmetries, as suggested by previous false heart rate feedback studies).

      The rationale of these analyses is further discussed in the Data Analysis of Experiment 1 (lines 405-437).
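
      As a rough illustration of the sequence described above (omnibus one-factor repeated-measures ANOVA, then planned paired t-tests), the following sketch computes both statistics from scratch. The ratings are made-up numbers, the function names are hypothetical, and no Greenhouse-Geisser correction or p-value is computed, so it is not a reproduction of the reported analysis.

```python
from math import sqrt
from statistics import mean, stdev

# Made-up normalized ratings, one row per participant: [slower, congruent, faster]
ratings = [
    [0.95, 1.00, 1.10],
    [1.02, 1.05, 1.12],
    [0.90, 0.98, 1.01],
    [1.00, 0.97, 1.15],
    [0.98, 1.03, 1.06],
    [1.05, 1.04, 1.20],
]

def rm_anova_f(data):
    """One-way repeated-measures ANOVA: returns (F, df_condition, df_error)."""
    n, k = len(data), len(data[0])
    grand = mean(v for row in data for v in row)
    cond_means = [mean(row[j] for row in data) for j in range(k)]
    subj_means = [mean(row) for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    # Residual after removing subject and condition effects
    ss_err = sum(
        (data[i][j] - subj_means[i] - cond_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df_cond) / (ss_err / df_err), df_cond, df_err

def paired_t(x, y):
    """Paired-samples t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs))), len(diffs) - 1

slower, congruent, faster = (list(col) for col in zip(*ratings))
print("omnibus F, df:", rm_anova_f(ratings))
print("faster vs slower t, df:", paired_t(faster, slower))
print("faster vs congruent t, df:", paired_t(faster, congruent))
print("slower vs congruent t, df:", paired_t(slower, congruent))
```

      With only two conditions the omnibus F equals the squared paired t, which is one way to see why a planned contrast carries the same information as the corresponding ANOVA term.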

      Although we report the omnibus one-factor RM-ANOVAs to satisfy conventional expectations, we note that such tests are not statistically necessary, nor even optimal, when the research question is fully captured by a priori, theory-driven contrasts. Extensive methodological work shows that, in this situation, going straight to planned contrasts maximises power without inflating Type I error and avoids the logical circularity of first testing an effect one does not predict (e.g., Rosenthal & Rosnow, 1985). In other words, an omnibus F is warranted only when one wishes to protect against unspecified patterns of differences. Here our hypotheses were precise (Faster ≠ Slower; potential asymmetry relative to Congruent), so the planned paired comparisons would have sufficed statistically. We therefore include the RM-ANOVAs solely for readers who expect to see them, but our inferential conclusions rest on the theoretically motivated contrasts.

      Rosenthal, R., & Rosnow, R. L. (1985). Contrast analysis. New York: Cambridge.

      (6) Correlations: were there correlations between subjects' own heartbeats (which are considered a predictive cue) and pain perceptions? This is critical to show that the two are in fact related.

      We thank the reviewer for this thoughtful suggestion. We agree that testing for a correlation between anticipatory heart rate responses and subjective pain ratings is theoretically relevant. However, we have not conducted this analysis in the current manuscript, as our study was not designed or powered to reliably detect such individual differences. As noted by Hedge, Powell, and Sumner (2018), robust within-subject experimental designs tend to minimize between-subject variability in order to detect clear experimental effects. This reduction in variance at the between-subject level limits the reliability of correlational analyses involving trait-like or individual response patterns. This issue, known as the reliability paradox, highlights that measures showing robust within-subject effects may not show stable individual differences. Correlations with other individual-level variables (like the subjective ratings used here) therefore require much larger samples to produce interpretable results than are available here (or commonly used in the literature), typically more than 200 participants. For these reasons, we believe that running such an analysis in our current dataset would not yield informative results and could be misleading.

      We now explicitly acknowledge this point in the revised version of the manuscript (Limitations and future directions, lines 832-851) and suggest that future studies specifically designed to examine individual variability in anticipatory physiological responses and pain perception would be better suited to address this question.

      Hedge, C., Powell, G., & Sumner, P. (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50(3), 1166-1186. https://doi.org/10.3758/s13428-017-0935-1

      (7) The direct comparison between studies is great! and finally the use of ANOVA - but why without the appropriate post-hoc tests to support the bold claims in lines 542-544? This is needed. Same for 556-558.

      We apologize if our writing was not clear here, but the result of the ANOVAs fully warrants the claims in 542-544 (now lines 616-618) and 556-558 (now lines 601-603).

      In a 2x2 design, the interaction term is mathematically identical to comparing the difference induced by Factor 1 at one level of Factor 2 with the same difference induced at the other level of Factor 2. In our 2x2 analysis with the factors Experiment (Cardiac feedback, Exteroceptive feedback - between participants) and Feedback Frequency (faster, slower - within participants), the interaction therefore directly tests whether the effect of Feedback Frequency differs statistically (i.e., is larger or smaller) between the participants of the interoceptive and exteroceptive experiments. Thus, the conclusion that “faster feedback affected the perceptual bias more strongly in the Experiment 1 than in Experiment 2” captures the outcome of the significant interaction exactly. Indeed, this test is statistically equivalent to (and produces the same p value as) a simple between-group t-test comparing each participant’s faster-minus-slower difference in the interoceptive group with the analogous differences in the exteroceptive group, as illustrated in standard examples of factorial analysis (see, e.g., Maxwell, Delaney and Kelley, 2018).

      Please note that, for the above reason, mathematically the conclusion of larger effects in one experiment than the other is licensed by the significant interaction even without follow-up t-tests. However, if the reader would like to see these tests, they are simply the main analysis results reported in each of the two experiment sections, where significant (t-test) differences between faster and slower feedback were induced with interoceptive cues (Experiment 1) but not exteroceptive cues (Experiment 2). Reporting them in the between-experiment comparison section again would therefore be redundant.
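
      The equivalence invoked here can be checked numerically. The sketch below uses made-up scores and hypothetical function names, and assumes equal group sizes. It computes the Experiment x Feedback Frequency interaction F from the raw cell data and compares it with the squared between-group t on the faster-minus-slower differences.

```python
from math import sqrt
from statistics import mean, variance

# Made-up scores: one (faster, slower) pair per participant
intero = [(5.1, 4.0), (6.0, 4.8), (5.5, 5.0), (6.2, 4.9), (5.8, 4.6)]
extero = [(5.0, 4.9), (5.4, 5.5), (4.8, 4.6), (5.6, 5.3), (5.1, 5.2)]

def between_group_t(a, b):
    """Pooled-variance t-test on each group's faster-minus-slower differences."""
    da = [f - s for f, s in a]
    db = [f - s for f, s in b]
    na, nb = len(da), len(db)
    pooled = ((na - 1) * variance(da) + (nb - 1) * variance(db)) / (na + nb - 2)
    return (mean(da) - mean(db)) / sqrt(pooled * (1 / na + 1 / nb))

def interaction_f(a, b):
    """Interaction F of a 2 (between) x 2 (within) design, equal group sizes."""
    assert len(a) == len(b), "this sketch assumes equal group sizes"
    groups, n = [a, b], len(a)
    grand = mean(v for g in groups for pair in g for v in pair)
    cond = [mean(pair[j] for g in groups for pair in g) for j in range(2)]
    ss_int = ss_err = 0.0
    for g in groups:
        g_mean = mean(v for pair in g for v in pair)
        cell = [mean(pair[j] for pair in g) for j in range(2)]
        ss_int += n * sum((cell[j] - g_mean - cond[j] + grand) ** 2 for j in range(2))
        for pair in g:
            s_mean = mean(pair)
            # Subject x condition residual within each group (the error term)
            ss_err += sum((pair[j] - s_mean - cell[j] + g_mean) ** 2 for j in range(2))
    df_err = 2 * (n - 1)
    return ss_int / (ss_err / df_err)  # interaction has 1 degree of freedom

t = between_group_t(intero, extero)
print("t squared:", t ** 2)
print("interaction F:", interaction_f(intero, extero))
```

      Because F(1, df) is the square of t(df), the two tests also share the same p value.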

      To avoid this lack of clarity, we have now re-written the results section of each experiment. First, as noted above, we now precede our main hypothesis test - the crucial t-test comparing heart rate and pain ratings after faster vs slower feedback - with an ANOVA including all three levels (faster, congruent, slower feedback). Moreover, we removed the separate between-experiment comparison section. Instead, in the Results section of the exteroceptive Experiment 2, we now compare the (absent or reversed) effects of faster vs slower feedback directly, using a between-groups t-test, with the effects present in the interoceptive Experiment 1. This shows conclusively, and hopefully more clearly, that the effects in the two experiments differ. We hope that this makes the logic of our analyses clearer.

      Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). Designing experiments and analyzing data: A model comparison perspective. Routledge.

      (8) The discussion is missing a limitation paragraph.

      Thank you for the suggestion. We have now added a dedicated limitations paragraph in the Discussion section (lines 832-890).

      Additional recommendations:

      Minor (chronological order):

      (1) Sample size calculations for both experiments: what was the effect size based on? A citation or further information is needed. Also, clarify why the effect size differed between the two experiments.

      Please see above

      (2) "Participants were asked to either not drink coffee or smoke cigarettes" - either is implying that one of the two was asked. I suspect it is redundant as both were not permitted.

      The intention was to restrict both behaviors, so we have corrected the sentence to clarify that participants were asked not to drink coffee or smoke cigarettes before the session.

      (3) Normalization of ECG - what exactly was normalized, namely what measure of the ECG?

      The normalized measure was the heart rate, expressed in beats per minute (bpm). We now clarify this in the Data Analysis section of Experiment 1 (“Measures of the heart rate recorded with the ECG (beats per minute) in the feedback phase were normalized”).

      (4) Line 360: "Mean Δ pain unpleasantness ratings were analysed analogously" - this is unclear, if already described in methods then should be removed here, if not - should be further explained here.

      Thank you for your observation. We are no longer using change scores.

      (5) Lines 418-420: "Consequently, perceptual and cardiac modulations associated with the feedback manipulation should be reduced over the exposure to the faster exteroceptive sound." - why reduced and not unchanged? I didn't follow the logic.

      We chose the term “reduced” rather than “unchanged” to remain cautious in our interpretation. Statistically, the absence of a significant effect in one experiment does not necessarily mean that no effect is present; it simply means we did not detect one. For this reason, we avoided using language that would suggest complete absence of modulation. It also more closely matches the results of the between-experiment comparisons that we report in the Results section of Experiment 2, which can in principle only show that the effect in Experiment 2 was smaller than that of Experiment 1, not that it was absent. Even the TOST analysis that we use to show the absence of an effect can only show that any effect that is present is smaller than we could reasonably expect to detect with our experimental design, not that it is completely absent.

      Also, on a theoretical level, pain is a complex, multidimensional experience influenced not only by sensory input but also by cognitive, emotional, social and expectancy factors. For this reason, we considered it important to remain open to the possibility that other mechanisms beyond the misleading cardiac prior induced by the feedback might have contributed to the observed effects. If such other influences had contributed to the induced differences between faster and slower feedback in Experiment 1, some remainder of this difference could have been observed in Experiment 2 as well.

      Thus, for both statistical and theoretical reasons, we were careful to predict a reduction of the crucial difference, not its complete elimination. However, to warrant the possibility that effects could be completely eliminated we now write that “perceptual and cardiac modulations associated with the feedback manipulation should be reduced or eliminated with exteroceptive feedback”

      (6) Study 2 generation of feedback - was this again tailored per participants (25% above and beyond their own HR at baseline + gradually increasing or decreasing), or identical for everyone?

      Yes, in Study 2, the generation of feedback was tailored to each participant, mirroring the procedure of Experiment 1. Specifically, the feedback was set to be 25% above or below their baseline heart rate, with the feedback gradually increasing or decreasing. This individualized approach ensured that each participant experienced feedback relative to their own baseline heart rate. We now clarify this in the Methods section (lines 306-318).

      (7) I did not follow why we need the TOST and how to interpret its results.

      We thank the reviewer for raising this important point. In classical null hypothesis significance testing (NHST), a non-significant p-value (e.g., p > .05) only indicates that we failed to find a statistically significant difference, not that there is no difference. It therefore does not allow us to conclude that two conditions are equivalent – only that we cannot confidently say they are different. In our case, to support the claim that exteroceptive feedback does not induce perceptual or physiological changes (unlike interoceptive feedback), we needed a method to test for the absence of a meaningful effect, not just the absence of a statistically detectable one.

      The TOST (Two One-Sided Tests) procedure reverses the logic of NHST by testing whether the observed effect falls within a predefined equivalence interval, called the smallest effect size of interest (SESOI) that is in principle measurable with our design parameters (e.g., type of test, number of participants). This approach is necessary when the goal is not to detect a difference, but rather to demonstrate that an observed effect is so small that it can be considered negligible – or at the least smaller than we could in principle expect to observe in the given experiment. We used the TOST procedure in Experiment 2 to test for statistical equivalence between the effects of faster and slower exteroceptive feedback on pain ratings and heart rate.
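
      To make the logic of the procedure concrete, here is a minimal sketch of a paired-samples TOST. The data are made up, the function names are hypothetical, and the Student-t CDF is obtained by numerical integration so that only the standard library is needed. The equivalence bound is expressed in standardized units (d_z); 0.37 is the SESOI quoted for Experiment 2. Applying it on the d_z scale is an assumption of this sketch.

```python
from math import exp, lgamma, pi, sqrt
from statistics import mean, stdev

def t_cdf(x, df, steps=2000):
    """Student-t CDF via Simpson's rule on the density (steps must be even)."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    pdf = lambda t: c * (1 + t * t / df) ** (-(df + 1) / 2)
    h = abs(x) / steps
    area = pdf(0.0) + pdf(abs(x))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def tost_paired(diffs, sesoi_dz, alpha=0.05):
    """Two one-sided tests on paired differences against +/- sesoi_dz."""
    n = len(diffs)
    m, s = mean(diffs), stdev(diffs)
    se = s / sqrt(n)
    bound = sesoi_dz * s                     # equivalence bound in raw units
    t_lower = (m + bound) / se               # H0: effect <= -bound
    t_upper = (m - bound) / se               # H0: effect >= +bound
    p_lower = 1 - t_cdf(t_lower, n - 1)
    p_upper = t_cdf(t_upper, n - 1)
    # Equivalence is claimed only if BOTH one-sided tests reject
    return p_lower, p_upper, max(p_lower, p_upper) < alpha

# Made-up faster-minus-slower differences with a mean of exactly zero
large_sample = [-1.0, 1.0] * 20
small_sample = [-1.0, 1.0] * 5
print(tost_paired(large_sample, sesoi_dz=0.37))
print(tost_paired(small_sample, sesoi_dz=0.37))
```

      The two calls use the same zero mean difference but different sample sizes, which illustrates that a TOST can only declare equivalence when the design is precise enough to rule out effects as large as the bound.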

      We hope that the clearer explanation now provided in the Data Analysis section of Experiment 2 (lines 5589-563) fully addresses the reviewer’s concern.

      (8) Lines 492-3: authors say TOST significant, while p value = 0.065

      We thank the reviewer for spotting this inconsistency. The discrepancy was due to a typographical error in the initial manuscript. During the revision of the paper, we rechecked and fully recomputed all TOST analyses, and the results have now been corrected throughout the manuscript to accurately reflect the statistical outcomes. In particular, for the comparison of heart rate between faster and slower exteroceptive feedback in Experiment 2, the corrected TOST analysis now shows a significant equivalence, with the observed effect size being d = -0.19 (90% CI [-0.36, -0.03]) and both one-sided tests yielding p = .025 and p < .001. These updated results are reported in the revised Results section.

      Reviewer #2 (Recommendations For The Authors):

      I would suggest the authors revise their definition of pain in the introduction, since it is not always a protective experience. The new IASP definition specifically takes this into consideration.

      We thank the reviewer for this suggestion. We have updated the definition of pain in the Introduction (lines 2-4) to align with the most recent IASP definition (2020), which characterizes pain as “an unpleasant sensory and emotional experience associated with, or resembling that associated with, actual or potential tissue damage” (lines 51-53).

      The work on exteroceptive cues does not necessarily neglect the role of interoceptive sources of information, although it is true that it has been comparatively less studied. I suggest rephrasing this sentence to reflect this.

      We thank the reviewer for pointing out this important nuance. We agree that studies employing exteroceptive cues to modulate pain perception do not necessarily neglect the role of interoceptive sources, even though these are not always the primary focus of investigation. Our intention was not to imply a strict dichotomy, but rather to highlight that interoceptive mechanisms have been comparatively under-investigated. We have revised the sentence in the Introduction accordingly to better reflect this perspective (Introduction, lines 110-112, “Although interoceptive processes may have contributed to the observed effects, these studies did not specifically target interoceptive sources of information within the inferential process.”).

      The last paragraph of the introduction (lines 158-164) contains generalizations beyond what can be supported by the data and the results, about the generation of predictive processes and the origins of these predictions. The statements regarding the understanding of pain-related pathologies in terms of chronic aberrant predictions in the context of this study are also unwarranted.

      We have deleted this paragraph now.

      I could not find the study registration (at least in clinicaltrials.gov). This is curious considering that the hypothesis and the experimental design seem in principle well thought out, and a study pre-registration improves the credibility of the research (Nosek et al., 2018). I also find the choice for the smallest effect of interest (SESOI) odd. Besides the unnecessary variable transformations (more on that later), there is no justification for why that particular SESOI was chosen, or why it changes between experiments (Dienes, 2021; King, 2011), which makes the choice look arbitrary. The SESOI is a fundamental component of a priori power analysis (Lakens, 2022), and without rationale and preregistration, it is impossible to tell whether this is a case of SPARKing or not (Sasaki & Yamada, 2023).

      We acknowledge that the study was not preregistered. Although our hypotheses and design were developed a priori and informed by established theoretical frameworks, the lack of formal preregistration is a limitation.

      The SESOI values for Experiments 1 and 2 were derived from sensitivity analyses based on the fixed design parameters (type of test, number of participants, alpha level) of our study, not from any post-hoc interpretation based on observed results; they therefore cannot be a case of SPARKing. Following current recommendations (Anderson, Kelley & Maxwell, 2017; Albers & Lakens, 2018; Lakens, 2022), we avoided basing power estimates on published effect sizes, as no such values exist for novel paradigms, and published values are typically inflated due to publication and other biases. Instead, sensitivity analyses (using G*Power, v 3.1) allow us to calculate, prospectively, the smallest effect each design could detect with 90% power, given the actual sample size, test type, and α level. Because more participants were excluded in Experiment 2, this design can only detect slightly larger effects (d = 0.62) than Experiment 1 (d = 0.57). Please note that both studies therefore remain well-powered to capture effects of the magnitude typically reported in previous research using feedback manipulations to explore interoceptive illusions (e.g., Iodice et al., 2019, d ≈ 0.7).

      We have added this clarification to the Participants section of Experiment 1 (Lines 208-217).
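
      The logic of such a sensitivity analysis can be sketched with a normal approximation for a two-sided paired (one-sample) test. This is not G*Power's exact noncentral-t computation, so the values will differ slightly from those reported. The sample sizes below are hypothetical and are not the study's, and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_dz(n, alpha=0.05, power=0.90):
    """Smallest standardized effect (d_z) detectable with the given power.

    Normal approximation: d = (z_{1 - alpha/2} + z_{power}) / sqrt(n).
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n)

# Hypothetical sample sizes: fewer retained participants means that
# only larger effects remain detectable at the same power.
for n in (34, 30):
    print(n, round(min_detectable_dz(n), 2))
```

      The direction of the relationship is the point of the sketch: excluding more participants raises the smallest effect the design can detect at a fixed power.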

      Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychological Science, 28(11), 1547-1562.

      Lakens, D. (2022). Sample size justification. Collabra: psychology, 8(1), 33267.

      Albers, C., & Lakens, D. (2018). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of experimental social psychology, 74, 187-195.

      In the Apparatus subsection, it is stated that the intensity of the electrical stimuli was fixed at 2 ms. I believe the authors refer to the duration of the stimulus, not its intensity.

      You are right, thank you for pointing that out. The text should refer to the duration of the electrical stimulus, not its intensity. We have corrected this wording in the revised manuscript to avoid confusion.

      It would be interesting to report (in graphical form) the stimulation intensities corresponding to the calibration procedure for the five different pain levels identified for all subjects.

      That's a good suggestion. We have included a supplementary figure showing the stimulation intensities corresponding to the five individually calibrated pain levels across all participants (Supplementary Figure 11).

      It is questionable that researchers state that "pain and unpleasantness should be rated independently" but then the first level of the Likert scale for unpleasantness is "1=no pain". This is particularly relevant since stimulation (and specifically electrical stimulation) can be unpleasant but non-painful at the same time. Since the experiments were already performed, the researchers should at least explain this choice.

      Thank you for raising this point. You are right that the label “no pain” in the pain unpleasantness scale was not ideal, and we now acknowledge this in the text (lines 886-890). Please note that unpleasantness was always the second rating that participants gave (after pain intensity), and the strongest results come from the first rating (pain intensity).

      Discussion.

      I did not find in the manuscript the rationale for varying the frequency of the heart rate by 25% (instead of any other arbitrary quantity).

      We thank the Reviewer for this observation, which prompted us to clarify the rationale behind our choice of a ±25% manipulation of heart rate feedback. False feedback paradigms have historically relied on a variety of approaches to modulate perceived cardiac signals. Some studies have adopted non-individualised values, using fixed frequencies (e.g., 60 or 110 bpm) to evoke states of calm or arousal, independently of participants’ actual physiology (Valins, 1966; Shahidi & Baluch, 1991; Crucian et al., 2000; Tajadura-Jiménez et al., 2008). Others have used the participant’s real-time heart rate as a basis, introducing accelerations or decelerations without applying a specific percentage transformation (e.g., Iodice et al., 2019). More recently, a growing body of work has employed percentage-based alterations of the instantaneous heart rate, offering a controlled and participant-specific manipulation. These include studies using −20% (Azevedo et al., 2017), ±30% (Dey et al., 2018), and even ±50% (Gray et al., 2007).

      These different methodologies - non-individualised, absolute, or proportionally scaled - have all been shown to effectively modulate subjective and physiological responses. They suggest that the impact of false feedback does not depend on a single fixed method, but rather on the plausibility and salience of the manipulation within the context of the task. We chose to apply a ±25% variation because it falls well within the most commonly used range and strikes a balance between producing a detectable effect and maintaining the illusion of physiological realism. The magnitude is conceptually justified as being large enough to shape interoceptive and emotional experience (as shown by Azevedo and Dey), yet small enough to avoid implausible or disruptive alterations, such as those approaching ±50%. We have now clarified this rationale in the revised Procedure paragraph of Experiment 1 (lines 306-318).

      Azevedo, R. T., Bennett, N., Bilicki, A., Hooper, J., Markopoulou, F., & Tsakiris, M. (2017). The calming effect of a new wearable device during the anticipation of public speech. Scientific reports, 7(1), 2285.

      Crucian, G. P., Hughes, J. D., Barrett, A. M., Williamson, D. J. G., Bauer, R. M., Bowers, D., & Heilman, K. M. (2000). Emotional and physiological responses to false feedback. Cortex, 36(5), 623-647.

      Dey, A., Chen, H., Billinghurst, M., & Lindeman, R. W. (2018, October). Effects of manipulating physiological feedback in immersive virtual environments. In Proceedings of the 2018 Annual Symposium on Computer-Human Interaction in Play (pp. 101-111).

      Gray, M. A., Harrison, N. A., Wiens, S., & Critchley, H. D. (2007). Modulation of emotional appraisal by false physiological feedback during fMRI. PLoS one, 2(6), e546.

      Shahidi, S., & Baluch, B. (1991). False heart-rate feedback, social anxiety and self-attribution of embarrassment. Psychological reports, 69(3), 1024-1026.

      Tajadura-Jiménez, A., Väljamäe, A., & Västfjäll, D. (2008). Self-representation in mediated environments: the experience of emotions modulated by auditory-vibrotactile heartbeat. CyberPsychology & Behavior, 11(1), 33-38.

      Valins, S. (1966). Cognitive effects of false heart-rate feedback. Journal of personality and social psychology, 4(4), 400.

      The researchers state that pain ratings collected in the feedback phase were normalized to the no-feedback phase to control for inter-individual variability in pain perception, as established by previous research. They cite three studies involving smell and taste, of which the last two contain the same normalization presented in this study. However, unlike these studies, the outcomes here require no normalization whatsoever, because there should be no (or very little) inter-individual variability in pain intensity ratings. Indeed, pain intensity ratings in this study are anchored to 30, 50, and 70 / 100 as a condition of the experimental design. The researchers go to extreme lengths to ensure this is the case, by adjusting stimulation intensities until at least 75% of stimulation intensities are correctly matched to their pain ratings counterpart in the pre-experiment procedure. In other words, inter-individual variability in this study is in stimulation intensities, and not pain intensity ratings. Even if it could be argued that pain unpleasantness and heart rate still need to account for inter-individual variability, the best way to do this is by using the baseline (no-feedback) measures as covariates in a mixed linear model. Another advantage of this approach is that all the effects can be described in terms of the original scales and are readily understandable, and post hoc tests between levels can be corrected for multiple comparisons. On the contrary, the familywise error rate for the comparisons between conditions in the current analysis is larger than 5% (since there is a "main" paired t-test and additional "simple" tests).

      We disagree that there is little to no variability in the no feedback phase. Participants were tested in their ability to distinguish intensities in an initial pre-experiment calibration phase. In the no feedback phase, participants rated the pain stimuli in the full experimental context.

      In the pre-experiment calibration phase, participants were tested only once in their ability to match five electrical‐stimulation levels to the 0-100 NPS scale, before any feedback manipulation started. During this pre-experiment calibration we required that each level was classified correctly on ≥ 75 % of the four repetitions; “correct” meant falling within ± 5 NPS units of the target anchor (e.g., a response of 25–35 was accepted for the 30/100 anchor). This procedure served one purpose only: to make sure that every participant entered the main experiment with three unambiguously distinguishable stimulation levels (30 / 50 / 70). We integrated this point in the revised manuscript lines 263-270.
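
      The acceptance rule described above (a response counts as correct if it falls within ± 5 NPS units of the target, and a level passes if at least 75% of its repetitions are correct) can be written as a small check. The function name and the example responses are hypothetical.

```python
def level_passes(responses, target, tolerance=5, min_correct=0.75):
    """Return True if enough responses fall within +/- tolerance of the target."""
    hits = sum(abs(r - target) <= tolerance for r in responses)
    return hits / len(responses) >= min_correct

# Hypothetical ratings for four repetitions of the stimulus calibrated to 30/100
print(level_passes([28, 33, 36, 30], target=30))  # three of four within 25-35
print(level_passes([20, 33, 36, 30], target=30))  # only two of four within 25-35
```

      With four repetitions, the 75% threshold means that at most one response per level may fall outside the accepted window.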

      Once the real task began, the context changed: shocks are unpredictable, attention is drawn to the heartbeat, and participants must judge both intensity and unpleasantness. In this full experimental setting the no-feedback block indeed shows considerable variability, even for the pain intensity ratings. Participants' mean rating on the NPS scale was 46.4, with a standard deviation of 11.9; participants thus vary quite strongly in their mean ratings (range 14.5 to 70). Moreover, while all participants show a positive correlation between actual intensities and their ratings (i.e., they rate the higher intensities as more intense than the lower ones), they vary in how much of the scale they use, with differences between reported highest and lowest intensities ranging between 8 and 91, for the participants showing the smallest and largest differences, respectively.

      Thus, while we simplified the analysis to remove the difference scoring relative to the congruent trials and now use these congruent trials as an additional condition in the analysis, we retained the normalisation procedure to account for the in-fact-existing between-participant variability, and ensure consistency with prior research (Bartolo et al., 2013; Cecchini et al., 2020; Riello et al., 2019) and our a priori analysis plan.

      However, to ensure we fully address your point here (and the other reviewers’ points about potential additional factors affecting the effects, like trial number and stimulus intensity), we also report an additional linear mixed-effects model analysis without normalization. It includes every feedback level as condition (No-Feedback, Congruent, Slower, Faster), plus additional predictors for actual stimulus intensity and trial rank within the experiment (as suggested by the other reviewers). This confirms that all relevant results remain intact once baseline and congruent trials are explicitly included in the model.

      In brief, cross‐experiment analyses demonstrated that the Faster vs Slower contrast was markedly larger when the feedback was interoceptive than when it was exteroceptive. This held for heart-rate deceleration (b = 0.94 bpm, p = .005), for increases in unpleasantness (b = -0.16 Likert units, p = .015), and in pain-intensity ratings (b = -3.27 NPS points, p = .037).

      These findings were then further confirmed by within-experiment analyses. Within the interoceptive experiment, the mixed model on raw scores replicated every original effect: heart rate was lower after Faster than Slower feedback (estimate = -0.69 bpm, p = .005); unpleasantness was higher after Faster than Slower feedback (estimate = 0.19, p < .001); pain intensity rose after Faster versus Slower (estimate = -4.285, p < .001). In the exteroceptive experiment, however, none of these Faster–Slower contrasts reached significance for heart rate (all ps > .33), unpleasantness (all ps > .43) or intensity (all ps > .10). Because these effects remain significant even with No-Feedback and Congruent trials explicitly included in the model and vanish under exteroceptive control, the supplementary, non-normalised analyses confirm that the faster vs. slower interoceptive feedback uniquely lowers anticipatory heart rate while amplifying both intensity and unpleasantness of pain, independent of data transformation or reference conditions. Please see Supplementary analyses for further details.

      Bartolo, M., Serrao, M., Gamgebeli, Z., Alpaidze, M., Perrotta, A., Padua, L., Pierelli, F., Nappi, G., & Sandrini, G. (2013). Modulation of the human nociceptive flexion reflex by pleasant and unpleasant odors. PAIN®, 154(10), 2054-2059.

      Cecchini, M. P., Riello, M., Sandri, A., Zanini, A., Fiorio, M., & Tinazzi, M. (2020). Smell and taste dissociations in the modulation of tonic pain perception induced by a capsaicin cream application. European Journal of Pain, 24(10), 1946-1955.

      Riello, M., Cecchini, M. P., Zanini, A., Di Chiappari, M., Tinazzi, M., & Fiorio, M. (2019). Perception of phasic pain is modulated by smell and taste. European Journal of Pain, 23(10), 1790-1800.

      I could initially not find a rationale for bringing upfront the comparison between faster vs. slower HR acoustic feedback when in principle the intuitive comparisons would be faster vs. congruent and slower vs. congruent feedback. This is even more relevant considering that in the proposed main comparison, the congruent feedback does not play a role: since Δ outcomes are calculated as (faster - congruent) and (slower - congruent), a paired t-test between Δ faster and Δ slower outcomes equals (faster - congruent) - (slower - congruent) = (faster - slower). I later realized that the statistical comparison (paired t-test) of pain intensity ratings of faster vs. slower acoustic feedback is significant in experiment 1 but not in experiment 2, which in principle would support the argument that interoceptive, but not exteroceptive, feedback modulates pain perception. However, the "simple" t-tests show that faster feedback modulates pain perception in both experiments, although the effect is larger in experiment 1 (interoceptive feedback) compared to experiment 2 (exteroceptive feedback).

      The comparison between faster and slower feedback is indeed crucial, and we regret not having made this clearer in the first version of the manuscript. As noted in our response to your point in the public review, this comparison is both the most statistically powerful and the most theoretically appropriate, as it controls for any influence of salience or surprise when heart rates deviate (in either direction) from what is expected. It therefore provides a clean measure of how much an accelerated heart rate affects pain perception and physiological response, relative to an equal change in the opposite direction. However, as noted above, in the new version of the manuscript we have removed the analysis via difference scores and directly compared all three relevant conditions (faster, congruent, slower), first via an ANOVA and then with follow-up planned t-tests.

      Please refer to our previous response for further details (i.e., Furthermore, the researchers propose the comparison of faster vs. slower delta HR acoustic feedback throughout the manuscript when the natural comparison is the incongruent vs. the congruent feedback [..]).

      The design of experiment two involves the selection of knocking wood sounds to act as exteroceptive acoustic feedback. Since the purpose is to test whether sound affects pain intensity ratings, unpleasantness, and heart rate, it would have made sense to choose sounds that would be more likely to elicit such changes, e.g. Taffou et al. (2021), Chen & Wang (2022), Zhou et al. (2022), Tajadura-Jiménez et al. (2010). Whereas I acknowledge that there is a difference in effect sizes between experiment 1 and experiment 2 for the faster acoustic feedback, I am not fully convinced that this difference is due to the nature of the feedback (interoceptive vs. exteroceptive), since a similar difference could arguably be obtained by exteroceptive sound with looming or rough qualities. Since the experiment was already carried out and this hypothesis cannot be tested, I suggest that the researchers moderate the inferences made in the Discussion regarding these results.

      Please refer to our previous response for a detailed answer to this point in the Public Review (i.e., This could be influenced by the fact that the faster HR exteroceptive cue in experiment 2 also shows a significant modulatory effect [..]). As we describe there, we see little grounds to suspect such a non-specific influence of acoustic parameters, as it is specifically the sensitivity to the change in heart rate (faster vs. slower) that is affected by our between-experiment manipulation, not the overall response to the different exteroceptive or interoceptive sounds. Moreover, the specific change induced by the faster interoceptive feedback, a heart rate deceleration, is not consistent with a change in arousal or alertness (which would have predicted an increase in heart rate with increasing arousal). See also Discussion-Accounting for general unspecific contributions.

      Additionally, the fact that no significant effects were found for unpleasantness ratings or heart rate (absence of evidence) should not be taken as proof that faster exteroceptive feedback does not induce an effect on these outcomes (evidence of absence). In this case, it could be that there is actually no effect on these variables, or that the experiment was not sufficiently powered to detect those effects. This would depend on the SESOIs for these variables, which as stated before, was not properly justified.

      We very much agree that the absence of significant effects should not be interpreted as definitive evidence of absence. Indeed, we were careful not to overinterpret the null findings for heart rate and unpleasantness ratings, and we conducted additional analyses to clarify their interpretation. First, the TOST analysis shows that any effects in Experiment 2 are (significantly) smaller than the smallest effect size that can possibly be detected in our experiment, given the experimental parameters (number of participants, type of test, alpha level). Second, and more importantly, we ran between-experiment comparisons (see Results Experiment 2, and Supplementary materials, Cross-experiment analysis between-subjects model) of the crucial difference in the changes induced by faster and slower feedback. These showed that the differences were larger with interoceptive cues (Experiment 1) than with exteroceptive cues (Experiment 2). Thus, even if the exteroceptive cues in Experiment 2 induce an effect that is too small to be detected in principle, it is smaller than the effect induced by interoceptive cues in Experiment 1.

      To ensure we fully address this point, we have now simplified our main analysis (main manuscript), replicated it with a different analysis (Supplementary material), motivated more clearly why the comparison between faster and slower feedback is crucial (Methods Experiment 1), and made clearer that the difference between these conditions is larger in Experiment 1 than in Experiment 2 (Results Experiment 2). Moreover, we went through the manuscript and ensured that our wording does not over-interpret the absence of effects in Experiment 2 as an absence of a difference.
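      For readers unfamiliar with the TOST logic referred to above, the following is a minimal sketch of two one-sided tests on paired differences. It is an illustration under stated assumptions, not the authors' analysis: it uses a normal (z) approximation instead of the t distribution to stay within the Python standard library, and the data, the equivalence bound, and the function name `tost_paired_z` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_paired_z(diffs, bound):
    """Two one-sided tests for equivalence on paired differences.

    Tests whether the mean difference lies inside (-bound, +bound).
    Assumption: normal approximation to the t distribution.
    Returns the larger of the two one-sided p-values.
    """
    se = stdev(diffs) / sqrt(len(diffs))
    m = mean(diffs)
    z_lower = (m + bound) / se  # H0: true mean <= -bound
    z_upper = (m - bound) / se  # H0: true mean >= +bound
    p_lower = 1 - NormalDist().cdf(z_lower)
    p_upper = NormalDist().cdf(z_upper)
    return max(p_lower, p_upper)


# Hypothetical faster-minus-slower differences for eight participants.
diffs = [0.1, -0.2, 0.05, -0.1, 0.15, -0.05, 0.0, 0.1]

p_wide = tost_paired_z(diffs, bound=1.0)     # generous equivalence bound
p_narrow = tost_paired_z(diffs, bound=0.01)  # very strict equivalence bound
```

      Equivalence is claimed only when both one-sided tests reject, i.e. when the returned p-value falls below alpha. As the reviewer notes, the conclusion depends entirely on how the bound (the SESOI) is justified.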

      The section "Additional comparison analysis between experiments" encompasses in a way all possible comparisons between levels of the different factors in both experiments. My original suggestion regarding the use of a mixed linear model with covariates is still valid for this case. This analysis also brings into question another aspect of the experimental design: what is the rationale for dividing the study into two experiments, considering that variability and confounding factors would have been much better controlled in a single experimental session that includes all conditions?

      We thank the reviewer for their comment. We would like to note, first, that the between-experiment analyses did not encompass all possible comparisons between levels, as they included only faster and slower feedback for the within-experiment comparison. Instead, they focus on the specific interaction between faster and slower feedback on the one hand, and interoceptive vs. exteroceptive cues on the other. This interaction essentially compares, for each dependent measure (HR, pain unpleasantness, pain intensity), the difference between faster and slower feedback in Experiment 1 with the same difference in Experiment 2 (and would produce identical p values to a between-experiment t-test). The significant interactions therefore indicate larger effects of interoceptive cues than exteroceptive ones for each of the measures. To make this clearer, we have now replaced the analysis with between-experiment t-tests of the difference between faster and slower feedback for each measure (Results Experiment 2), producing identical results. Moreover, as suggested, we also now report linear mixed model analyses (see Supplementary Materials), which provide a comprehensive comparison across experiments.
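      The between-experiment comparison described above can be sketched as an independent-samples t-test on per-participant (faster minus slower) difference scores. This is a minimal illustration with invented numbers; the pooled-variance Student's t, the function name `independent_t`, and the data are assumptions and not taken from the manuscript.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(x, y):
    """Student's t statistic (pooled variance) for two independent groups."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))


# Hypothetical per-participant (faster - slower) differences in each experiment.
exp1_diff = [7, 5, 7, 1, 8, 3]    # interoceptive feedback
exp2_diff = [1, -2, 3, 0, 2, -1]  # exteroceptive feedback

# A positive value means the faster-vs-slower difference is larger in Experiment 1.
t_between = independent_t(exp1_diff, exp2_diff)
```

      Each participant contributes a single difference score, so the test asks only whether the faster-vs-slower effect differs between the two experiments, which is the question the interaction term addresses.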

      Regarding the experimental design, we appreciate the reviewer’s suggestion regarding a within-subject crossover design. While such an approach indeed offers greater statistical power by reducing interindividual variability (Charness, Gneezy, & Kuhn, 2012), we intentionally chose a between-subjects design due to theoretical and methodological considerations specific to deceptive feedback paradigms. First, carryover effects are a major concern in deception studies. Participants exposed to one type of feedback could develop suspicion or adaptive strategies that would alter their responses in subsequent conditions (Martin & Sayette, 1993). Expectancy effects could thus contaminate results in a crossover design, particularly when feedback manipulation becomes apparent. In line with this idea, past studies on false cardiac feedback (e.g., Valins, 1966; Pennebaker & Lightner, 1980) often employed between-subjects or blocked designs to maintain the ecological validity of the illusion.

      Charness, G., Gneezy, U., & Kuhn, M. A. (2012). Experimental methods: Between-subject and within-subject design. Journal of economic behavior & organization, 81(1), 1-8.

      Martin, C. S., & Sayette, M. A. (1993). Experimental design in alcohol administration research: limitations and alternatives in the manipulation of dosage-set. Journal of studies on alcohol, 54(6), 750-761.

      Pennebaker, J. W., & Lightner, J. M. (1980). Competition of internal and external information in an exercise setting. Journal of personality and social psychology, 39(1), 165.

      Valins, S. (1966). Cognitive effects of false heart-rate feedback. Journal of personality and social psychology, 4(4), 400.

      References

      Chen ZS, Wang J. Pain, from perception to action: A computational perspective. iScience. 2022 Dec 1;26(1):105707. doi: 10.1016/j.isci.2022.105707.

      Dienes Z. Obtaining Evidence for No Effect. Collabra: Psychology 2021 Jan 4; 7 (1): 28202. doi: 10.1525/collabra.28202

      King MT. A point of minimal important difference (MID): a critique of terminology and methods. Expert Rev Pharmacoecon Outcomes Res. 2011 Apr;11(2):171-84. doi: 10.1586/erp.11.9.

      Lakens D. Sample Size Justification. Collabra: Psychology 2022 Jan 5; 8 (1): 33267. doi: 10.1525/collabra.33267

      Nosek BA, Ebersole CR, DeHaven AC, Mellor DT. The preregistration revolution. Proc Natl Acad Sci U S A. 2018 Mar 13;115(11):2600-2606. doi: 10.1073/pnas.1708274114.

      Sasaki K, Yamada Y. SPARKing: Sample-size planning after the results are known. Front Hum Neurosci. 2023 Feb 22;17:912338. doi: 10.3389/fnhum.2023.912338.

      Taffou M, Suied C, Viaud-Delmon I. Auditory roughness elicits defense reactions. Sci Rep. 2021 Jan 13;11(1):956. doi: 10.1038/s41598-020-79767-0.

      Tajadura-Jiménez A, Väljamäe A, Asutay E, Västfjäll D. Embodied auditory perception: The emotional impact of approaching and receding sound sources. Emotion. 2010;10(2):216-229. doi: 10.1037/a0018422.

      Zhou W, Ye C, Wang H, Mao Y, Zhang W, Liu A, Yang CL, Li T, Hayashi L, Zhao W, Chen L, Liu Y, Tao W, Zhang Z. Sound induces analgesia through corticothalamic circuits. Science. 2022 Jul 8;377(6602):198-204. doi: 10.1126/science.abn4663.

      Reviewer #3 (Recommendations For The Authors):

      The manuscript would benefit from some spelling- and grammar checking.

      Done

      Discussion:

      The discussion section is rather lengthy and would benefit from some re-structuring, editing, and sub-section headers.

      In response, we have restructured and edited the Discussion section to improve clarity and flow.

      I personally had a difficult time understanding how the data relates to the rubber hand illusion (l.623-630). I would recommend revising or deleting this section.

      We thank the reviewer for this valuable feedback. We have revised the paragraph and made the parallel clearer (lines 731-739).

      Other areas are a bit short and might benefit from some elaboration, such as clinical implications. Since they were mentioned in the abstract, I had expected a bit more thorough discussion here (l. 718).

      Thank you for this suggestion. We have expanded the discussion to more thoroughly address the clinical implications of our interoceptive pain illusion (See Limitations and Future Directions paragraph).

      Further, clarification is needed for the following:

      I would like some more details on participant instructions; in particular, the potential difference in instruction between Exp. 1 and 2, if any. In Exp. 1, it says: (l. 280) "Crucially, they were also informed that over the 60 seconds preceding the administration of the shock, they were exposed to acoustic feedback, which was equivalent to their ongoing heart rate". Was there a similar instruction for Exp. 2? If yes, it would suggest a more specific effect of cardiac auditory feedback; if no, the ramifications of this difference in instructions should be more thoroughly discussed.

      Thank you for this suggestion. We have clarified this point in the Procedure of Experiment 2 (548-550).

    1. Reviewer #3 (Public review):

      Wang et al. report multiple experiments using functional magnetic resonance spectroscopy (fMRS) in a multiple object tracking (MOT) task to investigate the effect of experimentally manipulating a) the number of targets, b) object size, and c) total number of objects in the display on GABA and glutamate (Glx) concentrations in parietal and visual cortex. Data are analyzed in two orthogonal ways throughout: via condition differences in behavioral performance (inverse efficiency), GABA, and Glx concentrations, and through correlations between changes in inverse efficiency and GABA or Glx. All three experimental manipulations affected inverse efficiency, with worse performance with more targets, smaller objects, and a larger total number of objects. However, only the manipulation of the target number produced a condition difference in GABA and Glx, with higher concentrations of both in the parietal VOI and only of Glx in the visual VOI with more targets ('high load'). Correlational analyses revealed that participants with a larger change in GABA in the parietal VOI with a higher number of targets showed a smaller drop in behavioral performance with more targets. The opposite direction of correlation was observed for Glx in both the visual and parietal VOI.

      In the two control experiments, correlations were only investigated in the parietal VOI. There was a negative correlation between change in Glx and change in inverse efficiency with manipulation of object size, i.e. participants exhibiting a positive change in Glx showed no or little difference in performance, but those with an increase in Glx with smaller targets showed a more pronounced drop in performance. There was no correlation with GABA for the manipulation of object size. For the manipulation of total object number, participants exhibiting an increasing GABA concentration with more objects showed a smaller drop in performance.

      The authors' main claim is that GABAergic suppression of goal-irrelevant distractors in parietal cortex is key to goal-directed visual information processing.

      The study is, to my knowledge, the first to employ fMRS in an MOT paradigm, and I read it with great interest. I am admittedly not an expert on the fMRS technique and have therefore refrained from commenting on the technical aspects of its use. Although the application of fMRS to MOT is novel and adds new knowledge to the field, I have some critiques and believe that a much more nuanced interpretation of the findings is warranted.

      Major

      (1) Especially the control experiments lean heavily on Bettencourt and Somers (2009) and adopt and to some extent exaggerate claims from that paper uncritically. This is obvious in referring to the manipulations of object size and object number as high/low enhancement and high/low suppression, as if the association of these physical manipulations of the stimulus display with attentional mechanisms were so obvious and beyond doubt that drawing any distinction between these manipulations and their supposed effects is entirely superfluous. This seems far beyond what is warranted to me. It may seem plausible that adding distractors engages distractor suppression more, but whether this is truly the case is an empirical question, and Bettencourt and Somers (2009) have no direct measure of distractor suppression to substantiate this claim. Their study is purely behavioral, and there is no attempt to assess distractor processing separately. The case for the 'target enhancement' manipulation is even weaker: objects are of a sufficient size and at maximum contrast (white on black screen, but exact details are omitted) to be clearly visible in either condition, so why would smaller objects require more enhancement? Although the present data shows a clear effect of manipulating object size, the corresponding size of the effect in Bettencourt and Somers (2009) is rather underwhelming and does not warrant such a strong conclusion. In summary, the link between the object number and object size manipulations with suppression and enhancement is very far from the 1:1 that the authors seem to assume. Accordingly, I believe that the manipulations should be labelled as object number and object size rather than their hypothesized effects, throughout and that there should be a much more critical discussion as to whether these manipulations are indeed related to these effects as expected.

      (2) The authors' interpretation of the results seems rather uncritical. What is observed (at least in the first experiment) is a change in GABA and Glx concentrations with changes in the number of tracked targets. Is target enhancement and distractor suppression the only conceivable way in which this could happen? The processing of targets and distractors is not measured directly, so any claims are indirect, at best. The authors cite the recent 'Ten simple rules to study distractor suppression' paper (Wöstmann et al., 2022), which presents a consensus between leading researchers in the field. Neither Bettencourt & Somers (2009) nor the design of the current study lives up to the rules established in that paper, so a much more nuanced interpretation and discussion of the current findings seems warranted. It is anything but obvious to me that the only activity in the parietal cortex that could possibly be suppressed by GABA is the representation of distractors. Indeed, cueing more targets (high load) decreases the number of distractors in the first experiment, so the need for distractor suppression in the high load condition is less than in the low load condition. So, shouldn't we observe lower GABA concentrations in the 'high load' condition?

      (3) It seems that the authors included data from both correctly tracked and incorrectly tracked trials in their fMRS analysis. In MOT, attending target objects is the task per se, so task errors indicate that participants did not actually track the targets. So when comparing conditions with different error levels, it is ambiguous whether changes in brain activity reflect the experimental manipulation as such, or rather the different mix of correctly tracked and incorrectly tracked trials that result from this physical manipulation. Are the correlations perhaps driven by the inclusion of different proportions of correctly tracked trials across participants? It seems that the authors may have to separate correct and error trials in the analysis to check for the possibility that effects are due to the inclusion of data from trials in which participants may have stopped tracking at least some of the target objects. Of course, such an analysis is somewhat limited by the fact that only one target was probed, yielding a 50% guessing chance (i.e. even if the response is correct, we do not know whether the other, unprobed, objects were tracked correctly on that trial).

      (4) The key findings from the control experiments are purely correlational. The supposed cause may be what the authors claim, but there is an infinity of alternative explanations. Correlational findings cannot simply be interpreted as if they resulted from an experimental manipulation (...although this is, unfortunately, by no means rare in the cognitive neuroscience literature). The authors should make a rigorous effort to consider the most plausible alternative explanations for these correlations and argue why or why not they believe that they can be discounted.

      (5) Related to the previous point: the experimental manipulations did not produce mean differences in GABA/Glx in the control experiments. Doesn't this speak against the authors' interpretation? They briefly acknowledge this in the discussion, but I think there is a deeper problem. The absence of these effects casts doubt on what these manipulations actually do, and therefore also on the interpretation of the correlations in these experiments. For example, the authors might also have concluded from the same data that the absence of increased GABA in the 'high suppression' condition refutes the very idea that GABA concentrations are related to distractor suppression.

      (6) 'Inverse Efficiency' is a highly unusual measure of MOT performance in the literature, and its use reduces the comparability of the findings with previous work. The standard is to assess the correctness ('accuracy') of responses with no focus on speed. This makes sense as responses are given after the object motion has stopped. At the same time, reaction time can be informative too (e.g., Störmer et al., 2013). I think the authors should justify their use of inverse efficiency as the dependent variable.

      (7) The choice of variable names is problematic: it is sometimes misleading and makes understanding the findings harder (see also points 1 and 6): obvious, unambiguous, and importantly, interpretation free names for conditions such as target number (2/4), object size (small/large), and total object number (8/12) become load (high/low), target enhancement (high/low) and distractor suppression (low/high). This reduces clarity and, especially in the case of enhancement and suppression, conflates the actual manipulation with its interpretation.
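      As context for point (6), the following is a minimal sketch of how an inverse efficiency score is conventionally computed, assuming the common definition of mean correct reaction time divided by the proportion of correct responses. Whether the authors used exactly this formula is an assumption; the function name and the numbers are hypothetical.

```python
def inverse_efficiency(mean_rt_ms, proportion_correct):
    """Conventional inverse efficiency score: mean RT divided by accuracy.

    Higher values indicate worse performance (slower and/or less accurate).
    """
    if not 0 < proportion_correct <= 1:
        raise ValueError("proportion_correct must be in (0, 1]")
    return mean_rt_ms / proportion_correct


# Same mean reaction time, different accuracy (hypothetical values).
ies_accurate = inverse_efficiency(900, 0.75)
ies_at_chance = inverse_efficiency(900, 0.5)
```

      Because the score combines speed and accuracy in a single number, two conditions with identical accuracy can still differ in inverse efficiency, which is one reason comparability with accuracy-only MOT studies is reduced.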

    1. We also see this phrase used to say that things seen on social media are not authentic, but are manipulated, such as people only posting their good news and not bad news, or people using photo manipulation software to change how they look

      I think this is an interesting concept to think about, as we are usually conditioned to think that the internet "isn't real", that most things online are fabricated, exaggerated, etc. However, I do think that just because this is common online, it's not to say that "real life" is a place where everyone is completely authentic and themselves, as some people may feel that they only want to share the good parts of their lives with their friends or family, while keeping anything that wouldn't be considered "good" to themselves, and vice versa. I do think it's hasty to say that all that we see on social media "is not real", as there are plenty of real people behind each account, but we must consider that because people are able to be behind potentially anonymous accounts, it is much easier to fabricate stories or life experiences, or to center one's entire online presence around a portion of their life they want the internet to see, essentially artificially creating an online persona that is not reflective of who they are in real life.

    1. What do you think is the responsibility of tech workers to think through the ethical implications of what they are making?

      As an engineer, I understand why tech workers may not think through ethical implications as we are really passionate about creating things, and investors may be pressuring engineers to push out products fast. However, technology should always be made with the goal of helping humanity and safeguards should be created to protect all of us.

      I think it is very interesting that the people Kumail talked to did not have answers, as it suggests that the people creating technology may not be prioritizing our well-being.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife assessment 

      This useful study reports that the exogenous expression of the microRNA miR-195 can partially compensate in early B cell development for the loss of EBF1, one of the key transcription factors in B cells. While this finding will be of interest to those studying lymphocyte development, the evidence, particularly with regard to the molecular mechanisms that underpin the effect of miR-195, is currently incomplete. 

      Public Reviews: 

      Reviewer #1 (Public review):

      Summary: 

      Here, the authors propose that miR-195, a microRNA that has been shown to bind mRNA targets and enhance their degradation in the regulation of cell processes, has a novel role in allowing the emergence of CD19+ cells in cells in which Ebf1, a critical B-cell transcription factor, has been genetically removed.

      Strengths: 

      That over-expression of miR-195 can allow the emergence of CD19+ cells missing Ebf1 is somewhat novel.

      Their data does perhaps support to a degree the emergence of a transcriptional network that may bypass the absence of Ebf1, including the FOXO1 transcription factor, but this data is not strong or definitive. 

      Weaknesses: 

      It is unclear whether this observation is in fact physiological. When the authors analyse a knockout model of miR-195, there is not much of a change in the B-cell phenotype. Their findings may therefore be an artefact of an overexpression system. 

      The authors have provided insufficient data to allow a thorough appraisal of the stepwise molecular changes that could account for their observed phenotype. 

      Reviewer #2 (Public review): 

      Summary: 

      The authors investigate miRNA miR-195 in the context of B-cell development. They demonstrate that ectopic expression of miR-195 in hematopoietic progenitor cells can, to a considerable extent, override the consequences of deletion of Ebf1, a central B-lineage-defining transcription factor, in vitro and upon short-term transplantation into immunodeficient mice in vivo. In addition, the authors demonstrate that the reverse experiment, genetic deletion of miR-195, has virtually no effect on B-cell development. Mechanistically, the authors identify Foxo1 phosphorylation as one pathway partially contributing to the rescue effect of miR-195. An additional analysis of epigenetics by ATACseq adds potential additional factors that might also contribute to the effect of ectopic expression of miR-195.

      Strengths: 

      The authors employ a robust assay system, Ebf1-KO HPC, to test for B-lineage promoting factors. The manuscript overall takes on an interesting perspective rarely employed for the analysis of miRNA by overexpressing the miRNA of interest. Ideally, this approach may reveal, if not the physiological function of this miRNA, the role of distinct pathways in developmental processes. 

      Weaknesses: 

      At the same time, this approach constitutes a major weakness: It does not reveal information on the physiological role of miR-195. In fact, the authors themselves demonstrate in their KO approach, that miR-195 has virtually no role in B-cell development, as has been demonstrated already in 2020 by Hutter and colleagues. While the authors cite this paper, unfortunately, they do so in a different context, hence omitting that their findings are not original. 

      Conceptually, the authors stress that a predominant function of miRNA (in contrast to transcription factors, as the authors suggest) lies in fine-tuning. However, there appears to be a misconception. Misregulation of fine-tuning of gene expression may result in substantial biological effects, especially in developmental processes. The authors want to highlight that miR-195 is somewhat of an exception in that regard, but this is clearly not the case. In addition to miR-150, as referenced by the authors, also the miR-17-92 or miR-221/222 families play a significant role in B-cell development, their absence resulting in stage-specific developmental blocks, and other miRNAs, such as miR-155, miR-142, miR-181, and miR-223 are critical regulators of leukocyte development and function. Thus, while in many instances a single miRNA moderately affects gene expression at the level of an individual target, quite frequently targets converge in common pathways, hence controlling critical biological processes. 

      The paper has some methodological weaknesses as well: For the most part, it lacks thorough statistical analysis, and only representative FACS plots are provided. Many bar graphs are based on heavy normalization making the T-tests employed inapplicable. No details are provided regarding the statistical analysis of microarrays. Generation of the miR-195-KO mice is insufficiently described and no validation of deletion is provided. Important controls are missing as well, the most important one being a direct rescue of Ebf1-KO cells by re-expression of Ebf1. This control is critical to quantify the extent of override of Ebf1-deficiency elicited by miR-195 and should essentially be included in all experiments. A quantitative comparison is essential to support the authors' main conclusion highlighted in the title of the manuscript. As the manuscript currently stands, only negative controls are provided, which, given the profound role of Ebf1, are insufficient, because many experiments, such as assessment of V(D)J recombination, IgM surface expression, or class-switch recombination, are completely negative in controls. In addition, the authors should also perform long-term reconstitution experiments. While it is somewhat surprising that the authors obtained splenic IgM+ B cells after just 10 days, these experiments would be certainly much more informative after longer periods of time. Using "classical" mixed bone marrow chimeras using a combination of B-cell defective (such as mb1/mb1) bone marrow and reconstituted Ebf1-KO progenitors would permit much more refined analyses. 

      With regard to mechanism, the authors show that the Foxo1 phosphorylation pathway accounts for the rescue of CD19 expression, but not for other factors, as mentioned in the discussion. The authors then resort to epigenetics analysis, but their rationale remains somewhat vague. It remains unclear how miR-195 is linked to epigenetic changes. 

      Reviewer #3 (Public review): 

      Summary: 

      In this study, Miyatake et al. present the interesting finding that ectopic expression of miR-195 in EBF1-deficient hematopoietic progenitor cells can partially rescue their developmental block and allow B cells to progress to a B220+ CD19+ cells stage. Notably, this is accompanied by an upregulation of B-cell-specific genes and, correspondingly, a downregulation of T, myeloid, and NK lineage-related genes, suggesting that miR-195 expression is at least in part equivalent to EBF1 activity in orchestrating the complex gene regulatory network underlying B cell development. Strengthening this point, ATAC sequencing of miR-195-expressing EBF1-deficient B220+CD19+ cells and a comparison of these data to public datasets of EBF1-deficient and -proficient cells suggest that miR-195 indirectly regulates gene expression and chromatin accessibility of some, but not all regions regulated by EBF1. 

      Mechanistically, the authors identify a subset of potential target genes of miR-195 involved in MAPK and PI3K signaling. Dampening of these pathways has previously been demonstrated to activate FOXO1, a key transcription factor for early B cells downstream of EBF1. Accordingly, the authors hypothesize that miR-195 exerts its function through FOXO1. Supporting this claim, also exogenous FOXO1 expression is able to promote the development of EBF1-deficient cells to the B220+CD19+ stage and thus recapitulates the miR-195 phenotype. 

      Strengths: 

      The strength of the presented study is the detailed assessment of the altered chromatin accessibility in response to ectopic miR-195 expression. This provides insight into how miR-195 impacts the gene regulatory network that governs B-cell development and allows the formation of mechanistic hypotheses. 

      Weaknesses: 

      The key weakness of this study is that its findings are based on the artificial and ectopic expression of a miRNA out of its normal context, which in my opinion strongly limits the biological relevance of the presented work. 

      While the authors performed qPCRs for miR-195 on different B cell populations and show that its relative expression peaks in early B cells, it remains unclear whether the absolute miR-195 expression is sufficiently high to have any meaningful biological activity. In fact, other miRNA expression data from immune cells (e.g. DOI 10.1182/blood-2010-10-316034 and DOI 10.1016/j.immuni.2010.05.009) suggest that miR-195 is only weakly, if at all, expressed in the hematopoietic system.

      The authors support their finding by a CRISPR-derived miR-195 knockout mouse model which displays mild, but significant differences in the hematopoietic stem cell compartment and in B cell development. However, they fail to acknowledge and discuss a lymphocyte-specific miR-195 knockout mouse that does not show any B cell defects in the bone marrow or spleen and thus contradicts the authors' findings (DOI 10.1111/febs.15493). Of note, B-1 B cells in particular have been shown to be elevated upon loss of miR-15-16-1 and/or miR-15b-16-2, which contradicts the data presented here for loss of the family member miR-195.

      A second weakness is that some claims by the authors appear overstated or at least not fully backed up by the presented data. In particular, the findings that miR-195-expressing cells can undergo VDJ recombination, express the pre-BCR/BCR, and class switch need to be strengthened. It would be beneficial to include additional controls in these experiments, e.g. a RAG-deficient mouse as a reference/negative control for the ddPCR and the surface IgM staining, and cells deficient in class switching for the IgG1 flow cytometric staining.

      Moreover, the manuscript would be strengthened by a more thorough investigation of the hypothesis that miR-195 promotes the stabilization and activity of FOXO1, e.g. by comparing the authors' ATACseq data to the FOXO1 signature. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      Miyatake et al. present a manuscript that explores the role of miR-195 in B cell development.

      Their data suggests a role for this microRNA: 

      Using an Ebf1 knockout fetal liver model of B-cell differentiation, they show that a small population of CD19-expressing cells, with some evidence of V(D)J recombination and capable of class switching, can be derived by transduction of miR-195.

      In the emergent CD19+ Ebf1-/- cells, the authors provide some evidence that Mapk and Akt3 may be miR-195 targets that are downregulated, and that the FOXO1 transcription factor pathway may consequently be involved in the emergence of CD19+ cells following miR-195 transduction.

      Perhaps less compelling data are provided with regard to a role for miR-195 in normal B-cell development through analysis of a miR-195 knockout model.

      While there are some interesting preliminary data presented for a role for miR-195 in the context of Ebf1-/- cells, there are some questions I think the authors could consider. 

      Comments: 

      (1-1) It is difficult to ascertain the potential role of miR-195 transduction in allowing the emergence of CD19+ cells from the data provided. miR-195 has been generally shown to destabilize mRNA transcripts by 3' UTR binding that targets mRNA transcripts for degradation. The effect of transduction of miR-195 would therefore be expected to be related to the degradation of factors opposing aspects of B-lineage specification or maintenance. I would be particularly interested in transcriptional or epigenetic regulators that may be modified in this way, at an mRNA as well as protein level.

      We appreciate the reviewerʼs thoughtful comments and agree that miRNAs often exert their effects through the degradation or translational repression of mRNAs encoding regulatory factors. In our study, we attempted to address this point by combining predictive analysis (using TargetScan and starBase) with luciferase reporter assays and qPCR to validate several potential targets of miR-195, including Mapk3 and Akt3. We acknowledge that this is not a comprehensive mechanistic analysis. We agree that a broader and systematic identification of direct targets of miR-195, particularly those involved in transcriptional and epigenetic regulation, would further clarify the mechanisms involved. However, due to limitations in resources and time, we are currently unable to perform global proteomic or ChIP-based validations. Nevertheless, our ATAC-seq and microarray data indicate that miR-195 overexpression leads to increased accessibility and expression of several key B-lineage transcription factors (Pax5, Runx1, Irf8), suggesting that miR-195 indirectly activates transcriptional programs relevant to B cell commitment. We have now clarified this limitation in the revised Discussion section (lines 505‒524), and we emphasize that our current findings represent the potential of miR-195 rather than its physiological role. We hope that this clarification addresses the concern.

      (1-2) While I acknowledge the authors have undertaken TargetScan and starBase analysis to try and predict miR-195 interactions, they do not provide a comprehensive list of putative targets that can be referenced against their cDNA data. Though they postulate Mapk3 and Akt3 as putative miR-195 targets and assay these in luciferase reporter systems (Figure 4), these were not clearly differentially regulated in the microarray data they provided (Figure 1E) as being downregulated on miR-195 transduction in Ebf1-/- cells.

      We thank the reviewer for pointing out the need for a more comprehensive list of predicted miR-195 targets. In response, we have now included Supplementary Tables 4 (human) and 5 (mouse), listing all putative miR-195 targets predicted by TargetScan and starBase. As noted, Mapk3 expression was indeed downregulated upon miR-195 transduction, consistent with our luciferase reporter and qPCR results. For Akt3, we observed variability in the microarray data depending on the probe used, resulting in inconsistent expression levels. We acknowledge this and have added a clarification in the revised manuscript (lines 335‒339), noting that the regulation of Akt3 by miR-195 is potentially probe-dependent and may require further validation. We hope this clarification resolves the concern.

      (1-3) The authors should provide a more comprehensive analysis of transcriptional changes induced by miR-195 Ebf1-/- specifically in the preproB cell stage of development in Ebf1-/- and miR-195 Ebf1-/- cells. The differentially expressed gene list should be provided as a supplemental file. The gene expression data should be provided for the different B-cell differentiation stages, eg. Ebf1-/- preproB cells, and Ebf1-/- miR-195 preproB cells, CD19+ cells and more differentiated subsets induced by miR-195 transduction.

      We appreciate the reviewerʼs suggestion to provide a more comprehensive transcriptomic analysis at different B-cell differentiation stages. Unfortunately, due to the limited availability of cells and technical constraints, we were unable to perform RNA-seq on miR-195 transduced Ebf1<sup>−/−</sup> pre-pro-B or CD19+ cells. However, to address this point, we referenced publicly available RNA-seq data (GEO accession: GSE92434), which includes transcriptomic profiles of Ebf1<sup>−/−</sup> pro-B cells and wild-type controls. By comparing our microarray data from miR-195 transduced Ebf1<sup>−/−</sup> cells with this dataset, we found partial restoration of expression for several key B-lineage genes, such as Pax5, Runx1, and Irf8, which are normally downregulated in the absence of EBF1. This comparison supports the notion that miR-195 partially reactivates the transcriptional network essential for B cell development. We have added this interpretation to the Discussion section (lines 528‒533).

      (1-4) More replicates (at least 3 of each genotype) are required for their Western Blots for FOXO1 and pFOXO1 (Fig 4C, D). Western blots should also be provided for other known B-lineage transcriptional regulators such as PAX5 and ERG.

      We thank the reviewer for these valuable suggestions. In response, we have now quantified and added the relative band intensities of FOXO1 and pFOXO1 from three independent experiments in the revised Figure 4C, and we include statistical analysis to support the reproducibility of these results. Additionally, as requested, we performed western blotting for PAX5 and ERG using the same samples. The results showed no significant change in these protein levels between miR-195-transduced and control Ebf1<sup>−/−</sup> cells, consistent with the modest upregulation observed in our microarray data. We have included the PAX5 and ERG western blot images in Supplementary Figure S3 and have revised the text in the Results section (lines 351‒35)

      (1-5) The authors have not shown transcription factor binding by ChIP-seq or other methods such as CUT&Tag or CUT&RUN for FOXO1 at B-lineage genes in their Ebf1-/- miR-195 CD19+ cells. Without demonstrating direct binding to the cis-regulatory regions of "upregulated" genes in these cells, they cannot definitively show that this TF is critical for the emergence of the CD19+ cell phenotype.

      We appreciate the reviewerʼs suggestion regarding the use of ChIP-seq or related methods to demonstrate direct FOXO1 binding to cis-regulatory regions of B-lineage genes in Ebf1<sup>−/−</sup> miR-195 CD19⁺ cells. We agree that such data would provide definitive evidence of FOXO1's direct involvement in promoting the B cell-like transcriptional program. However, due to current technical limitations, including the scarcity of CD19⁺ cells derived from Ebf1<sup>−/−</sup> miR-195 transduction and the requirement for large cell numbers in ChIP-seq or CUT&RUN protocols, we were unable to perform these assays in this study. Nevertheless, our current data provide multiple lines of indirect evidence supporting the involvement of FOXO1:

      miR-195 transduction leads to reduced phosphorylation and increased accumulation of FOXO1 protein (Fig. 4C).

      Overexpression of FOXO1 in Ebf1<sup>−/−</sup> HPCs partially recapitulates the miR-195 phenotype (Fig. 4D).

      ATAC-seq data show increased chromatin accessibility at known FOXO1 target gene loci (e.g., Pax5, Runx1, Irf8) in miR-195-induced CD19⁺ cells, many of which overlap with FOXO1 motifs (Fig. 5).

      These observations collectively suggest that FOXO1 activity is functionally important for the emergence of CD19⁺ cells, even though direct binding has not been confirmed. We have added this limitation to the Discussion (lines 531‒537), and we note that future studies using FOXO1 CUT&RUN in this system would be valuable to further define the underlying mechanism.

      (1-6) The authors have not shown significant upregulation of expression of other critical B-cell regulatory transcription factors in their Ebf1-/- miR-195 CD19+ cells that could account for the emergence of these cells such as Pax5 or Erg. The legend in Figure 1E suggests for example the change in expression of Pax5 is modest if anything at best as no LogFC or western blot data is presented. 

      We thank the reviewer for raising this point. In our microarray analysis (Figure 1D, original Figure 1E), we observed that both Pax5 and Erg mRNA levels were upregulated in Ebf1<sup>−/−</sup> cells upon miR-195 transduction. Specifically, Pax5 showed an increase of approximately log₂FC 1.2, and Erg was also consistently elevated across biological replicates. These changes, although modest, were statistically significant and consistent with the upregulation of other B-lineage-associated transcription factors, such as Runx1 and Irf8. We agree that the magnitude of Pax5 upregulation is not as high as typically seen during full B cell commitment, and therefore may not have been immediately apparent in Figure 1D (original Figure 1E). To clarify this point, we have now revised the text in the Results section (lines 170‒174) to highlight the observed changes in Pax5 and Erg expression. We believe that the upregulation of these transcription factors, together with increased FOXO1 activity and changes in chromatin accessibility (Figure 5), contributes to the partial reactivation of the B cell gene regulatory network in the absence of EBF1.
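
      For readers converting between the log₂ fold change quoted above and a linear fold change, the relationship can be sketched as follows. This is a minimal illustration only: the value 1.2 is taken from the response above, while the function names are hypothetical and not part of the authors' analysis pipeline.

```python
import math


def log2fc_to_fold(log2fc):
    """Convert a log2 fold change to a linear fold change (2 ** log2fc)."""
    return 2.0 ** log2fc


def fold_to_log2fc(fold):
    """Convert a linear fold change (> 0) back to a log2 fold change."""
    if fold <= 0:
        raise ValueError("fold change must be positive")
    return math.log2(fold)


# log2FC of about 1.2 is the Pax5 value quoted in the response above.
pax5_fold = log2fc_to_fold(1.2)
print(f"log2FC 1.2 corresponds to a {pax5_fold:.2f}-fold change")
```

      On this scale, a log₂FC of about 1.2 is a little more than a doubling of signal, which fits the description of the change as modest.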

      (1-7) Which V(D)J transcripts have been produced? A more detailed analysis other than ddPCR is required to help understand the emergence of this population that can presumably proceed through the preBCR and BCR checkpoints.

      We appreciate the reviewer's interest in understanding the nature of the V(D)J rearrangements in Ebf1<sup>−/−</sup> miR-195 CD19⁺ cells. As noted, our current data rely on droplet digital PCR (ddPCR), which was used to detect rearranged VH-JH segments in the bone marrow of engrafted mice. While this approach does not allow for detailed mapping of specific V, D, or J gene usage, it provides a sensitive and quantitative measure of V(D)J recombination activity. The detection of rearranged VH-JH fragments in miR-195-transduced Ebf1<sup>−/−</sup> cells suggests that at least partial recombination of the immunoglobulin heavy chain locus is occurring, an essential checkpoint for progression past the pro-B cell stage. Given the lack of such rearrangements in control-transduced Ebf1<sup>−/−</sup> cells, we interpret this as evidence that miR-195 enables cells to initiate the recombination process. We acknowledge the limitations of ddPCR and agree that a more detailed analysis using VDJ-seq or single-cell RNA-seq would be valuable in determining the diversity and completeness of the V(D)J transcripts produced. This is a direction we intend to pursue in future work. We have added this limitation to the Discussion section (lines 538‒543).

      (1-8) The authors reveal that the Foxo1-transduced Ebf1-/- cells (Fig. 4D) do not persist in vitro and are not detected via transplant assay (line 256), and therefore do not represent a truly "rescued" B cell, suggesting that CD19+ Ebf1-/- miR-195-transduced cells have more B-cell potential. Further characterisation of this cell population is therefore warranted. For instance, can these cells be induced to undergo myeloid differentiation in myeloid cytokine conditions? What other B-lineage transcriptional regulators are expressed in this cell population that could account for VDJ recombination and expression of a B-lineage transcriptional program (see comments 1, 3, and 5), allowing transition through the preBCR and BCR checkpoints as well as class switching?

      We thank the reviewer for this insightful comment. We agree that the persistence and lineage potential of the CD19⁺ cells emerging from Ebf1<sup>−/−</sup> miR-195-transduced progenitors deserve further characterization. Although we were unable to perform additional lineage re-direction assays, our current data provide several lines of evidence suggesting that these cells are stably committed toward the B-lineage:

      Gene expression profiling revealed upregulation of multiple B cell transcriptional regulators, including Pax5, Runx1, and Irf8.

      ATAC-seq analysis showed increased chromatin accessibility at B cell‒specific loci and enrichment of motifs bound by key B-lineage factors such as FOXO1 and E2A.

      The cells express surface IgM and undergo class switch recombination to IgG1 upon stimulation, indicating successful transition through the pre-BCR and BCR checkpoints and acquisition of mature B cell functions.

      Importantly, no upregulation of myeloid- or T-lineage genes was detected in the microarray analysis, arguing against multipotency at this stage. We acknowledge that functional tests for lineage plasticity under altered cytokine conditions would provide important insights and plan to address this question in future studies. This limitation has now been noted in the revised Discussion (lines 544‒550).

      (1-9) In the original Ebf1-/- miR-195 CD19+ experiments, a wild-type control should be provided for each experiment. 

      We appreciate the reviewer's suggestion to include wild-type controls in all experiments. While we did not include wild-type samples side-by-side in every assay, we carefully designed our experiments to include biologically appropriate and informative comparisons. For example, in the bone marrow transplantation experiments (Figure 2), Ebf1<sup>−/−</sup> cells transduced with empty vector served as negative controls, clearly lacking CD19 expression, V(D)J recombination, IgM surface expression, and class switch capability. This allowed us to specifically assess the gain-of-function effects of miR-195 in the EBF1-deficient background. In several analyses, such as the ATAC-seq and microarray comparisons, we did incorporate or refer to existing wild-type datasets (e.g., GSE92434), providing context for the extent of recovery toward a WT-like profile. We agree, however, that including parallel WT controls across all experimental platforms would enhance interpretability.

      (1-10) For ATACseq data, a comparison between Ebf1-/- preproB cells and Ebf1-/- miR-195 CD19+ cells should be undertaken.

      We thank the reviewer for this important point. As suggested, we have performed a direct comparison of chromatin accessibility between Ebf1<sup>−/−</sup> pre-pro-B‒like cells (CD19<sup>−</sup>, control transduction) and Ebf1<sup>−/−</sup> miR-195‒transduced CD19⁺ cells. This comparison is shown in green in Figure 5B and represents the ATAC-seq peaks differentially accessible between these two populations.

      (1-11) I cannot agree with the authors with some of their statements such as Line 242 - "therefore miR-195 considered to have similar function with EBF1 to some extent" - how can this be the case when miR-195 is a miRNA and EBF1 is a transcription factor with pioneering transcriptional activity? Surely the effects of miR-195 must be secondary.

      We thank the reviewer for pointing out the inappropriateness of comparing miR-195 to EBF1 in terms of functional similarity. We agree that miR-195, as a microRNA, operates through post-transcriptional regulation and does not possess the pioneering transcriptional activity characteristic of EBF1. To avoid confusion or overstatement, we have removed the sentence in line 242 ("therefore miR-195 is considered to have similar function with EBF1 to some extent").

      (1-12) It is unclear whether this observation is in fact physiological. When the authors analyse a knockout model of miR-195, there is not much of a change in the B-cell phenotype. Their findings may therefore be an artefact of an overexpression system. The authors should comment on this observation in their discussion.  

      We thank the reviewer for this important observation. We agree that the mild phenotype observed in our miR-195 knockout mice suggests that miR-195 is not essential for B cell development under steady-state physiological conditions. Accordingly, we do not claim a physiological requirement for miR-195. Rather, our study demonstrates that miR-195 possesses the potential to activate a B-lineage program in the absence of EBF1 when ectopically expressed. This functional potential, rather than its endogenous necessity, is the main focus of our work. We have now clarified this distinction in the revised Discussion section (lines 551‒560), and we emphasize that our findings highlight an alternative regulatory pathway that can be artificially engaged under specific conditions.

      (1-13) I recommend the authors check spelling and grammar throughout their manuscript.

      We thank the reviewer for the suggestion. In response, we have carefully reviewed the manuscript for spelling, grammar, and clarity. Minor corrections have been made throughout the text to improve readability and ensure consistency. We hope that the revised version addresses any language-related concerns. In addition, the manuscript has been reviewed by a professional editing service to improve the language quality.

      (1-14) In general, I recommend more comprehensive primary data be presented in the manuscript or supplementary files to add value to their submission.

      We thank the reviewer for this helpful suggestion. In response, we have revised the manuscript and supplementary materials to include additional primary data wherever possible. The bar graphs have been updated to include individual data points to show variability and replicate information. Uncropped western blot images are now provided in Supplementary Figure S2. We hope these additions provide greater transparency and value to the manuscript. 

      Reviewer #2 (Recommendations for the authors): 

      I have a number of suggestions with regard to inclusion of details and controls: 

      (2-1) The authors need to provide more details on in vitro differentiation, especially culture times. 

      Thank you for your comment. The culture conditions for in vitro differentiation of Ebf1<sup>−/−</sup> hematopoietic progenitor cells are described in the Methods section (lines 648‒649) under “Culture of lineage-negative (Lin‒) cells from the fetal liver.” As stated, cells were cultured for more than 7 days under the specified conditions.

      (2-2) In Figure 1E, the authors need to provide information on statistics (FDR or similar). 

      We thank the reviewer for the suggestion. In Figure 1D (original Figure 1E; the microarray analysis), only two biological replicates were available for each condition (n = 2 per group). Due to this limited sample size, we did not perform statistical testing, as the power would be insufficient to produce reliable p-values or adjusted FDRs. Instead, we focused on genes with consistent and biologically meaningful changes in expression, and presented representative examples based on fold change values.

      (2-3) For in vivo experiments (Figure 2) the authors should comment on their use of two different recipient mouse strains despite very low n numbers. As described above, classical mixed BM chimeras would be much more informative. In these experiments, the authors should also show the formation of other lymphoid lineages. This would answer the question of whether miR-195 redirects cells to the B lineage. Most importantly, absolute numbers need to be provided, especially in conjunction with Ebf1 rescue as described above. 

      We thank the reviewer for the thoughtful and detailed suggestions regarding our in vivo experiments. Regarding the use of different recipient mouse strains, our initial intention was to perform the transplantations in BRG mice; however, due to facility restrictions and animal husbandry considerations, we had to switch to NOG mice. All in vivo experiments were performed with n = 3 per group, in accordance with ethical guidelines and efforts to minimize animal use while still ensuring reproducibility. With respect to the suggestion of mixed bone marrow chimeras, we agree that this approach can provide valuable information on lineage competitiveness. However, in our system, miR-195 confers only a very limited B cell developmental potential in Ebf1<sup>−/−</sup> progenitors. In such a setting, the inclusion of wild-type competitor cells would overwhelmingly dominate the B cell compartment, likely masking any measurable effect of miR-195. Therefore, we opted to assess the gain-of-function potential of miR-195 in a noncompetitive setting. Regarding the assessment of other lymphoid lineages, we focused our analysis on the emergence of B-lineage cells, as the frequency of CD19⁺ cells induced by miR-195 is quite low. Given this low efficiency, we consider it unlikely that miR-195 significantly alters the development of non-B lineages, and thus did not observe substantial lineage diversion effects. Our aim was not to demonstrate lineage redirection, but rather to show that miR-195 can confer partial B cell potential in the absence of EBF1.

      Finally, we acknowledge the importance of presenting absolute cell numbers. However, so few cells were recovered from the mice that we could not obtain reliable results; we have described this in the manuscript (lines 498-501).

      (2-4) The statistics in Figure 3 are inadequate. No S.D. is provided for WT. How then was normalization performed? Student's T-test cannot be applied to ratios. 

      We thank the reviewer for highlighting the need for more appropriate statistical analysis. Due to considerable inter-batch variability in absolute measurements, we normalized the KO values to their paired WT counterparts from the same experimental batch. Specifically, for each replicate, we calculated the KO/WT ratio to control for batch-specific variation. We then applied a one-sample t-test (against a null hypothesis of ratio = 1) to determine statistical significance. We have now revised the figure to show individual ratio values for each replicate and updated the legend and Methods to clearly explain the statistical approach. We hope this addresses the concern and improves the clarity and rigor of the analysis.
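
      The normalization and test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' actual analysis code: the function names and the example measurements are hypothetical, and only the t statistic and degrees of freedom are computed (a p-value would normally be obtained from a statistics package or a t table).

```python
import math
import statistics


def paired_ratios(ko_values, wt_values):
    """Return the KO/WT ratio for each batch; inputs are paired by batch."""
    if len(ko_values) != len(wt_values):
        raise ValueError("KO and WT values must be paired by batch")
    return [ko / wt for ko, wt in zip(ko_values, wt_values)]


def one_sample_t(values, mu0=1.0):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == mu0."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    t_stat = (mean - mu0) / (sd / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical measurements from three batches (not data from the study).
ko = [8.0, 12.0, 9.0]
wt = [10.0, 16.0, 10.0]

ratios = paired_ratios(ko, wt)
t_stat, df = one_sample_t(ratios, mu0=1.0)
print(f"ratios = {ratios}, t = {t_stat:.3f}, df = {df}")
```

      Because each ratio is formed within a batch, batch-to-batch differences in absolute values cancel before testing. Ratios are often skewed, so a common variant (again an assumption here, not something stated in the response) is to test the log-transformed ratios against 0 instead of the raw ratios against 1.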

      (2-5) In Figure 4A, the authors should comment on the strong repression of the Akt3UTR. 

      We appreciate the reviewer's observation regarding the strong repression observed with the Akt3 3'UTR construct. Indeed, we also noted that luciferase activity was markedly reduced in the presence of the Akt3 3'UTR, even in cells transduced with a control vector. We hypothesize that the Akt3 3'UTR contains strong post-transcriptional regulatory elements, such as AU-rich elements or binding sites for endogenous miRNAs or RNA-binding proteins, which may suppress mRNA stability or translation independent of miR-195. Alternatively, the secondary structure or length of the UTR may inherently reduce luciferase expression. We have added this limitation to the Discussion section (lines 561‒569).

      (2-6) The Western blot in Figure 4C is of insufficient quality. The authors need to provide unspliced versions of the bands including markers. 

      We thank the reviewer for this important comment. In response, we have included the unprocessed, full-length Western blot images corresponding to Figure 4C as Fig. S2. This provides a transparent view of the original data and addresses the concern about image cropping.

      (2-7) The ATACseq experiment in Figure 5 is difficult to comprehend. A simpler design including Ebf1 rescue controls would clearly improve this part. 

      We thank the reviewer for this valuable feedback. We agree that the original presentation of the ATAC-seq data may have been difficult to interpret. To address this, we have included a clear interpretation of the overlapping regions in the revised figure legend (lines 1018-1022). We hope this improves the clarity of the data and facilitates understanding of the chromatin changes mediated by EBF1 and miR-195.

      (2-8) The miR-195 KO mouse lacks validation (RT-PCR, genomic PCR) as well as a clear description of the deleted region and whether miR-497 is affected. In addition, the genetic background and number of backcrosses for the removal of potential off-target effects need to be mentioned. 

      We thank the reviewer for this important comment. The miR-195 knockout mouse was generated via CRISPR/Cas9, and Sanger sequencing confirmed a 628 bp deletion on chromosome 11 (GRCm38/mm10 chr11:70,234,425‒70,235,103). This deletion includes the entire miR-497 locus and part of the miR-195 precursor sequence. Although we do not show PCR gel images, the deletion was validated by sequencing, and the results are now clearly described in the revised Methods section (lines 607‒619). All transgenic mice in this study were backcrossed to the C57BL/6 background for at least eight generations.

      (2-9) The manuscript requires extensive editing for language. 

      We appreciate the reviewerʼs comment. The manuscript has now been revised and professionally edited for language by a native English-speaking editor. We believe clarity and readability have been significantly improved.

      Reviewer #3 (Recommendations for the authors): 

      (3-1) What is the expression level of miR-195 after viral overexpression? In Figure 4B, the authors show a 2.5-fold increase, but this appears very low for the experimental system (expression through the MDH1 retroviral construct) and the observed repressive effects (e.g. Figure 4A and B). 

      We thank the reviewer for this insightful comment. We agree that the apparent ~2.5-fold increase in miR-195 levels (Figure 4B) may seem modest in the context of retroviral overexpression and the associated functional effects. However, due to the high sequence similarity within the miR-15/16/195/497 family, it is technically challenging to measure mature miR-195 levels with complete specificity. The baseline signal observed in control samples likely reflects cross-reactivity with endogenous miRNAs such as miR-497 or miR-16, which share similar seed sequences. Therefore, the reported fold-change may underestimate the true level of ectopic miR-195 expression. Despite this, we observed robust repression of validated targets (e.g., Mapk3, Akt3) in both qPCR and luciferase assays, indicating that functionally effective levels of miR-195 were achieved. We have now clarified this limitation and interpretation in the revised Results section (lines 332‒335).

      (3-2) In alignment with the transparency of the data, I would encourage the authors to display the individual data points for all bar graphs. 

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we have updated bar graphs to include individual data points to increase transparency and allow better visualization of data variability. In the ddPCR experiments, we provided the raw data in Fig. S1 for full transparency. For Fig. 1A, we checked the miR-195 expression profiles using the deposited data that the reviewer suggested, but miR-195 expression was much lower than we expected. We also performed scRNA-seq on hematopoietic lineage cells from 8-week-old C57BL/6 mice, but could not reproduce the miR-195 expression profiles. We therefore concluded that the original result was an artifact caused by the miR-195 probe used for qPCR, and deleted Fig. 1A.

      (3-3) The references appear to be compromised. For example, the authors state that "The Ebf1−/+ mouse was originally generated by R. Grosschedl (39)" (line 297), but this is not the respective paper. Likewise, the knockout mouse was generated "based on the CRISPR/Cas9 system established by C. Gurumurthy (40)" (line 299), but he/she is not involved in the referenced study. 

      We thank the reviewer for pointing out the discrepancies in the reference citations. When we revised the Methods section to integrate it with the main text, the reference numbering became misaligned. We have corrected the references in the revised manuscript, and we thank the reviewer for bringing this to our attention.

      (3-4) Given that the miRNA TaqMan assays the authors used here have difficulty discriminating closely related miRNAs, e.g. miR-16 (highly expressed in the hematopoietic system) and miR-195, I would suggest that the authors test their qPCR in an appropriate setup, e.g. in their knockout mouse model. In this context, did the authors use another small RNA as a reference for the qPCR analysis? In the methods, only GAPDH is mentioned, but in my opinion, another RNA that uses the same stem-loop-based cDNA synthesis protocol would be better suited.

      We thank the reviewer for this valuable and technically insightful comment.

      As correctly pointed out, TaqMan-based qPCR assays for miRNAs such as miR-195 can show cross-reactivity with closely related family members, particularly miR-16, which is abundantly expressed in hematopoietic cells. Indeed, due to this limitation, we do not treat the qPCR results shown in the original Figures 1A and 4B as definitive quantification of miR-195 expression. Rather, these data are used to provide an indication and a rough estimate of overexpression efficiency, while our core functional analyses rely on phenotypic and molecular outcomes such as target gene repression and lineage emergence. With this in mind, although we acknowledge that a small RNA reference based on the same stem-loop cDNA synthesis would offer a more compatible normalization in principle, the inherent variability and lack of absolute specificity in such assays also limit their interpretive value. Therefore, we used GAPDH as a normalization control for consistency with other qPCR analyses in the manuscript. We have now clarified this rationale and limitation in the revised Methods section (lines 712‒716), and we thank the reviewer again for highlighting this important technical consideration.

      (3-5) The Western blot data used to support the hypothesis that FOXO1 phosphorylation is reduced upon overexpression of miR-195 are not convincing. The authors should not crop everything but the band. 

      We thank the reviewer for the helpful comment. In response, we have now provided the full-length, uncropped Western blot images corresponding to Figure 4C, including both total FOXO1 and phospho-FOXO1 blots. These images are included in Fig. S2.

    1. Author response:

      The following is the authors’ response to the original reviews

      Comment from the editors at eLife:

      You could consider further strengthening the manuscript with the incorporation of new relevant public datasets for network modeling, but that is entirely your choice.

      We thank the editors and reviewers for their thoughtful and positive feedback on our article. We are particularly appreciative of the eLife assessment describing our work as valuable with a convincing methodology.

      As suggested, we have expanded our neuron class analysis by incorporating transcriptomic data from young adult animals (Kaletsky et al., 2016 Nature; Ghaddar et al., 2023 Science Advances; St Ange et al., 2024 Cell Genomics) to complement our existing analysis of larval stage 4 (L4) animals.

      In addition, we have updated Table S1 to include the outcross status of all strains used in this study, providing clearer information on the genotypes tested. We have also corrected the typographical errors noted by the reviewers. Please note that page and line numbers below refer to the MS Word Document with tracked changes set to ‘simple markup’.

      We greatly appreciate the reviewers’ input and hope these revisions further enhance the value and clarity of our study.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Rahmani et al. utilize the TurboID method to characterize global proteome changes in the worm's nervous system induced by a salt-based associative learning paradigm. Altogether, they uncover 706 proteins tagged by the TurboID method in worms that underwent the memory-inducing protocol. Next, the authors conduct a gene enrichment analysis that implicates specific molecular pathways in salt-associative learning, such as MAP kinase and cAMP-mediated pathways, as well as specific neuronal classes including pharyngeal neurons, and specific sensory neurons, interneurons, and motor neurons. The authors then screen a representative group of hits from the proteome analysis. They find that mutants of candidate genes from the MAP kinase pathway, namely dlk-1 and uev-3, do not affect performance in the learning paradigm. Instead, multiple acetylcholine signaling mutants, as well as a protein-kinase-A mutant, significantly affected performance in the associative memory assay (e.g., acc-1, acc-3, lgc-46, and kin-2). Finally, the authors demonstrate that protein-kinase-A mutants, as well as acetylcholine signaling mutants, do not exhibit a phenotype in a related but distinct conditioning paradigm (aversive salt conditioning), suggesting their effect is specific to appetitive salt conditioning.

      Overall, the authors addressed the concerns raised in the previous review round, including the statistics of the chemotaxis experiments and the systems-level analysis of the neuron class expression patterns of their hits. I also appreciate the further attempt to equalize the sample size of the chemotaxis experiments and the transparent reporting of the sample size and statistics in the figure captions and Table S9. The new results from the panneuronal overexpression of the kin-2 gain-of-function allele also contribute to the manuscript. Together, these make the paper more compelling. The additional tested hits provide a comprehensive analysis of the main molecular pathways that could have affected learning. However, the revised manuscript includes more information and analysis, raising additional concerns.

      Major comments:

      As reviewer 4 noted, and as also shown to be relevant for C30G12.6 presented in Figure 6, the backcrossing of the mutants is important, as background mutations may lead to the observed effects. Could the authors add to Table 1, sheet 1, the outcrossing status of the tested mutants?

      We appreciate this important point. A column has now been added to Table S1 to indicate the outcross status of all strains used in this study. Additionally, we have updated the table legend on page 77 to clarify how to interpret the information provided in this column.

      It is important to validate that the results of the positive hits (where learning was affected), such as acc-1, acc-3, and lgc-46, do not stem from background mutations.

      While we agree that confirming the absence of background mutations is important, we have taken alternative steps to address this concern:

      - The outcross status of each strain is now clearly indicated in Table S1.

      - Observed phenotypes were consistent across multiple biological replicates over extended periods (months, sometimes years), reducing the likelihood that results stem from background mutations.

      We believe these measures provide confidence in the validity of our findings.

      The fold change in the number of hits for different neurons in the CENGEN-based rank analysis requires a statistical test (discussed on pages 17-19 and summarized in Table S7). Similar to the other gene enrichment analyses presented in the manuscript, the new rank analysis also requires a statistical test. Since the authors extensively elaborate on the results from this analysis, I think a statistical analysis is especially important for its interpretation. For example, if considering the IL1 neurons, which ranked highest, and assuming random groups of genes, each having the same size as those of the ranked neurons (209 genes in total for IL1 in Table S7), how common would it be to get the calculated fold change of 1.38 or higher? Such bootstrapping analysis is common for enrichment analysis. Perhaps the authors could consult with an institutional expert (Dr. Pawel Skuza, Flinders University) for the statistical aspects of this analysis.
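
      The resampling test described in this comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' or the reviewer's actual procedure: the fold-change definition (fraction of hits expressed in a neuron class divided by the fraction of a background gene set expressed there), the gene identifiers, and the set sizes are all hypothetical, and the choice of background gene set is left open in the review.

```python
import random


def fold_change(hits, neuron_genes, background):
    """Hit fraction in the neuron class divided by the background fraction.

    This definition is an assumption made for illustration.
    """
    hit_frac = len(hits & neuron_genes) / len(hits)
    bg_frac = len(background & neuron_genes) / len(background)
    return hit_frac / bg_frac


def empirical_p_value(hits, neuron_genes, background, n_resamples=2000, seed=0):
    """Share of random gene sets (same size as `hits`) whose fold change
    is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = fold_change(hits, neuron_genes, background)
    pool = sorted(background)  # fixed order keeps the resampling reproducible
    at_least_as_large = 0
    for _ in range(n_resamples):
        random_set = set(rng.sample(pool, len(hits)))
        if fold_change(random_set, neuron_genes, background) >= observed:
            at_least_as_large += 1
    # Add-one correction so the estimate is never exactly zero.
    p_value = (at_least_as_large + 1) / (n_resamples + 1)
    return observed, p_value


# Toy data: 1000 background genes, 200 expressed in the neuron class,
# and 50 hits of which 30 fall in that class (all hypothetical).
background = {f"g{i}" for i in range(1000)}
neuron_genes = {f"g{i}" for i in range(200)}
hits = {f"g{i}" for i in range(30)} | {f"g{i}" for i in range(500, 520)}

observed_fc, p_value = empirical_p_value(hits, neuron_genes, background)
print(f"fold change = {observed_fc:.2f}, empirical p = {p_value:.4f}")
```

      Repeating this per neuron class would give one empirical p-value per row of a ranking table, which would then call for a multiple-comparison correction.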

      We appreciate the suggestion and agree that statistical testing can be valuable for enrichment analyses. However, implementing additional tests such as bootstrapping is beyond the scope of this study. Our aim was to provide a descriptive overview rather than inferential statistics. To ensure transparency and interpretability, we have:

      - Clearly reported fold changes and rankings in Table S7.

      - Discussed the limitations of this approach in the manuscript text (page 18, lines 17–20).

      - Clearly outlined the methods used to perform this analysis (pages 53–54).

      We believe this descriptive analysis provides sufficient context for interpreting these results.

      The learning phenotypes from Figure S8, concerning acc-1, acc-3, and lgc-46 mutants, are summarized in a scheme in Figure 4; however, the chemotaxis results are found in the supplemental Figure S8. Perhaps I missed the reasoning, but for transparency, I think the relevant Figure S8 results should be shown together with their summary scheme in Figure 4.

      Thank you for this suggestion to improve clarity. We have now moved the panels corresponding to cholinergic signalling components from Figure S8 into Figure 4 on page 21, so that the summary scheme and underlying data are presented together. The figure legends and main text have been updated accordingly to reflect the correct figure numbers.

      Reviewer #2 (Public review):

      Summary:

      In this study by Rahmani and colleagues, the authors sought to define the "learning proteome" for a gustatory associative learning paradigm in C. elegans. Using a cytoplasmic TurboID expressed under the control of a pan-neuronal promoter, the authors labeled proteins during the training portion of the paradigm, followed by proteomics analysis. This approach revealed hundreds of proteins potentially involved in learning, which the authors describe using gene ontology and pathway analysis. The authors performed functional characterization of over two dozen of these genes for their requirement in learning using the same paradigm. They also compared the requirement for these genes across various learning paradigms and found that most hits they characterized appear to be specifically required for the training paradigm used for generating the "learning proteome".

      Strengths:

      The authors have thoughtfully and transparently designed and reported the results of their study. Controls are carefully thought-out, and hits are ranked as strong and weak. By combining their proteomics with behavioral analysis, the authors also highlight the biological significance of their proteomics findings, and support that even weak hits are meaningful.

      The authors display a high degree of statistical rigor, incorporating normality tests into their behavioral data which is beyond the field standard.

      The authors include pathway analysis that generates interesting hypotheses about processes involved in learning and memory.

      The authors generally provide thoughtful interpretations for all of their results, both positive and negative, as well as any unexpected outcomes.

      Weaknesses:

      - The authors use the CeNGEN single-cell transcriptomic atlas to predict where the proteins in the "learning proteome" are likely to be expressed, and use these data to identify neurons that are likely significant to learning and to build a hypothetical circuit. This is an excellent idea; however, the CeNGEN dataset only contains transcriptomic data from juvenile L4 animals, while the authors performed their proteome experiments in Day 1 adult animals. It is well documented that the C. elegans nervous system transcriptome is significantly different between these two stages (Kaletsky et al., 2016, St. Ange et al., 2024), so the authors might be missing important expression data, resulting in inaccurate or incomplete networks. The adult neuronal single-cell atlas data (https://cestaan.princeton.edu/) would be better suited to incorporate into neuronal expression analysis.

      Thank you for highlighting this important point. We have now incorporated transcriptomic data from young adult animals to complement the L4-based CeNGEN dataset. Specifically, we integrated data from CeSTAAN (https://cestaan.princeton.edu/, including St. Ange et al., 2024) and WormSeq (https://wormseq.org/, including Ghaddar et al., 2023), as outlined below. Importantly, CeSTAAN and WormSeq provide data for 79 and 104 neuron classes, respectively (compared to 128 from CeNGEN); for this reason, the main analysis focuses on CeNGEN due to its broader coverage, with additional datasets noted in brackets for completeness. This is stated on page 18, lines 15–17 to ensure transparency regarding our rationale.

      The main text has been updated to describe these datasets and their integration into our analysis (pages 18–20), and further details on how these resources were used have been added to the Experimental Procedures (pages 53–54).

      We also incorporated data from Kaletsky et al. (2016) and St. Ange et al. (2024) into our neuron identity checks for all assigned and unassigned hits (page 16, lines 8–19). This analysis shows that the nervous system is highly represented in our proteome data: 75–87% of assigned hits and 75–83% of all hits correspond to neuron-enriched genes identified by St. Ange et al. and Kaletsky et al.

      In addition, we used several transcriptomic databases to confirm that learning regulators identified in this study through TurboID and validation experiments are expressed in the same neuron classes as suggested by CeNGEN (page 36).

      - The authors offer many interpretations for why mutants in "learning proteome" hits have no detectable phenotype, which is commendable. They are, however, overlooking another important interpretation: it is possible that these changes to the proteome are important for memory, which is dependent upon translation and protein level changes, and is molecularly distinct from learning. It is well established in the field that mutating or knocking down memory regulators in other paradigms will often have no detectable effect on learning. Incorporating this interpretation into the discussion and highlighting it as an area for future exploration would strengthen the manuscript.

      Thank you for this suggestion. We have incorporated this interpretation into the Results section (page 31, lines 17–23), specifying the potential role of these proteomic changes in memory encoding and retention, which are molecularly distinct from learning.

      - A minor weakness: in the discussion, the authors state that Lakhina et al. (2015) used RNA-seq to assess memory transcriptome changes. This study used microarray analysis.

      This has been corrected on page 38, line 5.

      Significance:

      The approach used in this study is interesting and has the potential to further our knowledge about the molecular mechanisms of associative behaviors. There have been multiple transcriptomic studies in the worm looking at gene expression changes in the context of behavioral training. This study complements and extends those studies by examining how the proteome changes in a different training paradigm. The approach could be employed for multiple different training paradigms, presenting a new technical advance for the field. This paper would be of interest to the broader field of behavioral and molecular neuroscience. Though it uses an invertebrate system, many findings in the worm regarding learning and memory translate to higher organisms, making this paper of interest and significant to the broader field of behavioral neuroscience.

      Reviewer #4 (Public review):

      Summary:

      In this manuscript, authors used a learning paradigm in C. elegans; when worms are fed on a saltless plate, their chemotaxis to salt is greatly reduced. To identify learning-related proteins, authors employed nervous system-specific transcriptome analysis to compare whole proteins in neurons between high-salt-fed animals and saltless-fed animals. Authors identified "learning-specific proteins" which are observed only after saltless feeding. They categorized these proteins by GO analyses, pathway analyses and expression site analyses, and further stepped forward to test mutants in selected genes identified by the proteome analysis. They find several mutants that are defective or hyper-proficient for learning, including acc-1/3 and lgc-46 acetylcholine receptors, F46H5.3 putative arginine kinase, and kin-2, a cAMP pathway gene. These mutants were not previously reported to have abnormality in the learning paradigm.

      Concerns:

      Upon revision, authors addressed all concerns of this reviewer, and the results are now presented in a way that facilitates objective evaluation. Authors' conclusions are supported by the results presented, and the strength of the proteomics approach is persuasively demonstrated.

      Thank you, we appreciate this positive feedback.

      Significance:

      (1) Total neural proteome analysis has not been conducted before for learning-induced changes, though transcriptome analysis has been performed for odor learning (Lakhina et al., http://dx.doi.org/10.1016/j.neuron.2014.12.029). This warrants the novelty of this manuscript, because for some genes, protein levels may change even though mRNA levels remain the same. Although in a few reports TurboID has been used in C. elegans, this is the first report of a systematic analysis of tissue-specific differential proteomics.

      (2) Authors found five mutants that have abnormality in the salt learning. These genes have not been described to have the abnormality, providing novel knowledge to the readers, especially those who work on C. elegans behavioural plasticity. In particular, involvement of acetylcholine neurotransmission has not been addressed before. Although transgenic rescue experiments have not been performed except for kin-2, and the site of action (neurons involved) has not been tested in this manuscript, it will open the avenue to further determine the way in which acetylcholine receptors, the cAMP pathway, etc. influence the learning process.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors stated in their response to reviewers that "referring to a phenotype as both a trend and non-significant may confuse readers, which was originally stated in the manuscript in two locations," and that such sentences were removed. Unfortunately, in the new text (page 28, lines 18-19), the authors write: "uev-3 mutants showed a lower average CI after training compared with wild-type, but this did not reach statistical significance." As stated before, I find such sentences confusing and not interpretable. If the changes are not significant, then the lower average CI is not informative.

      Thank you for pointing this out. This has been corrected to improve clarity – we say instead that “trained phenotypes between wild-type and uev-3 mutants were not statistically significant” (page 29, lines 21–22).

      In response to reviewers' comments, the authors added more information about the biotinylation efficiency of the experiment, which is also described in the text:

      Page 8, line 27: "we found that biotin exposure increased the signal 1.3-fold for non-Tg and 1.7-fold for TurboID C. elegans."

      Page 10, line 4: "Quantification of the signal within entire lanes showed a 1.1-fold increase in the 'TurboID, control' lane compared with the 'non-Tg, control' lane, and a 1.9-fold increase in the 'TurboID, trained' lane compared with the 'non-Tg, trained' lane."

      Is it common in this field not to show the actual raw quantified numbers? I was expecting either a bar graph or instead that the measured values would appear in the text alongside the fold-change information.

      Table S2 (and its table legend on page 77) have been edited to include raw area values.

      Figure 5: Typo? - "pan neuronal expression of ..." The allele number is written as 139, but I believe it should be 179, as in the rest of the paper.

      The typo has been corrected on page 25.

      The results describing the absence of a learning phenotype in backcrossed C30G12.6 are presented in the main figure. If the authors believe this is an important result, I understand keeping it in the main figure; however, I find this uncommon.

      Thank you for your comment. We consider the absence of a learning phenotype in backcrossed C30G12.6 to be an important control for interpreting the original findings, which is why we have retained it in the main figure.

      Reviewer #4 (Recommendations for the authors):

      I noted a few typos.

      (1) In Fig 5B, the transgene is depicted kin-2(ce139) but it is probably kin-2(ce179).

      The typo has been corrected on page 25.

      (2) In text, R97C and ce179 are used interchangeably, but in fact there is no description that they are identical.

      We now state the following in the manuscript: “We tested worms with the ce179 mutant allele in kin-2, in which a conserved residue in the inhibitory domain (which normally functions to keep PKA turned off in the absence of cAMP) is mutated to cause an R92C amino acid change – this results in increased PKA activity (Schade et al., 2005).” (page 25, lines 1–3).

      (3) p31 line 7, Figure S7 -> Fig S9 C-E

      We apologise for this typographical error. This figure number is meant to correspond to salt associative learning assay data (Fig. S8), not salt aversive learning (Fig. S9). Since the data from Fig. S8 was moved to Fig. 4, the figure citation has been changed from Fig. S7 (which was incorrect) to Fig. 4 (page 32, line 17).

      (4) p45 line 11, Fig S9 -> Fig S6

      The typo has been corrected (page 47, line 12).

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Syed et al. investigate the circuit underpinnings for leg grooming in the fruit fly. They identify two populations of local interneurons in the right front leg neuromere of the ventral nerve cord, i.e. 62 13A neurons and 64 13B neurons. Hierarchical clustering analysis identifies 10 morphological classes in each population. Connectome analysis reveals their circuit interactions: these GABAergic interneurons provide synaptic inhibition either between the two subpopulations, i.e. 13B onto 13A, or among each other, i.e. 13As onto other 13As, and/or onto leg motoneurons, i.e. 13As and 13Bs onto leg motoneurons. Interestingly, 13A interneurons fall into two categories: "generalists", which provide inhibition onto a broad group of motoneurons, and "specialists", which project to only a few motoneurons. Optogenetic activation and silencing of both subsets strongly affects leg grooming. Likewise, activating or silencing subpopulations, i.e. 3 to 6 elements of the 13A and 13B groups, has marked effects on leg grooming, including frequency and joint positions, and can even interrupt leg grooming. The authors present a computational model with the four circuit motifs found, i.e. feed-forward inhibition, disinhibition, reciprocal inhibition and redundant inhibition. This model can reproduce relevant aspects of the grooming behavior.

      Strengths:

      The authors succeeded in providing evidence for neural circuits interacting by means of synaptic inhibition to play an important role in the generation of a fast rhythmic insect motor behavior, i.e. grooming. Two populations of local interneurons in the fruit fly VNC comprise four inhibitory circuit motifs of neural action and interaction: feed-forward inhibition, disinhibition, reciprocal inhibition and redundant inhibition. Connectome analysis identifies the similarities and differences between individual members of the two interneuron populations. Modulating the activity of small subsets of these interneuron populations markedly affects generation of the motor behavior thereby exemplifying their important role for generating grooming. The authors carefully discuss strengths and limitations of their approaches and place their findings into the broader context of motor control.

      We thank the reviewer for their thoughtful and constructive evaluation of our work.

      Weaknesses:

      Effects of modulating activity in the interneuron populations by means of optogenetics were tested in the so-called closed-loop condition. This does not allow direct and secondary effects of the experimental modification in neural activity to be differentiated, as feedforward and feedback effects cannot be disentangled. To do so, open-loop experiments, e.g. in deafferented conditions, would be important. Given that many members of the two populations of interneurons do not show one, but two or more circuit motifs, it remains to be disentangled which role each individual circuit motif plays in the generation of the motor behavior in intact animals.

      Our optogenetic experiments show a role for 13A/B neurons in grooming leg movements – in an intact sensorimotor system - but we cannot yet differentiate between central and reafferent contributions. Activation of 13As or 13Bs disinhibits motor neurons and that is sufficient to induce walking/grooming. Therefore, we can show a role for the disinhibition motif.

      Proprioceptive feedback from leg movements could certainly affect the function of these reciprocal inhibition circuits. Given the synapses we observe between leg proprioceptors and 13A neurons, we think this is likely.

      Our previous work (Ravbar et al 2021) showed that grooming rhythms in dusted flies persist when sensory feedback is reduced, indicating that central control is possible. In those experiments, we used dust to stimulate grooming and optogenetic manipulation to broadly silence sensory feedback. We cannot do the same here because we do not yet have reagents to separately activate sparse subsets of inhibitory neurons while silencing specific proprioceptive neurons. More importantly, globally silencing proprioceptors would produce pleiotropic effects and severely impair baseline coordination, making it difficult to distinguish whether observed changes reflect disrupted rhythm generation or secondary consequences of impaired sensory input. Therefore, the reviewer is correct – we do not know whether the effects we observe are feedforward (central), feedback sensory, or both. We have included this in the revised results and discussion section to describe these possibilities and the limits of our current findings.

      Additionally, we have used a computational model to test the role of each motif separately and we show that in the results.  

      Comments on revisions:

      The careful revision of the manuscript improved the clarity of presentation substantially.

      Reviewer #2 (Public review):

      Summary:

      This manuscript by Syed et al. presents a detailed investigation of inhibitory interneurons, specifically from the 13A and 13B hemilineages, which contribute to the generation of rhythmic leg movements underlying grooming behavior in Drosophila. After performing a detailed connectomic analysis, which offers novel insights into the organization of premotor inhibitory circuits, the authors build on this anatomical framework by performing optogenetic perturbation experiments to functionally test predictions derived from the connectome. Finally, they integrate these findings into a computational model that links anatomical connectivity with behavior, offering a systems-level view of how inhibitory circuits may contribute to grooming pattern generation.

      Strengths:

      (1) Performing an extensive and detailed connectomic analysis, which offers novel insights into the organization of premotor inhibitory circuits.

      (2) Making sense of the largely uncharacterized 13A/13B nerve cord circuitry by combining connectomics and optogenetics is very impressive and will lay the foundation for future experiments in this field.

      (3) Testing the predictions from experiments using a simplified and elegant model.

      Thank you for the positive assessment of our work.

      Weaknesses:

      (1) In Figure 4-figure supplement 1, the inclusion of walking assays in dusted flies is problematic, as these flies are already strongly biased toward grooming behavior and rarely walk. To assess how 13A neuron activation influences walking, such experiments should be conducted in undusted flies under baseline locomotor conditions.

      We agree that there are better ways to assay potential contributions of 13A/13B neurons to walking. We intended to focus on how normal activity in these inhibitory neurons affects coordination during grooming, and we included walking because we observed it in our optogenetic experiments and because it also involves rhythmic leg movements. The walking data is reported in a supplementary figure because we think this merits further study with assays designed to quantify walking specifically. We will make these goals clearer in the revised manuscript and we are happy to share our reagents with other research groups more equipped to analyze walking differences.

      (2) Regarding Fig 5: The 70 ms on/off stimulation with a slow opsin seems problematic. CsChrimson off-kinetics are slow and unlikely to cause actual activity changes in the desired neurons with the temporal precision the authors are suggesting they get. Regardless, it is amazing the authors get the behavior! It would still be important for the authors to mention the optogenetics caveat, and potentially supplement the data with stimulation at different frequencies, or using faster opsins like ChrimsonR.

      We were also intrigued by the behavioral consequences of activating these inhibitory neurons with CsChrimson. We appreciate the reviewer’s point that CsChrimson’s slow off-kinetics limit precise temporal control. To address this, we repeated our frequency analysis using a range of pulse durations (10/10, 50/50, 70/70, 110/110, and 120/120 ms on/off) and compared the mean frequency of proximal joint extension/flexion cycles across conditions. We found no significant difference in frequency (LLMS, p > 0.05), suggesting that the observed grooming rhythm is not dictated by pulse period but instead reflects an intrinsic property of the premotor circuit once activated. We now include these results in ‘Figure 5—figure supplement 1’ and clarify in the text that we interpret pulsed activation as triggering, rather than precisely pacing, the endogenous grooming rhythm. We continue to note in the manuscript that CsChrimson’s slow off-kinetics may limit temporal precision. We will try ChrimsonR in future experiments.
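
      The reasoning behind this control can be made explicit with a short calculation of the stimulation frequency implied by each on/off protocol listed above. The sketch only converts the stated pulse durations into frequencies (one cycle = on time + off time); the function and variable names are illustrative, and no grooming frequencies are assumed or reproduced here.

```python
def stimulation_frequency_hz(on_ms, off_ms):
    """Frequency of a square-pulse protocol whose period is on_ms + off_ms."""
    period_s = (on_ms + off_ms) / 1000.0
    return 1.0 / period_s


# On/off durations (ms) taken from the response text.
protocols_ms = [(10, 10), (50, 50), (70, 70), (110, 110), (120, 120)]

frequencies = {p: stimulation_frequency_hz(*p) for p in protocols_ms}
for (on_ms, off_ms), hz in frequencies.items():
    print(f"{on_ms}/{off_ms} ms on/off -> {hz:.2f} Hz")

# Ratio between the fastest and slowest stimulation frequencies.
frequency_span = max(frequencies.values()) / min(frequencies.values())
```

      The protocols span a twelve-fold range of stimulation frequency, so a movement frequency that stays statistically indistinguishable across them is consistent with the interpretation that the pulses trigger the rhythm rather than pace it.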

      Overall, I think the strengths outweigh the weaknesses, and I consider this a timely and comprehensive addition to the field.

      Reviewer #3 (Public review):

      Summary:

      The authors set out to determine how GABAergic inhibitory premotor circuits contribute to the rhythmic alternation of leg flexion and extension during Drosophila grooming. To do this, they first mapped the ~120 13A and 13B hemilineage inhibitory neurons in the prothoracic segment of the VNC and clustered them by morphology and synaptic partners. They then tested the contribution of these cells to flexion and extension using optogenetic activation and inhibition and kinematic analyses of limb joints. Finally, they produced a computational model representing an abstract version of the circuit to determine how the connectivity identified in EM might relate to functional output. The study makes important contributions to the literature.

      The authors have identified an interesting question and use a strong set of complementary tools to address it:

      They analysed serial‐section TEM data to obtain reconstructions of every 13A and 13B neuron in the prothoracic segment. They manually proofread over 60 13A neurons and 64 13B neurons, then used automated synapse detection to build detailed connectivity maps and cluster neurons into functional motifs.

      They used optogenetic tools with a range of genetic driver lines in freely behaving flies to test the contribution of subsets of 13A and 13B neurons.

      They used a connectome-constrained computational model to determine how the mapped connectivity relates to the rhythmic output of the behavior.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I still have the following specific suggestions and questions, which need the attention of the authors:

      P5, 2nd para, li 1: shouldn't "(Figures 1E and 1E')" be (Figures 1G and 1H)?

      P7, last para, li 3: shouldn't "(Figures 2C and 2D)" be (Figures 2A and 2B)?

      P19, para 2, last 2li: "...we observe that optogenetic activation......triggers grooming movements." I could not find the place in the text or a figure, where this was reported or shown. Please specify

      P19, last para: "... shows that 13A neurons can generate rhythmic movements....." Given that the experiments were conducted in closed-loop, i.e. including the loop through the leg and its movements, the following formulation appears more justified: "....shows that 13A neurons significantly contribute to the generation of rhythmic movements,....."

      P28, para 1, li 3 from bottom: "...themselves, rather than solely between antagonistic motor neurons." While the authors are correct that in the stick insect and locust alternating inhibitory synaptic drive to flexor and extensor motoneurons has been shown to underlie alternating activity of these two antagonistic motoneuron pools, the previous studies have not shown or claimed that these synaptic inputs arise from direct interactions between these motoneuron pools. Based on this, this text should be moved to the part "feed-forward inhibition" on page 27.

      P28: "redundant inhibition": this motif has been shown to be instrumental in the locust flight CPG, e.g. Robertson & Pearson, 1985, Fig. 16.

      P28: "reciprocal inhibition": The reviewer agrees with the authors that this motif has been shown for the mouse spinal cord, but it has also been shown for other CPGs in vertebrates and invertebrates, e.g. Clione, leech, Xenopus - see the initial comment "(3) Intro and Discussion".

      Thank you, we have incorporated the suggested corrections and clarifications into the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      I'm satisfied with the revised version.

      Reviewer #3 (Recommendations for the authors):

      The authors have made a substantial effort to address my original points. They corrected the title, expanded Discussion and Methods sections, reran statistical tests using mixed models, added modelling clarifications and constraints, and fixed or removed confusing figure panels. Those changes have improved clarity and reduced some of the claims that I thought were exaggerated.

      That said, some of my concerns remain only partially addressed, which could be fixed with relatively small tweaks. The authors should:

      (1) Explicitly separate empirical findings from modelling inferences throughout the manuscript, including the Abstract, Results and Discussion (i.e., label claims of "intrinsic rhythmogenesis" as model-based inferences, not direct experimental demonstrations)

      (2) Provide supplemental information on modelling to quantify the role of the black-box input (e.g., quantitative coordination/phase/frequency metrics for full model vs constant-input vs no black box), show pre- vs post-fine-tuning weight changes and the exact tuning constraints/optimization details (I could not find these details)

      (3) To ensure results are reproducible, provide a supplemental table mapping each split line to EM-identified neuron(s) with NBLAST/morphological scores for each match;

      (4) Fully document the statistical models (exact LMM/GLMM formulas, software/packages, etc);

      (5) Deposit model code, trained weights and analysis scripts in a public repository.

      We have updated the GitHub repository with the full statistical analysis documentation and model code, including trained weights and scripts.

    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1:

      (1) As such an amount of work has been put into developing this community tool, it would be worth thinking about how it could serve other multiplex-immunofluorescence methods (such as immunoSABER, 4i, etc). One option is adding an extra tab where the particular method that uses those reagents is mentioned. This would also help as IBEX itself and related methods evolve in the future.

      We agree and currently support six other methods beyond the original “IBEX2D Manual”, with the most generic being “Multiplexed 2D Imaging”: a standard, single-cycle (non-iterative) imaging method applied to thin, 2D (5-30 micron) tissue sections. We plan to evolve to include multiplex IF methods such as Immuno-SABER, 4i, Cell DIVE, etc. The current structure of the reagent resources table can support other immunofluorescence methods without modifications. The table contains information for IBEX and related methods. The particular method for which a reagent validation was evaluated is specified in the column titled “Method”. Descriptions of supported methods are given in the reagent glossary.

      (2) It has a rather minimal description of the software. In particular, there is software that has not been developed for IBEX specifically but that could be used for IBEX datasets (ASHLAR, WSIReg, VALIS, WARPY, and QuPath, etc). It would be nice if there was mention of those.

      ASHLAR, WSIReg, VALIS, and Warpy have been added to the Knowledge-Base. These software components are specifically relevant for iterative imaging protocols which require image alignment. With respect to QuPath, Fiji, Napari and other general microscopy image analysis frameworks, these are not listed. Such frameworks provide a wide range of operations relevant for many microscopy image analysis tasks and are likely already familiar to researchers who are interested in the information contained in the Knowledge-Base.

      (3) There is a concern about how the negative data information will be added, as no publication or peer-review process can back it up. Perhaps the particular conditions of the experiment should be very well described to allow future users to assess the validity.

      We agree with this observation and have added the following language to the contribute page:

      “When reporting information that has not appeared in a peer-reviewed publication, both negative and positive results, include more details with respect to experimental conditions and provide sample images as part of the supporting material files. In all cases, peer reviewed or not, we encourage providing additional details in the supporting material that you deem important and are not part of the csv file structure. These include, but are not limited to, lot numbers, versioned protocols used in the work, and any other information which will facilitate validation reproducibility.”

      (4) The proposed scheme where a reagent can be validated or recommended against by up to 4 different labs should be good. It may be good to make sure that researchers who validate belong to different labs and are not only different ORCID that belong to the same group. Similar to making a case of recommendations against a reagent.

      We generally support this recommendation. Based on our experience, even members within the same laboratory encounter challenges when attempting to validate reagents contributed by current or former colleagues. Additionally, research labs often experience significant personnel turnover, with minimal overlap over a five-year span.

      To address these concerns, we have updated the instructions on the contribute page as follows: “We only accept up to 5 ORCID additions in the Agree or Disagree columns. This means that the original contributor’s work was replicated by up to 4 individuals or refuted by up to 5 people. Priority is given to contributions from individuals in laboratories distinct from the original source.”
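      The cap described above can be checked mechanically. The following is a minimal sketch only, not the Knowledge-Base's own tooling: the function names, the semicolon separator between ORCIDs, and the placeholder identifiers are all assumptions made for illustration.

```python
MAX_ORCIDS = 5  # cap quoted in the contribute-page instructions


def parse_orcids(cell: str) -> list[str]:
    """Split one table cell into ORCID strings.

    The semicolon separator is an assumption for this sketch; the real
    csv layout may differ.
    """
    return [part.strip() for part in cell.split(";") if part.strip()]


def check_row(agree_cell: str, disagree_cell: str) -> list[str]:
    """Return a list of human-readable problems for one reagent row."""
    problems = []
    for name, cell in (("Agree", agree_cell), ("Disagree", disagree_cell)):
        count = len(parse_orcids(cell))
        if count > MAX_ORCIDS:
            problems.append(f"{name} column has {count} ORCIDs (max {MAX_ORCIDS})")
    return problems


# Placeholder identifiers, not real ORCIDs.
five = ";".join(f"0000-0000-0000-000{i}" for i in range(1, 6))
six = five + ";0000-0000-0000-0006"
print(check_row(five, ""))  # no problems: original contributor plus 4 replications
print(check_row(six, ""))   # one problem: Agree column exceeds the cap
```

      Note that the Agree column counts the original contributor, which is why five entries correspond to at most four independent replications.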

      (5) It is very interesting to keep track of the protocol versions used. Perhaps users should be able to validate independent versions and it will be important to know how information is kept.

      Thank you for your suggestion. We encourage members of the community to cite the latest version of the Knowledge-Base in the “Citing the Knowledge-Base” section.

      (6) The final point I would make is that the need to form a GitHub repository may deter some people from submitting data. For sporadic contributions, authors could think that users could either reach out to main developers and/or provide a submission form that can help less experienced users of command-line and GitHub programming, but still promote the contribution from the community.

      We have given this significant thought and now support a secondary path for contributing that does not require familiarity with git or GitHub. This path involves downloading a zip file, modifying the contents of the csv files and providing supporting material text files and images. Once the work is completed, the contributor contacts the Knowledge-Base maintainers and we complete the submission together, with the maintainers dealing with the usage of git and GitHub. This information has been added to the notes which are listed at the top of the Contribute page. We have recently completed the first contribution that followed this new workflow.

      We still encourage researchers to familiarize themselves with git and the GitHub repository hosting service. These tools have been shown to be useful for collaborative and reproducible laboratory research.

      Reviewer #2:

      (1) The potential impact of IBEX KB is very clear. However, the paper would benefit by also discussing more on KB maintenance and outreach, and how higher participation could be incentivized.

      We have added the following details to the discussion:

      The KB is actively maintained by its chairs, who meet bi-weekly to ensure its continued development and maintenance. In addition to these regular meetings, we engage with both current and prospective community members to gather feedback, encourage contributions, and expand the collective knowledge supporting the KB. To broaden outreach and foster sustained engagement, the IBEX community will collaborate with synergistic initiatives such as the HuBMAP Affinity Reagents Working Group, the European Society for Spatial Biology (ESSB), and the Global Alliance for Spatial Technologies (GESTALT).

      As a further incentive for participation, we intend to launch an annual “Reagent Validation Week”, a community-driven event inspired by software hackathons. During this dedicated week, researchers would focus on validating or reproducing validation for selected reagents and contribute their findings to the KB. We have also discussed hosting an “Around the World” symposium, featuring presentations from both junior and senior scientists across the community, to showcase diverse perspectives and foster global collaboration.

      (2) Use of resources like GitHub may limit engagement from non-coding members of the scientific community. Will there be alternative options like a user-friendly web interface to contribute more easily?

      We agree with this observation and have addressed it. Please see detailed response to point 6 from Reviewer 1.

      Reviewer #3:

      (1) IBEX is a specific immunofluorescence method. However, the utility of the Knowledge base is not limited to the specific IBEX method. Therefore, I suggest removing the unnecessary branding of the term IBEX from the KB and citing potentially other similar cyclic immunofluorescence methods in the manuscript (e.g. CycIF Lin et al 2018). This would also emphasize the wider impact and applicability of the KB to the wider imaging community.

      For now, we have decided to keep the original reference to the IBEX method in the resource name and re-brand it in the next development phase. In that phase we intend to solicit reagent validations for methods unrelated to IBEX. We have added the reference to the CycIF publication. The manuscript text now reads: “We are optimistic that future versions will include extension of the IBEX method to other tissues and species and we intend to solicit contributions of reagent validations for other multiplexed imaging techniques such as CycIF Lin et al. (2015). At that point in time we expect to re-brand the KB as the IBEX++ Knowledge-Base...”

      (2) I believe reporting negative results with reagents is highly valuable. However, the way to report antibodies must include more details. To ensure data quality, every report should be linked to a specific protocol + images (or doc with the standard document variations), and sample information. This should be a mandatory requirement.

      We agree that this information is desirable, but we do not agree that it should be mandatory. In the contribution instructions we now explicitly list lot numbers and versioned protocols as examples of details that we encourage contributors to include in their supporting material files. We believe that requiring this information for a contribution sets the bar too high and will deter many from contributing information that can benefit others.

      (3) While cross-validation among researchers is beneficial, even if five individuals fail to reproduce results with a given antibody, their findings may be influenced by technique-specific factors. It is also important to consider whether these researchers come from the same group, institution, or geographical region, as this could impact reproducibility. Additionally, entries that have not been reproduced at least five times using the same protocol should still be considered valuable information. To address this, an “insufficient validation data” flag could be implemented, ensuring that incomplete but useful findings remain accessible.

      The contribution instructions now state that “Priority is given to contributions from individuals in laboratories distinct from the original source”.

      While our goal is to support reproducing reagent validations, we do not expect these types of contributions to be the rule, as the only incentive we can provide to encourage this behavior is co-authorship on the authoritative dataset. As a result, it is likely that many of the validations will have a single endorser, the original contributor. These results are valuable information and we do not think they should be singled out (insufficient validation label). We leave it up to the users of the KB to decide whether they trust recommendations with multiple endorsers or if endorsement by a single highly trusted contributor is sufficient for them. In all cases, issues with contributions can be raised and discussed on the KB discussion forum.

      The rationale for limiting the number of reproduction studies to five was that this is a minimal, yet sufficiently large, number that provides confidence in the results. Placing an upper limit ensures that researchers do not provide reproduction results for widely used and well established reagents just because these results are readily available to them.

      (4) This system could flag reagents with inconsistent reports, highlight potential technique-specific issues, and suggest alternative reagents with stronger validation records. Furthermore, a validation confidence ranking could be introduced, taking into account the number of independent confirmations, protocol consistency, and reproducibility data. These measures would help refine the reporting process while maintaining transparency and scientific rigor.

      We agree that the functionality described here is desirable, but this is not part of the KB. At its core the KB is a dataset and we do not envision developing dedicated tools to perform these tasks. Instead, we foresee using the KB as context for interacting with AI agents. Providing the KB as context to an AI, one can currently use it to answer domain-specific questions and perform related tasks such as designing imaging panels (under subject matter expert supervision). This was added to the sample use cases in the manuscript, with a transcript of an interaction with an AI model using the website as context provided as supplemental material.

      (5) Regarding image formats for results reporting, while JPG files are convenient due to their small size, TIFF files offer significant advantages, such as preserving metadata and maintaining the integrity of real data values. Proper signal adjustments may not always be applied by researchers, making TIFF crucial for accurate data analysis. I suggest in this regard making available the possibility of including a link to the original TIFF data.

      The goal of the supporting material image is similar to that of an image used in a manuscript and it should not be used for data analysis purposes. This is the reason we chose the JPG format. Sharing these images is not intended to be a substitute for publicly sharing the original images and their associated metadata. This is now noted in the contributing instructions.

      (6) Homepage:

      Include a brief summary of the knowledge base’s purpose and tabs to provide clarity for new users. The current homepage is a bit misleading for newcomers.

      The homepage has been modified to include information about the Knowledge-Base, contents and how to use it including as context for interaction with AI agents.

      (7) Reagent Resources Section: Enable users to search for a target name directly, rather than filtering through dropdown options.

      The dropdown menu explicitly shows all available targets and also allows for direct search of target name. To use it for direct search, once the dropdown is selected start typing the name of the target and the focus will jump to it. Thus, if looking for “Zrf1” there is no need to scroll through all targets in the dropdown. This also facilitates easy clearing of a filter: select the dropdown and start typing the word “clear”, then press enter when it is highlighted. This information has been added to the page.

      Provide an option to download the dataset as a CSV file. This feature will be highly valued by non-computational researchers.

      Links to download the reagent resources csv file and the whole Knowledge-Base have been added.

      Add the same column documentation here as in the contributor instructions. For example, you need to make clear the distinctions between “Recommend,” “Agree,” and “Disagree” ratings, as they may be misleading to those who have not visited the rules to contribute.

      A link to the column documentation in the contributor instructions has been added here. Information on the website is displayed in one location and linked as needed. Duplicated display of information creates uncertainty for users and results in more complex instructions when referring to the information.

      Include additional details in the dataset, such as lot numbers, or the date of the contribution, that could be relevant in different settings.

      Please see response to point 2.

      (8) Data & Software Section:

      Add filtering options in the table based on organism and tissue availability

      This data is not encoded in the available information in an independent manner so we do not directly enable filtering. It is usually included in the “Details” free form text. This text is duplicated from the original dataset descriptions. One can still search this page using the browser's search functionality to achieve behavior similar to filtering. While the “Details” text may not be visible due to the usage of the accordion user interface, it is still searchable and will automatically expand when the search text is found under the collapsed accordion button.

      (9) Contributor Section:

      Incorporate figures from the manuscript to make it more visual and improve understanding of rules and standards.

      Figure 4 from the manuscript was added to this page.

      I believe reporting negative results with reagents is highly valuable. However, to ensure data quality, every report should be linked to a specific protocol and sample information. This should be a mandatory requirement. To streamline the process, warnings for certain reagents could be implemented, but a reagent should not be outright labeled as ineffective without proper validation.

      Please see response to point 2.

      Cross-validation among researchers is beneficial, but even if five individuals fail to reproduce results with a given antibody, it may still be due to technique-specific factors, particularly for non-routine antibodies.

      We agree with this observation and have modified the contribution instructions accordingly:

      When overturning previously reported results, i.e., when the number of ORCIDs in the Disagree column becomes greater than the number in the Agree column, we will open the contribution for public discussion on the Knowledge-Base forum before accepting it.

      The intent is to increase the community’s confidence in the results, particularly when dealing with non-routine antibodies. This allows the original contributor and other members of the community to engage with the researchers who were unable to replicate a specific validation, possibly helping them to replicate the original results by adding missing details to the KB, or explicitly identifying and documenting issues with the original work.

      Regarding image formats, JPG files are convenient due to their small size, but TIFF offers significant advantages, such as preserving metadata and maintaining the integrity of real data values. Proper signal adjustments may not always be applied by researchers, making TIFF crucial for accurate data analysis.

      Please see response to point 5.

    1. Abstract: The increasing availability of viral sequences has led to the emergence of many optimized viral genome reconstruction tools. Given that the number of new tools is steadily increasing, it is complex to identify functional and optimized tools that offer an equilibrium between accuracy and computational resources as well as the features that each tool provides. In this paper, we surveyed open-source computational tools (including pipelines) used for human viral genome reconstruction, identifying specific characteristics, features, similarities, and dissimilarities between these tools. For quantitative comparison, we create an open-source reconstruction benchmark based on viral data. The benchmark was executed using both synthetic and real datasets. With the former, we evaluated the effects to the reconstruction process of using different human viruses with simulated mutation rates, contamination and mitochondrial DNA inclusion, and various coverage depths. Each reconstruction program was also evaluated using real datasets, demonstrating their performance in real-life scenarios. The evaluation measures include the identity, a Normalized Compression Semi-Distance, and the Normalized Relative Compression between the genomes before and after reconstruction, as well as metrics regarding the length of the genomes reconstructed, computational time and resources spent by each tool. The benchmark is fully reproducible and freely available at https://github.com/viromelab/HVRS.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf159), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Levente Laczkó

      I reviewed the manuscript titled "An evaluation of computational methods for reconstruction of human viral genomes" by Sousa et al. The authors reviewed different tools for the reconstruction of viral genomes and developed a benchmarking framework to measure the performance of the different tools. The benchmarking was performed with both synthetic and real sequencing data, and the authors provide recommendations for different scenarios. The benchmarking framework developed with Bash is also made available on GitHub, providing the scientific community a good example to increase reproducibility. The analysis steps are also clearly described in the manuscript. Independent benchmarks, such as presented in the manuscript, are valuable contributions to the scientific literature and help to select the right tool for different tasks. The manuscript is clearly structured and well written, and the results are appropriately presented with rich supplementary material. I definitely recommend the publication of the manuscript in GigaScience. However, I have some questions that I think should be addressed before publishing the final version to further improve the manuscript.

      The authors describe that multiple strains may be present within a single infection. Indeed, the variability of strains within a single infection may be particularly important for some viruses. QuRe, ViSpA, SAVAGE and ViQUF are explicitly designed to find quasispecies. Are there any other tools in the benchmark that can predict whether samples are heterogeneous (or whose results can be used to infer this)?

      The authors have used the human mitochondrion as a source of contamination to test whether the tools are sensitive to it. Is there a reason why only the mitochondrion was used for this test and other, perhaps random, human DNA fragments were not?

      The error rate can strongly influence the accuracy of reference-based genome reconstructions. Has the effect of error rate been tested or could it affect the results, e.g. are there any tools in the benchmark that are less sensitive to higher error rates?

      In the synthetic dataset, the coverage ranged from 2-40×. This range represents scenarios where the viral copy number is low, but especially if the viral DNA was enriched before sequencing, the coverage could be much higher. Is there a reason to specifically choose 40x coverage as the highest coverage value? I agree that low coverage is a difficult challenge, but checking the performance of different tools at high read depth can help readers to choose the right tool for these use cases if there is a difference in the performance of the tools at e.g. >100x coverage.

      The authors correctly describe that the complexity of genomes can be a challenge for accurate genome reconstruction. Assessing the complexity (e.g. repetitive content ratio, GC ratio) of the genomes used in the synthetic dataset can add additional value to the results by showing how different tools perform on genomes of different complexity.

      Some reference-based tools (QVG, TRACESPipe, TRACESPipeLite and V-pipe) produced results with many gaps. Could the different approach be a reason for how they deal with low coverage regions? QVG, for example, masks positions with low sequencing depth to increase the specificity of the search for polymorphisms. Can the gaps be explained by the variation in sequencing depth, i.e. could the gaps be linked to genomic regions with very low or very high sequencing depth?

      I agree that benchmarking real datasets without the correct original sequence is a difficult task. I believe that showing the coverage and completeness (e.g. the ratio of the reconstructed length to the length of the reference genome) can be additional useful information for the reader to choose the right tool for different tasks. The expected length of the viral genomes could be determined by the length of the reference genomes used, based on the classification of FALCON-meta, and in the case of de novo pipelines, the scaffolds that do not match the references could be classified using e.g. kraken2. This could show how complete the reconstructed genomes are and whether there are other viral genomes in the samples that FALCON-meta missed but still represent valuable information. Supplementary Figures S143-S146 show the number of reconstructed bases with and without gaps, but I think that this experiment should be emphasised more in the main text and that the ratio of reconstructed bases to the expected genome sizes might be more informative than just the total number of reconstructed base pairs.
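      The suggested ratio of reconstructed bases to expected genome size can be illustrated with a short sketch. It is not part of the HVRS benchmark: the function name, the choice of gap characters, and the toy sequence are assumptions made for illustration, and the expected length would in practice come from the matched reference genome.

```python
def completeness(reconstructed: str, expected_length: int, gap_chars: str = "Nn-") -> float:
    """Fraction of the expected genome covered by non-gap bases.

    `gap_chars` lists the characters treated as gaps; treating N, n and
    '-' as gaps is an assumption of this sketch.
    """
    if expected_length <= 0:
        raise ValueError("expected_length must be positive")
    called = sum(1 for base in reconstructed if base not in gap_chars)
    return called / expected_length


# Toy example: 10 positions expected, 4 of them reconstructed as gaps.
toy = "ACGTNNNNAC"
print(completeness(toy, expected_length=10))
```

      A value above 1 is possible with this definition when a reconstruction is longer than the reference, which may itself be worth flagging.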

      1) Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Yes

      2) Are the conclusions adequately supported by the data shown? Yes

      3) Please indicate the quality of language in the manuscript. Does it require a heavy editing for language and clarity? The language is well understandable

      4) Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Yes

    1. ABSTRACT: High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating these diverse data types can yield deeper insights into the biological mechanisms driving complex traits and diseases. Yet, extracting key shared biomarkers from multiple data layers remains a major challenge. We present a multivariate random forest (MRF)–based framework enhanced by a novel inverse minimal depth (IMD) metric for integrative variable selection. By assigning response variables to tree nodes and employing IMD to rank predictors, our approach efficiently identifies essential features across different omics types, even when confronted with high-dimensionality and noise. Through extensive simulations and analyses of multi-omics datasets from The Cancer Genome Atlas, we demonstrate that our method outperforms established integrative techniques in uncovering biologically meaningful biomarkers and pathways. Our findings show that selected biomarkers not only correlate with known regulatory and signaling networks but can also stratify patient subgroups with distinct clinical outcomes. The method’s scalable, interpretable, and user-friendly implementation ensures broad applicability to a range of research questions. This MRF-based framework advances robust biomarker discovery and integrative multi-omics analyses, accelerating the translation of complex molecular data into tangible biological and clinical insights. Competing Interest Statement: The authors have declared no competing interest.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf148), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 2: Yun-Juan Bao

      The article presents an Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery. It addresses the challenge of extracting key shared biomarkers from multiple omics data types by introducing a multivariate random forest-based approach enhanced by an inverse minimal depth metric.

      I have some concerns and comments below:

      1. The new algorithm described in the study selects omics variables by assigning response variables to decision tree nodes. How do the response variables relate to biological responses/outcomes? From the authors' description, it seems that the omics variables selected using the IMD are almighty, i.e., they can predict anything needed, such as prognosis, cancer types, etc. Actually, the usual logic for selecting omics variables to predict prognosis is to evaluate the association between omics variables and survival time.

      2. Following the discussion in 1, what is the biological meaning of extracting shared biomarkers from multiple data layers? While it is straightforward to think that the shared biomarkers between multiple data layers or data types may induce the same biological responses, the unique biomarkers also matter, depending on what biological responses we care about.

      3. The Introduction section is not sufficient. The biological significance and technical details of "extract shared biomarkers from multiple data layers" need to be explained in more detail.

      4. It is advised to provide some examples for the statement in the Introduction that the current methods (sPLS, CCA) "may fail to capture nonlinear interactions".

      5. It is also advised to explain and illustrate how the new method proposed in this study addresses the challenge that traditional methods face in capturing nonlinear relationships. An ablation study could be one of the choices.

      6. The authors showed that their new approach "uncovered known cancer biological relevant pathways". How about the functional enrichment of genes selected by traditional methods, such as sPLS and CCA?

      7. The authors showed that the RNA-seq and ATAC-seq features selected using the new approach are able to capture the distinction between different cancer types (Figure 8). It is suggested to quantitatively evaluate this capability using metrics such as recall and precision, to calculate how many samples are correctly classified and how many are misclassified in comparison with other methods.

      8. It is advised to refine the Discussion. In what scenarios can the new method be applied? What biological insights can be obtained and what can be missed by the new method?

      9. The authors did not provide sufficient details about the datasets they used in the Method section. How many samples are in TCGA? How many features did they use? How many features were left after filtering?

      10. Although the new approach showed somewhat superior performance in comparison with other methods, the authors only used the currently known databases. It is advised to apply their approach to additional testing datasets or real-world datasets to increase confidence in the conclusion of this study. It is also observed that the performance of sPLS is better than the others in some cases (Figure 4).

      11. It is suggested to refine the figures. The labels and legends are too tiny to be seen.

      12. There are no sub-figure labels a, b, c, d, e, f in Figure 8. The positions of the sub-figure labels in Figure 3, Figure 4, Figure 5 and Figure 7 are not correct.
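      The quantitative evaluation requested in point 7 can be sketched as follows. This is an illustrative example only and does not reproduce the authors' analysis: the function name and the toy cancer-type labels are assumptions, and per-class precision and recall are computed directly from paired true and predicted labels.

```python
from collections import Counter


def precision_recall(y_true, y_pred):
    """Per-class (precision, recall) from paired true and predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correctly classified sample
        else:
            fp[pred] += 1    # counted against the predicted class
            fn[truth] += 1   # missed for the true class
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy labels (hypothetical): one BRCA sample is misclassified as LUAD.
y_true = ["BRCA", "BRCA", "LUAD", "LUAD"]
y_pred = ["BRCA", "LUAD", "LUAD", "LUAD"]
for label, (p, r) in precision_recall(y_true, y_pred).items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

      Running the same function on the cluster or class assignments from each compared method would give directly comparable counts of correctly and incorrectly classified samples.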

    1. Abstract

      The processing and analysis of magnetic resonance images is highly dependent on the quality of the input data, and systematic differences in quality can consequently lead to loss of sensitivity or biased results. However, varying image properties due to different scanners and acquisition protocols, as well as subject-specific image interferences, such as motion artifacts, can be incorporated in the analysis. A reliable assessment of image quality is therefore essential to identify critical outliers that may bias results. Here we present a quality assessment for structural (T1-weighted) images using tissue classification. We introduce multiple useful image quality measures, standardize them into quality scales and combine them into an integrated structural image quality rating to facilitate the interpretation and fast identification of outliers with (motion) artifacts. The reliability and robustness of the measures are evaluated using synthetic and real datasets. Our study results demonstrate that the proposed measures are robust to simulated segmentation problems and variables of interest such as cortical atrophy, age, sex, brain size and severe disease-related changes, and might facilitate the separation of motion artifacts based on within-protocol deviations. The quality control framework presents a simple but powerful tool for the use in research and clinical settings.

      Competing Interest Statement

      The authors have declared no competing interest.

      This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf146), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

      Reviewer 1: Chris Foulon

      The article presents a valuable effort towards standardising quality control methods and their evaluation. However, too many choices seem arbitrary without sufficient justification, and too many sections are unclear. Overall, the quality of the work cannot be fully assessed in the current state of the manuscript, and major revisions are needed to correct that. There is also not enough comparison (one) with other methods and no way of evaluating whether these measures are relevant to actual downstream imaging uses. Additionally, the article's goal is highly unclear and led me to think the segmentation measures were part of the QC pipeline until I read the discussion ... Nothing until the discussion explains that the segmentation measures are used to evaluate the single SIQR score output of the QC pipeline.

      Comments: "All measures and tools are part of the Computational Anatomy Toolbox (CAT; https://neuro-jena.github.io//cat, Gaser et al., 2024) of the Statistical Parametric Mapping (SPM; http://www.fil.ion.ucl.ac.uk/spm, Ashburner et al. 2002) software and also available as a standalone version (https://neuro-jena.github.io/enigma-cat12/#standalone)." I cannot really expect everyone to avoid Matlab tools. Still, Matlab is a drag to the development of scalable tools nowadays (every system admin's nightmare is to have to try to make Matlab tools run on high-performance computing servers).

      "such as noise, inhomogeneities, and resolution (Figure 1B)." At this point in the article, it's a bit unclear how that works in Figure 1B.

      "It is assessed within optimized cerebrospinal fluid (CSF) and white matter (WM) regions." Then, the NCR relies on the segmentation, right? What if the segmentation fails?

      Oh, most of the measures actually rely on the segmentation. Are segmentation errors accounted for in the tool? I am thinking specifically about "abnormal" brains that can be difficult for segmentation algorithms. At least at this point of the article, it's not clear.

      "To accommodate various international rating systems, we have adopted a linear percentage and a corresponding (alpha-)numeric scaling." this doesn't match the complexity of the following explanation about the rather arbitrary range. I think a much more international and understandable rating would have been a 0 to 1 range. A 0.5 to 10.5 range is not helping users at all. As the rating is linear, I am struggling to see the added value of this choice.

      "Although the BWP does not include the simulation of motion artifacts, these are in general comparable to an increase of noise in the BWP dataset by 2 percentage points." Maybe that should be justified with a reference? "in general" might be a bit light to justify not having a direct measure for something presented as important (motion artefacts) in the introduction and goal of the tool. I think the absence of a noise estimation in the QC ratings should be more thoroughly justified.

      "To balance the sensitivity to different quality measures while ensuring that the necessary quality conditions are met, we apply an exponentially weighted averaging approach — similar to the root mean square (RMS) but using the fourth power and fourth root." Why is there no justification or references for these arbitrary choices? Why not the fifth root or tenth root? Why the square root and not an exponential or any other function?
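
      To make the question about the exponent concrete, here is a minimal sketch of a generalized (power) mean, which reduces to the RMS for p = 2 and matches the quoted "fourth power and fourth root" for p = 4. It assumes unweighted, grade-like ratings where larger values mean worse quality; the function name and the example ratings are hypothetical and not taken from CAT.

```python
def power_mean(values, p):
    """Generalized mean: (mean of v**p) ** (1/p). p=1 is the average, p=2 the RMS."""
    if not values:
        raise ValueError("values must be non-empty")
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

# Hypothetical ratings for three quality measures; the third is clearly worse.
ratings = [2.0, 2.0, 5.0]

arithmetic = power_mean(ratings, 1)   # plain average
rms = power_mean(ratings, 2)          # root mean square
fourth = power_mean(ratings, 4)       # exponent quoted in the manuscript
tenth = power_mean(ratings, 10)       # one of the alternatives raised above
```

      For non-negative inputs the result moves monotonically from the plain average toward the worst single rating as p increases, so the choice of p determines how strongly one poor measure dominates the combined score. That is why a justification for p = 4 specifically is needed.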

      "Sample Normalization for Outlier Detection" It is unclear whether this is systematically applied or not. Is it a separate measure, or is it aggregated into another score? That measure could be relevant in many cases but could also be really bad in some specific cases (for example, historical data where the "ideal" quality would probably be well below standards).

      "raw (co-registered)" Well, it is not raw if it's co-registered. I suggest reformulation to avoid confusion with actual raw images.

      The "Evaluation Concept and Data" section is very unclear. The need for a training-testing scheme is not explained, and the scheme itself is very arbitrary (choosing odd and even numbered files ordered by filenames). How does that splitting strategy help with generalisation? Why that specific split? Why not another? How do we know that split is not biased? Finally, the selection of 6 scans also seems completely arbitrary. Overall, this section does not provide enough information to justify the seemingly arbitrary choices.

      "Of note, obvious subject/scan-specific motion artifacts generally increase the scans' rating for about 1 grade, which corresponds to a decrease of 10 rps (and +0.5 grade / -5 rps for light artifacts), in comparison to the typical rating achieved by the majority of scans of the same protocol." This is incredibly vague! How are readers supposed to evaluate the quality control measures with this information?

      Discussion: "as this is more relevant for segmentation and surface reconstruction (Ashburner et al., 2005)." A lot of work has been done in these domains in 20 years; this reference, however solid, is not enough to justify that choice. This might not be relevant with the methods developed in the last 20 years.

      "with a power of 4 rather than 2, to place greater emphasis on the more problematic aspects of image quality." Still not enough to justify that choice. The authors failed to convince me that one single score is better than reporting all the measures significantly, as different quality measures will influence different tasks. A very practical example is the fact that, in the vast majority of acquisitions in clinical settings, the resolution is anisotropic (though less so with T1 images nowadays, historical datasets will still have it). This anisotropy is not necessarily an issue for human diagnosis, for example; however, aggregating all the scores into one might hide that a low-quality measurement might not affect the specific downstream task. Coupled with the lack of justification for the factor scalings, this choice of a single score is a significant negative point for the tool.

      Data availability: Where can the sources of these specific tools be accessed?

    1. R0:

      Review Comments to the Author


      Reviewer #1:

      1. The manuscript primarily shows that adding a visual inspection step increased the proportion of prosthetic feet deemed usable (83% to 94%). This outcome is predictable and does not constitute meaningful scientific innovation. The work reads as an operational description rather than rigorous research; novelty and contribution are therefore limited.
      2. The proposed checklist is not validated. There is no mechanical or structural testing, no clinical functional outcomes, no prospective field evaluation, no inter-rater reliability assessment, and no sensitivity or specificity analysis. Accordingly, the checklist cannot be considered a standard, and the conclusions overstate the evidence. A formal validation phase is required.
      3. Safety, mechanical integrity, and lifespan have not been evaluated. Visual inspection alone is inadequate for medical devices. No ISO-aligned static or cyclic loading tests are presented, nor are durability or time-in-service data available. This is a critical omission given the manuscript’s intent to inform international practice.
      4. No patient-level outcomes are included (for example, fit success, comfort, skin issues, mobility, abandonment, repair frequency, or time-to-failure). Without these data, the practical value of the intervention remains uncertain.
      5. Brand-level comparisons are underpowered, and model-level or material-level analyses are not presented. Despite acknowledging this limitation, the manuscript still interprets brand-related effects.
      6. The Introduction and narrative sections are disproportionately long and repetitive; substantial condensation is recommended. In contrast, the Methods and Results require greater depth and clarity.
      7. The statistical analysis is limited. Logistic models do not account for key confounders such as service age, storage duration, materials, or model type. Model diagnostics, effect sizes with confidence intervals, and multiple-comparison considerations are not reported.
      8. Economic evaluation is absent. Donation and reuse programs in low and middle income settings are cost sensitive, and without cost modeling, the recommendations have limited actionable value.
      9. Several claims are overstated, including suggestions related to circular economy effects, international standard development, and safety assurance. These assertions are not supported by the presented data and should be moderated.

      Reviewer #2: It is suggested to review the Nippon Foundation/Exceed Cambodia in proposing the standards of P&O. The case study that has been done in Cambodia, Myanmar, Laos, Vietnam and Sri Lanka will guide the current P&O Standard in low and middle income countries.

      It is best to review the minimum standards of P&O in these countries as an underlying theory to govern the foundation of foot reuse and donation.

      Robust systematic reviews are vital in proposing standards for foot reuse and donation in low and middle income countries. Updated literature is needed.

      It is suggested to explore the preliminary findings in these low and middle income countries.

      Reviewer #3: GENERAL This reviewer welcomes the ambition of the authors to start developing standards for donated prosthetic componentry to LMICs. Such standards are indeed much needed as one important factor to improve the quality of the prosthetic devices provided within LMICs.

      The authors’ work has carefully been imbedded into a wealth of information and reasons for why the need is urgent for developing standards of donated prosthetic components. This information has been mindfully drafted including viewpoints and situation of many LMICs as well as HICs. Well done!

      What left this reviewer wondering is why the development of the checklist has not been carried out with locals at the two centers, where MB and PM were able to collect the data of the stored feet. The rationale for not doing so should be included into the Limitations section.

      Further, why has no testing of the developed checklist been carried out with the two centers? For example, dividing the available feet into two equal sized groups would have raised the opportunity to develop the checklist with one group of feet including the regression model and then test it on the remaining feet in the second group. Why was this not considered? One could classify all available feet as indicated in Table 1, but then consider only those feet that were mostly used in the field or were mostly available. Lowering the number of independent variables to those variables that would best represent the essence of the checklist would have given the option for a regression model, or is this reviewer mistaken? These points should be discussed in the paper. In case the paper gets too long (word count), it is recommended to condense the actual discussion section as it provides similar points stated in the introduction.

      And lastly, this reviewer does not think that retesting used feet similar to the stated ISO standards would be feasible. Instead, it might be worthwhile checking in other industries (aviation, deep-sea shipping) what type of non-mechanical controls for checking of wear and tear on materials/motors are available without dismantling motors or testing of used structures. Perhaps some light and/or sonar evaluation would be a way to check the mechanical structure of used prosthetic feet and other componentry without putting any more strain on the used materials. That might be some thoughts for the Future Work section. Also probable collaboration with universities in LMICs should be considered as a close source of additional brain power for the development of standards within a given country.

      DETAILED The reviewer finds the word ‘prosthetics’ difficult and prefers the (correct) term ‘prosthetic componentry or prosthetic components’ instead. In her experience, using the nomenclature of the P/O profession adds clarity in an interdisciplinary context. It is often unclear to people outside of or adjacent to the P/O profession that a ‘prosthetics’ is composed of different products, i.e. some industrially produced prosthetic components and – in most cases – a bespoke, locally fabricated prosthetic socket. By using prosthetic components or prosthesis/prostheses when referring to the final product, the authors will signal directly that there are ‘pieces’ needed to compose an entire prosthesis. Further, using the correct term assists in distinguishing prostheses fabricated with componentry from those being fabricated by 3D printing, also a field needing standards for C2C design. Therefore, please change the wording accordingly within the entire paper – thank you!

      Lines 165-168. This sentence seems to be incomplete – please check.

      Line 229. This statement is incorrect. In Switzerland (and the reviewer is sure this is the case in France, Netherlands and the UK), prosthetic componentry has different life/warranty cycles depending on the type of prosthetic component and its model. Please rephrase this sentence pointing out that different prosthetic components and their models have different life/warranty cycles set by the industrial manufacturers.

      Lines 284-286. This sentence is unclear: Are the authors checking prosthetic feet shipped to Africa prior to the study or as part of the study when these feet arrive in Africa? If they are analyzed prior to the study, how do the authors make sure that the damage seen is indeed due to shipping and not due to storage, for example? If the authors controlled feet within the study time period, would the sentence not need to state “… we review prosthetic feet ALSO in Africa.”? Or did the authors not review the feet at the study place, but only in Africa? Please clarify and rephrase – thank you. These clarifications/details seem to be better placed within the Materials and Methods Chapter.

      Lines 287-311, in particular lines 311-317. Because the authors use an experimental setup, variables are usually considered as ‘independent’ or ‘dependent’. Please clarify what variables (independent, dependent) were considered. All variables the authors used to classify the different feet need be listed together with the rationale for the decision to include them into the regression model, including their order.

      Ok – are the variables listed on line 314 the ones considered as independent variables to classify a prosthetic foot as ‘reusable’ or ‘not reusable’? If so, why? In other words, why do the authors consider the ‘brand’ to be more important than the condition of the foot itself? Or is it the case because only those feet that passed the visual test of being 'usable' were included into the regression model? Up to this point, this reviewer understood the aim of the study as being to develop a set of criteria to classify a prosthetic foot as reusable or not. If a visual pre-selection needs to be carried out first, how good/robust is the regression model that follows? Please clarify and add this clarification to the text – thank you.

      Lines 296-298. What variables (the authors call them ‘flaws’, if understood correctly) did the authors consider during the usability tests? How were these tests carried out? What happened with the feet the authors did consider as ‘not usable’: were they removed from the total sample of 366 feet (see below remarks to line 319)? For illustration: assuming the authors used for their visual check a variable called ‘cracks within the cosmetic’: did the authors classify a foot as still usable when only surface cracks were present, or did they exclude any foot with a crack in its shell? What were the criteria to classify a SACH foot as ‘usable’? More detailed information about the entire method for the visual checks and the resulting classification needs to be stated.

      When did the authors add any of these variables into the regression model, and did they give some of the variables a weighting, i.e. were some of the variables considered more important than others, and if so, why? Please add this information and make a reference to Table 2 or better, create a new Table or flowchart showing the authors’ thoughts and decision process including the variables used upon which they based their decision to classify a foot as ‘usable’ or ‘not usable’. Clarification on this matter will strengthen the work as it helps the reader to better understand the authors’ rationale – thank you!

      Line 319. Please start the results section with “A total of 366 feet were analyzed, 196 left and 170 right feet…”

      Line 320. Please add “… and A brand could be identified for… ” – thank you.

      Lines 320-322. Based on the information given in Table 1, there were 12 brands identified as categories plus one category with feet unknown to the authors. Because ‘unknown’ is not a brand, the sentence needs to be rephrased – thank you.

      Lines 353-357. These sentences seem to be missing some text, at least, they do not make sense to this reviewer. In lines 353-355 the authors state that the feet of Trulife and Ossur performed worst. Then in the following lines the authors state that they are (nevertheless??) considered as appropriate for donation. Please clarify – thank you.

      Table 4. Please explain in the corresponding text (lines 350 and following) how the negative signs have to be read. Why was the measurement made against ‘BioQuest’ and not ‘Janton’, and how do the authors explain the difference in the coefficient between these two feet? Both feet were represented with n=1, why is there a difference? Please explain and add the clarification into the text within the Discussion section – thank you.

      Figure 2. Please add to Fig. 2, a, b, and c, as done in Fig. 1. This assists in clarifying matters. Please add this clarification into the text: line 364 = Figure 2a; line 378: delete (Figure 2) and add after ‘NCRPPD’ (Figure 2b); line 379: add (Figure 2c) after ‘K4C’.

      Line 388. Add at the end of the sentence ‘(Figure 3)’.

      Line 395. Please expand this sentence like or similar as proposed “…can be a burden to the recipient LMIC [31, 39,40], as indicated by Marks et al (2019 – Please check PLOS rules!!):” and then have the quotation followed. This will connect the quotation with the text and makes it easier to read.

      Line 469. Please check this sentence – the word ‘design’ seems to be stated twice. If this is correct, consider rephrasing as the sentence reads strangely, thank you.

      Checklist questions: • Question (1): Please add example of ‘completeness’ of a prosthetic foot, as you did for Question 2. • Question (3): Add examples of what the authors consider ‘compliant’: forefoot, heel, middle section? All of these, only one? Usable for light persons, like children if only one part of the foot is too compliant? If so, which one do the authors consider as the most important variable for a foot to be still considered ‘usable’?

      Line 529. Word missing: “..cost of what” was the biggest barrier? Please complete.

      Line 533. Please consider replacing ‘in this way’ with ‘Therefore’ or similar that would connect clearer the content of the previous paragraph with this new one.

      Line 544. Typos: ‘reduce’ instead of ‘reduces’, ‘limit’ instead of ‘limits’.

      Line 567. Stop the sentence after ‘repair of equipment’ and continue with a new sentence starting, for example, with “Hamner et al (please check PLOS rules!!) point out that …” and then add the quotation.

      Line 570. Please delete ‘etc.’ This should not be used in a text as it leaves the reader wondering what else – in this case – could have had an influence. Instead write ‘for example’ and list the three main points that were not considered.

      Line 620. Keep the number correct: the authors tested 306 feet. The number speaks for itself, no need to bolster it. To this reviewer bolstering looks bad, stay with the figures.

      Line 622. Replace ‘are’ with ‘were’, as this was the case for the authors' sample. Samples of other authors might vary.

    1. Pew Research Center has been studying online harassment for several years now. A new report on Americans’ experiences with and attitudes toward online harassment finds that 41% of U.S. adults have personally experienced some form of online harassment – and the severity of the harassment has increased since we last studied it in 2017. We spoke with Emily Vogels, a research associate at the Center focusing on internet and technology research, about the new findings. The interview has been edited for clarity and condensed.

      One of the big takeaways from this report – and, to me, the biggest surprise – is that, while the overall number of people facing online harassment seems to be more or less stable, the nature of the harassment has changed over time. What are some of the most significant ways in which online harassment has worsened since we first started studying it?

      While the overall number of those facing at least one of the six problems we ask about hasn’t changed, this survey finds that the level of harassment is increasing in two key ways: People are more likely to have encountered multiple forms of harassment online, and severe encounters have become more common. When the Center began studying online harassment in 2014, we found that 35% of American adults had experienced it. That grew to 41% in 2017 and remains the same in the new survey. But the shares who have ever experienced more severe forms of harassment – such as physical threats, stalking, sexual harassment or sustained harassment – or multiple forms of harassing behaviors online have both risen substantially in the past three years. This is not the pattern we saw in prior surveys. There has been a markedly steeper rise in these measures since 2017, compared with the change between our 2014 and 2017 studies.

      Also, when we ask people about their most recent harassment experience, they’re more likely than in the past to include these more severe behaviors and involve multiple forms of harassment. And as of 2020, 41% of online harassment targets say their most recent experience spanned multiple locations online – for example, a person being harassed on social media and by text message.

      Does this suggest that online harassment is, to some extent, becoming “normalized”?

      It is commonplace. Roughly four-in-ten American adults say they’ve personally experienced harassment online. These numbers are more staggering when we look at adults under 30 – 64% of them say they’ve faced such issues online and 48% say they’ve experienced at least one of the more severe types of harassment. In addition, previous work by the Center found that a majority of adults overall have witnessed others being harassed online. Even when online harassment hasn’t been the focus of our research, we have seen this online incivility play a role in people’s perceptions and experiences of other online phenomena, such as online dating, political discussions on social media and social media in general.

      The Center’s past research on harassment has shown there are some demographic differences in the kinds of problems people face online. What did this survey show in particular about men, women and harassment?

      Men are slightly more likely than women to encounter at least one of the six types of online harassment we asked about, but there are notable differences in the types of harassment they encounter. Men are more likely than women to be called an offensive name or be physically threatened. Women are about three times as likely as men to face sexual harassment online, and younger women are even more likely to experience this type of abuse.

      Another difference in the new survey is that sexual harassment of women has doubled in the past three years, while the rate of sexual harassment among men is largely the same as in 2017. Women who have been the target of online harassment also report finding their most recent harassment experiences to be more upsetting than their male counterparts.

      There are also differences in where men and women encountered harassment online in their most recent experience. Social media sites are the most common location regardless of gender, but a larger share of women who have been harassed say their most recent incident was on social media, compared with men who have been targeted. Men targeted in online harassment are more likely than women to have been harassed while online gaming or while using an online forum or discussion site.

      Beyond personal experiences, men and women express different attitudes about online harassment, with women more likely to say it’s a major problem. And prior Center work finds that a greater share of women than men value people feeling safe online over people being able to speak their minds freely. When it comes to how to address online harassment, women are more optimistic than men about a variety of potential solutions, including criminal charges for social media users who harass others online, temporary or permanent bans for users who harass others, and social media companies proactively deleting bullying or harassing posts.

      Interesting. To what extent do those gender differences in harassment experiences reflect differences in men’s and women’s online activities? Men are more likely to report they had these types of experiences in online forums or gaming platforms. Is that because more men than women use such platforms?

      It’s a bit complicated. Prior work from the Center suggests there are modest gender differences in gaming, with men being more likely than women to at least sometimes play video games. But this study didn’t ask if people played games online, so we can’t say whether the gender differences in harassment incidents tied to gaming hold when looking at just online gamers. It’s worth keeping in mind that the data on where people were harassed online is for people’s most recent incident, not every incident these folks may have encountered in the past. Prior Center findings show people may stop engaging in an activity – for example, withdrawing from a platform or deleting a social media account – if they encounter harassment.

      Similarly, do the age differences in those who say they have experienced harassment reflect how many, and how frequently, people of different ages are online? In other words, does the fact that far more adults under 30 report experiencing online harassment reflect younger people spending much more of their lives online than older folks?

      We don’t quite have enough evidence to make this causal connection, but the broad patterns are pretty clear. This survey found that adults under 30 consistently experience each of the six forms of harassment we asked about at higher rates than any other age group. The Center’s previous work does show that younger adults are more likely to use the internet and to use it almost constantly. Our research on teens in 2018 found that greater exposure to the internet puts people at a higher likelihood of encountering harassment at some point online. It’s worth noting, though, that non-internet users were not asked about their possible experiences with online harassment. So, if people stopped using the internet sometime after they were harassed online, our data wouldn’t capture their earlier harassment experience.

      The survey finds that 75% of targets of online harassment say their most recent experience was on social media. Has this been true since the Center began researching online harassment? Do people feel social media companies have done enough to discourage this behavior?

      The share of online harassment targets who say their most recent harassing encounter took place on social media is growing – up 17 percentage points since 2017. The Center’s prior work reveals a variety of negative opinions Americans hold about social media companies, and when it comes to Americans’ views of how these companies handle online harassment, the pattern of criticism continues. Fully 79% of Americans think social media companies are doing an only fair to poor job when it comes to addressing online harassment or bullying on their platforms. Based on previous Center findings, American teens hold similarly negative views of social media companies’ ability to address these issues. Many Americans suggest that permanent bans for users who harass others and required identity disclosure to use these platforms would be very effective ways to combat harassment on social media.

      To what extent do you think that the fact 2020 was an election year accounts for the increase in the number of people who say they were harassed because of their political views?

      Politics was already a heated issue long before this election. According to other research from the Center, partisan antipathy has been growing for years. Americans increasingly say they find they have less in common politically with people with whom they disagree, and they see political discussions online as less respectful, less civil and angrier than political discussions in other places.

      There are also some striking demographic differences among those who say they’ve been harassed for their politics. Online harassment targets who are White or male – 56% and 57% of each – are particularly likely to think their harassment was a result of their political views. This is especially true for White men who say they’ve been targeted, at 61%. Other groups commonly point to other aspects of their identity as the reason they faced harassment online. For example, roughly half or more Black or Hispanic online harassment targets – 54% and 47% respectively – identify their race or ethnicity as a reason they were harassed, while only 17% of their White counterparts say the same.

      Bear in mind that politics isn’t the only perceived reason for harassment being on the rise. Over the past several years, rising shares of online harassment targets have said they think they were harassed because of their gender, race, ethnicity, religion or sexual orientation.

      The government reports highlight that cyberbullying is widespread and often chronic, affecting many youth for long periods.

  3. Dec 2025
    1. Author response:

      The following is the authors’ response to the original reviews

      Reviewer #1 (Public review):

      The manuscript by Shan et al seeks to define the role of the CHI3L1 protein in macrophages during the progression of MASH. The authors argue that the Chil1 gene is expressed highly in hepatic macrophages. Subsequently, they use Chil1 flx mice crossed to Clec4F-Cre or LysM-Cre to assess the role of this factor in the progression of MASH using a high-fat, high-cholesterol diet (HFHC). They found that loss of Chil1 in KCs (Clec4F Cre) leads to enhanced KC death and worsened hepatic steatosis. Using scRNA seq, they also provide evidence that loss of this factor promotes gene programs related to cell death. From a mechanistic perspective, they provide evidence that CHI3L serves as a glucose sink and thus loss of this molecule enhances macrophage glucose uptake and susceptibility to cell death. Using a bone marrow macrophage system and KCs they demonstrate that cell death induced by palmitic acid is attenuated by the addition of rCHI3L1. While the article is well written and potentially highlights a new mechanism of macrophage dysfunction in MASH, there are some concerns about the current data that limit my enthusiasm for the study in its current form. Please see my specific comments below.

      (1) The authors' interpretation of the results from the KC (Clec4F) and MdM KO (LysM-Cre) experiments is flawed. For example, in Figure 2 the authors present data that knockout of Chil1 in KCs using Clec4f Cre produces worse liver steatosis and insulin resistance. However, in supplemental Figure 4, they perform the same experiment in LysM-Cre mice and find a somewhat different phenotype. The authors appear to be under the impression that LysM-Cre does not cause recombination in KCs and therefore interpret this data to mean that Chil1 is relevant in KCs and not MdMs. However, LysM-Cre DOES lead to efficient recombination in KCs and therefore Chil1 expression will be decreased in both KCs and MdM (along with PMNs) in this line.

      Therefore, a phenotype observed with KC-KO should also be present in this model unless the authors argue that loss of Chil1 from the MdMs has the opposite phenotype of KCs and therefore attenuates the phenotype. The Cx3Cr1 CreER tamoxifen inducible system is currently the only macrophage Cre strategy that will avoid KC recombination. The authors need to rethink their results with the understanding that Chil1 is deleted from KCs in the LysM-Cre experiment. In addition, it appears that only one experiment was performed, with only 5 mice in each group for both the Clec4f and LysM-Cre data. This is generally not enough to make a firm conclusion for MASH diet experiments.

      We thank the reviewer for raising this important point regarding our data interpretation. We have carefully examined the deletion efficiency of Chi3l1 in primary Kupffer cells (KCs) from Lyz2<sup>∆Chil1</sup> (LysM-Cre) mice. Our results show roughly a 40% reduction in Chi3l1 expression at both the mRNA and protein levels (Revised Manuscript, Figure S7B and C). Given this modest decrease, Chi3l1 deletion in KCs of Lyz2<sup>∆Chil1</sup> mice was incomplete, which likely accounts for the phenotypic differences observed between Clec4f<sup>∆Chil1</sup> and Lyz2<sup>∆Chil1</sup> mice in the MASLD model.

      Furthermore, we have increased the sample size in both the Clec4f- and LysM-Cre experiments to 9–12 mice per group following the HFHC diet, thereby strengthening the statistical power and reliability of our findings (Revised Figures 2 and S8).

      (2) The mouse weight gain is missing from Figure 2 and Supplementary Figure 4. This data is critical to interpret the changes in liver pathology, especially since they have worse insulin resistance.

      We thank the reviewer for this valuable comment. We have now included the mouse body weight data in the revised manuscript (Figure 2A, B and Figures S8A, B). Compared with mice on a normal chow diet (NCD), all groups exhibited progressive weight gain during HFHC diet feeding. Notably, Clec4f<sup>∆Chil1</sup> mice gained significantly more body weight than Chil1<sup>fl/fl</sup> controls, whereas Lyz2<sup>∆Chil1</sup> mice showed a similar weight gain trajectory to Chil1<sup>fl/fl</sup> mice under the same conditions.

      (3) Figure 4 suggests that KC death is increased with KO of Chil1. However, this data cannot be concluded from the plots shown. In Supplementary Figure 6 the authors provide a more appropriate gating scheme to quantify resident KCs that includes TIM4. The TIM4 data needs to be shown and quantified in Figure 4. As shown in Supplementary Figure 6, the F4/80 hi population is predominantly KCs at baseline; however, this is not true with MASH diets. Most of the recruited MoMFs also reside in the F4/80 hi gate where they can be identified by their lower expression of TIM4. The MoMF gate shown in this figure is incorrect. The CD11b hi population is predominantly PMNs, monocytes, and cDC2, not MoMFs (PMID:33997821). In addition, the authors should stain the tissue for TIM4, which would also be expected to reveal a decrease in the number of resident KCs.

      We thank the reviewer for raising this critical point regarding the gating strategy and interpretation of KC death. We have now refined our flow cytometry gating based on the reviewer’s suggestion. Specifically, we analyzed TIM4 expression and attempted to identify a TIM4<sup>low</sup> MoMF population in our model. However, we did not detect a distinct TIM4<sup>low</sup> population, likely because our mice were fed the HFHC diet for only 16 weeks and had not yet developed liver fibrosis. We therefore reason that MoMFs have not fully acquired TIM4 expression at this stage.

      To improve our analysis, we referred to published strategies (PMID: 41131393; PMID: 32562600) and gated KCs as CD45<sup>+</sup>CD11b<sup>+</sup>F4/80<sup>hi</sup> TIM4<sup>hi</sup> and MoMFs as CD45<sup>+</sup>Ly6G<sup>-</sup>CD11b<sup>+</sup>F4/80<sup>low</sup> TIM4<sup>low/-</sup>. Using this approach, we observed a gradual reduction of KCs and a corresponding increase in MoMFs in WT mice, with a significantly faster loss of KCs in Chil1<sup>-/-</sup> mice (Revised Figure 4C, D; Figure S10A).

      Furthermore, immunofluorescence staining for TIM4 combined with TUNEL or cleaved caspase-3 confirmed an increased number of dying KCs in Chil1<sup>-/-</sup> mice compared to WT following HFHC diet feeding (Revised Figure 4E; Figure S10B).

      (4) While the Clec4F Cre is specific to KCs, there is also less data about the impact of the Cre system on KC biology. Therefore, when looking at cell death, the authors need to include some mice that express Clec4F cre without the floxed allele to rule out any effects of the Cre itself. In addition, if the cell death phenotype is real, it should also be present in LysM Cre system for the reasons described above. Therefore, the authors should quantify the KC number and dying KCs in this mouse line as well.

      We thank the reviewer for raising this important point. During our study, we indeed observed an increased number of KCs in Clec4f-Cre mice compared to WT controls, suggesting that the Clec4f-Cre system itself may modestly affect KC homeostasis. To address this, we compared KC numbers between Clec4f<sup>∆Chil1</sup> and Clec4f-Cre mice and found that Clec4f<sup>∆Chil1</sup> mice displayed a significant reduction in KC numbers following HFHC diet feeding. Moreover, co-staining for TIM4 and TUNEL revealed a marked increase in KC death in Clec4f<sup>∆Chil1</sup> mice relative to Clec4f-Cre mice, indicating that the observed phenotype is attributable to Chil1 deletion rather than Cre expression alone. These data have been reported in our related manuscript (He et al., bioRxiv, 2025.09.26.678483; doi: 10.1101/2025.09.26.678483).

      In addition, we quantified KC numbers and KC death in the Lyz2-Cre line. TIM4/TUNEL co-staining showed comparable levels of KC death between Chil1<sup>fl/fl</sup> and Lyz2<sup>∆Chil1</sup> mice (Revised Figure S11B). Consistently, flow cytometry analyses revealed no significant differences in KC numbers between these two groups before (0 weeks) or after (20 weeks) HFHC diet feeding (Revised Figures S11C, D). As discussed in our response to Comment 1, this may be due to the incomplete deletion of Chi3l1 in KCs (<50%) in the Lyz2-Cre line, which likely attenuates the phenotype.

      (5) I am somewhat concerned about the conclusion that Chil1 is highly expressed in liver macrophages. Looking at our own data and those from the Liver Atlas it appears that this gene is primarily expressed in neutrophils. At a minimum, the authors should address the expression of Chil1 in macrophage populations from other publicly available datasets in mouse MASH to validate their findings (several options include - PMID: 33440159, 32888418, 32362324). If expression of Chil1 is not present in these other data sets, perhaps an environmental/microbiome difference may account for the distinct expression pattern observed. Either way, it is important to address this issue.

      We thank the reviewer for this insightful comment and agree that analysis of scRNA-seq data, including our own and those reported in the Liver Atlas as well as in the referenced studies (PMID: 33440159, 32888418, 32362324), indicates that Chil1 is predominantly expressed in neutrophils.

      However, our immunofluorescence staining under normal physiological conditions revealed that Chi3l1 protein is primarily localized in Kupffer cells (KCs), as demonstrated by strong co-staining with TIM4 (Revised Figure 1E). In MASLD mouse models induced by HFHC or MCD diets, we observed that both KCs and monocyte-derived macrophages (MoMFs) express Chi3l1, with particularly high levels in MoMFs.

      We speculate that the apparent discrepancy between scRNA-seq datasets and our in situ findings may reflect differences in cellular proportions and detection sensitivity. Since hepatic macrophages (particularly KCs and MoMFs) constitute a larger proportion of total liver immune cells compared with neutrophils, their contribution to total Chi3l1 protein levels in tissue staining may appear dominant, despite lower transcript abundance per cell in sequencing datasets. We have included a discussion of this point in the revised manuscript to clarify this distinction (Revised manuscript, page 8, lines 341-350).

      Minor points:

      (1) Were there any changes in liver fibrosis or liver fibrosis markers present in these experiments?

      We assessed liver fibrosis using Sirius Red staining and α-SMA Western blot analysis.

      We found no induction of liver fibrosis in our HFHC-induced MASLD model (Revised Figure S1A, B), but a clear elevation of fibrosis markers in the MCD-induced MASH model (Revised Figure S6A, B).

      (2) In Supplementary Figure 3, the authors do a western blot for CHI3L1 in BMDMs. This should also be done for KCs isolated from these mice. Does this antibody work for immunofluorescence? Staining liver tissue would provide valuable information on the expression patterns.

      We have included qPCR and western blot for Chi3l1 in isolated primary KCs from Lyz2<sup>∆Chil1</sup> mice. The data show a slight, non-significant reduction in both mRNA and protein levels in KCs (Revised Figure S7B, C). The immunofluorescence staining on liver tissue showed that Chi3l1 is more likely expressed in the plasma membranes of TIM4<sup>+</sup> F4/80<sup>+</sup> KCs both under NCD and HFHC diet (Revised Figure 1E).

      (3) What is the impact of MASH diet feeding on Chil1 expression in KCs or in the liver in general?

      In both our MASLD and MASH models, diet feeding consistently upregulates Chi3l1 in KCs or in the liver in general (Revised Figure 1F, G, S6C,D).

      (4) In Figure S1 the authors show tSNE plots of various monocyte and macrophage genes in the liver. Are these plots both diets together? How do things look when comparing these markers between the STD and HFHC diet? The population of recruited LAMs seems very small for 16 weeks of diet. Moreover, Chil1 should also be shown on these tSNE plots as well.

      Yes, these plots are both diets together. When compared separately, the core marker expression is consistent between NCD and HFHC diets. However, the HFHC diet induces a relative increase in KC marker expression within the MoMF cluster, suggesting phenotypic adaptation (Author response image 1A, below). Moreover, Chil1 expression on the t-SNE plot was shown (Author response image 1B, below). However, compared to lineage-specific marker genes, Chi3l1 expression is rather low.

      Author response image 1.

      Gene expression levels of lineage-specific marker genes in monocytes/macrophages clusters between NCD and HFHC diets. (A) UMAP plots show the scaled expression changes of lineage-specific markers in KCs/monocyte/macrophage clusters from mice under NCD and HFHC diets. Color represents the level of gene expression. (B) UMAP plots show the scaled expression changes of Chil1 in KCs/monocyte/macrophage clusters from mice under NCD and HFHC diets. Color represents the level of gene expression.

      (5) In Figure 5, the authors demonstrate that CHI3L1 binds to glucose. However, given that all chitin molecules bind to carbohydrates, is this a new finding? The data showing that CHI3L is elevated in the serum after diet is interesting. What happens to serum levels of this molecule in KC KO or total macrophage KO mice? Do the authors think it primarily acts as a secreted molecule or in a cell-intrinsic manner?

      We thank the reviewer for these insightful comments, which helped us clarify the novelty of our findings.

      (1) Novelty of CHI3L1-Glucose Binding:

      While chitin-binding domains are known to interact with carbohydrate polymers, our key discovery is that CHI3L1 (YKL-40), a mammalian chitinase-like protein lacking enzymatic activity, specifically binds to glucose, a simple monosaccharide. This differs fundamentally from canonical binding to insoluble polysaccharides such as chitin and reveals a potential role for CHI3L1 in monosaccharide recognition, linking it to glucose metabolism and energy sensing. We clarified this point in the revised manuscript (page 9, lines 374-379).

      (2) Serum CHI3L1 in Knockout Models:

      Consistent with the reviewer’s suggestion, serum Chi3l1 levels are altered in our knockout models:

      KC-specific KO (Clec4f<sup>ΔChil1</sup>): Under normal chow, serum CHI3L1 is markedly reduced compared to controls and remains lower following HFHC feeding (Author response image 2A, below), indicating that Kupffer cells are the main source of circulating CHI3L1 under basal and disease conditions.

      Macrophage KO (Lyz2<sup>ΔChil1</sup>): No significant changes were observed between Chil1<sup>fl/fl</sup> and Lyz2<sup>ΔChil1</sup> mice under either diet (Author response image 2B, below), likely due to minimal monocyte-derived macrophage recruitment in this HFHC model (see Revised Figure 4C,D).

      (3) Secreted vs. Cell-Intrinsic Role:

      CHI3L1 predominantly localizes to the KC plasma membrane, consistent with a secreted role, and its serum reduction in KC-specific knockouts supports the physiological relevance of its secreted role. While cell-intrinsic effects have been reported elsewhere, our current data do not address this in KCs and warrant future investigation.

      Author response image 2.

      Chi3l1 expression in serum before and after HFHC in CKO mice. (A) Western blot to detect Chi3l1 expression in serum of Chil1<sup>fl/fl</sup> and Clec4f<sup>ΔChil1</sup> mice before and after 16 weeks of HFHC diet. n=3 mice/group. (B) Western blot to detect Chi3l1 expression in serum of Chil1<sup>fl/fl</sup> and Lyz2<sup>ΔChil1</sup> mice before and after 16 weeks of HFHC diet. n=3 mice/group.

      Reviewer #2 (Public review):

      The manuscript from Shan et al., sets out to investigate the role of Chi3l1 in different hepatic macrophage subsets (KCs and moMFs) in MASLD following their identification that KCs highly express this gene. To this end, they utilise Chi3l1KO, Clec4f-CrexChi3l1fl, and Lyz2-CrexChi3l1fl mice and WT controls fed a HFHC for different periods of time.

      Major:

      Firstly, the authors perform scRNA-seq, which led to the identification of Chi3l1 (encoded by Chil1) in macrophages. However, this is on a limited number of cells (especially in the HFHC context), and hence it would also be important to validate this finding in other publicly available MASLD/Fibrosis scRNA-seq datasets. Similarly, it would be important to examine if cells other than monocytes/macrophages also express this gene, given the use of the full KO in the manuscript. Along these lines, utilisation of publicly available human MASLD scRNA-seq datasets would also be important to understand where the increased expression observed in patients comes from and the overall relevance of macrophages in this finding.

      We thank the reviewer for this valuable suggestion and acknowledge the limited number of cells analyzed under the HFHC condition in our original dataset. To strengthen our findings, we have now examined four additional publicly available scRNA-seq datasets: two from mouse models and two from human MASLD patients (Revised Figure S3, manuscript page 4, lines 164-172). Across these datasets, the specific cell type showing the highest Chil1 expression varied somewhat between studies, likely reflecting model differences and disease stages. Nevertheless, Chil1 expression was consistently enriched in hepatic macrophage populations, including both Kupffer cells and infiltrating macrophages, in mouse and human livers. Notably, Chil1 expression was higher in infiltrating macrophages compared to resident Kupffer cells, supporting its upregulation during MASLD progression. These additional analyses confirm the robustness and cross-species relevance of our finding that macrophages are the primary Chil1-expressing cell type in the liver.

      Next, the authors use two different Cre lines (Clec4f-Cre and Lyz2-Cre) to target KCs and moMFs respectively. However, no evidence is provided to demonstrate that Chil1 is only deleted from the respective cells in the two Cre lines. Thus, KCs and moMFs should be sorted from both lines, and a qPCR performed to check the deletion of Chil1. This is especially important for the Lyz2-Cre, which has been routinely used in the literature to target KCs (as well as moMFs) and has (at least partial) penetrance in KCs (depending on the gene to be floxed). Also, while the Clec4f-Cre mice show an exacerbated MASLD phenotype, there is currently no baseline phenotype of these animals (or the Lyz2-Cre) in steady state in relation to the same readouts provided in MASLD and the macrophage compartment. This is critical to understand if the phenotype is MASLD-specific or if loss of Chi3l1 already affects the macrophages under homeostatic conditions.

      We thank the reviewer for raising this important point.

      (1) Chil1 deletion efficiency in Clec4f-Cre and Lyz2-Cre lines:

      We have assessed the efficiency of Chil1 deletion in both Lyz2<sup>∆Chil1</sup> and Clec4f<sup>∆Chil1</sup> mice by evaluating mRNA and protein levels of Chi3l1. For the Lyz2<sup>∆Chil1</sup> mice, we measured Chi3l1 expression in bone marrow-derived macrophages (BMDMs) and primary Kupffer cells (KCs). Both qPCR (for mRNA) and Western blotting (for protein) reveal that Chi3l1 is almost undetectable in BMDMs from Lyz2<sup>∆Chil1</sup> mice when compared to Chil1<sup>fl/fl</sup> controls. In contrast, we observe no significant reduction in Chi3l1 expression in KCs from these animals (Revised Figure S7B, C), suggesting that Chil1 is deleted in BMDMs but not in KCs in the Lyz2-Cre line.

      For the Clec4f<sup>∆Chil1</sup> mice, both mRNA and protein levels of Chi3l1 are barely detectable in BMDMs and primary KCs when compared to Chil1<sup>fl/fl</sup> controls (Revised Figure S4B, C). However, we did observe a faint Chi3l1 band in KCs of Clec4f<sup>∆Chil1</sup> mice, which we suspect is due to contamination from LSECs during the KC isolation process, given that the TIM4 staining for KCs was approximately 90%. Overall, Chil1 is deleted in both KCs and BMDMs in the Clec4f-Cre line.

      Notably, since we observed a pronounced MASLD phenotype in Clec4f-Cre mice but not in Lyz2-Cre mice, these findings further underscore the critical role of Kupffer cells in the progression of MASLD.

      (2) Whether the phenotype is MASLD-specific or whether loss of Chi3l1 already affects the macrophages under homeostatic conditions: We have now included phenotypic data from Clec4f<sup>ΔChil1</sup> mice (KC-specific KO) and Lyz2<sup>∆Chil1</sup> mice (MoMF-specific KO) fed an NCD for 16 weeks (Revised Figure 2A-F, S8A-F). In brief, there is no baseline difference between Chil1<sup>fl/fl</sup> and Clec4f<sup>ΔChil1</sup> or Lyz2<sup>∆Chil1</sup> mice in steady state in relation to the same readouts provided in MASLD.

      Next, the authors suggest that loss of Chi3l1 promotes KC death. However, to examine this, they use Chi3l1 full KO mice instead of the Clec4f-Cre line. The reason for this is not clear, because in this regard, it is now not clear whether the effects are regulated by loss of Chi3l1 from KCs or from other hepatic cells (see point above). The authors mention that Chi3l1 is a secreted protein, so does this mean other cells are also secreting it, and are these needed for KC death? In that case, this would not explain the phenotype in the CLEC4F-Cre mice. Here, the authors do perform a basic immunophenotyping of the macrophage populations; however, the markers used are outdated, making it difficult to interpret the findings. Instead of F4/80 and CD11b, which do not allow a perfect discrimination of KCs and moMFs, especially in HFHC diet-fed mice, more robust and specific markers of KCs should be used, including CLEC4F, VSIG4, and TIM4.

      We thank the reviewer for raising this important point. We performed experiments in the Clec4f<sup>∆Chil1</sup> (KC-specific KO) model. The phenotype in these mice closely mirrors that of the full KO: we observed a significant reduction in KC numbers and a concurrent increase in KC death in Clec4f<sup>∆Chil1</sup> mice compared to Clec4f-Cre mice following an HFHC diet. We have reported these data in the following related manuscript (Figure 6D-G). This confirms that the loss of CHI3L1 specifically from KCs is sufficient to drive this effect.

      Hyperactivated Glycolysis Drives Spatially-Patterned Kupffer Cell Depletion in MASLD Jia He, Ran Li, Cheng Xie, Xiane Zhu, Keqin Wang, Zhao Shan bioRxiv 2025.09.26.678483; doi: https://doi.org/10.1101/2025.09.26.678483

      While other hepatic cells (e.g., neutrophils and liver sinusoidal endothelial cells) also express Chi3l1, our data indicate that KC-secreted Chi3l1 plays a dominant and cell-autonomous role in maintaining KC viability. The potential contribution of other cellular sources to this phenotype remains an interesting direction for future study.

      We apologize for the lack of clarity in our initial immunophenotyping. We have revised the flow cytometry data to clearly show that KCs are rigorously defined as TIM4+ cells (Revised Figure 4C, D).

      Additionally, while the authors report a reduction of KCs in terms of absolute numbers, there are no differences in proportions. Thus, coupled with a decrease also in moMF numbers at 16 weeks (when one would expect an increase if KCs are decreased, based on previous literature), this suggests that the differences in KC numbers may be due to differences in total cell counts obtained from the obese livers compared with controls. To rule this out, total cell counts and total live CD45+ cell counts should be provided. Here, the authors also provide TUNEL staining in situ to demonstrate increased KC death, but as it is typically notoriously difficult to visualise dying KCs in MASLD models, here it would be important to provide more images. Similarly, there appear to be many more TUNEL+ cells in the KO that are not KCs; thus, it would be important to examine this in the CLEC4F-Cre line to ascertain direct versus indirect effects on cell survival.

      We thank the reviewer for raising this important point. We have now included the total cell counts and total live CD45<sup>+</sup> cell counts, which showed similar numbers between WT and Chil1<sup>-/-</sup> mice post HFHC diet (Author response image 3A, below).

      Moreover, we included cleaved caspase-3 and TIM4 co-staining in WT and Chil1<sup>-/-</sup> mice before and after HFHC diets, which confirmed increased KC death in Chil1<sup>-/-</sup> mice (Revised Figure S10B). We have compared KC numbers and KC death between Clec4f-Cre and Clec4f<sup>∆Chil1</sup> mice under NCD and HFHC diet in the following manuscript (Figure 6D-G). The data showed similar KC numbers under NCD and reduced KC numbers in Clec4f<sup>∆Chil1</sup> mice compared to Clec4f-Cre mice under HFHC diet, which confirms that the effect on cell survival is due to loss of Chi3l1 rather than Cre insertion.

      Hyperactivated Glycolysis Drives Spatially-Patterned Kupffer Cell Depletion in MASLD Jia He, Ran Li, Cheng Xie, Xiane Zhu, Keqin Wang, Zhao Shan bioRxiv 2025.09.26.678483; doi: https://doi.org/10.1101/2025.09.26.678483

      Author response image 3.

      Number of total cells and total live CD45+ cells in liver of WT and Chil1<sup>-/-</sup> mice. (A) Numbers of total cells and total live CD45+ cells per liver were statistically analyzed. n = 3-4 mice per group.

      Finally, the authors suggest that Chi3l1 exerts its effects through binding glucose and preventing its uptake. They use ex vivo/in vitro models to assess this with rChi3l1; however, here I miss the key in vivo experiment using the CLEC4F-Cre mice to prove that this in KCs is sufficient for the phenotype. This is critical to confirm the take-home message of the manuscript.

      We agree that it is essential to confirm the in vivo relevance of Chi3l1-mediated glucose regulation in Kupffer cells (KCs). Our data suggest that KCs undergo cell death not because they express Chi3l1 per se, but because they exhibit a glucose-hungry metabolic phenotype that makes them uniquely dependent on Chi3l1-mediated regulation of glucose uptake. To directly assess this mechanism in vivo, we injected 2-NBDG, a fluorescent glucose analog, into overnight-fasted and refed mice and quantified its uptake in hepatic KCs. Notably, Chi3l1-deficient KCs exhibited significantly increased 2-NBDG uptake compared with controls, and this effect was markedly suppressed by co-treatment with recombinant Chi3l1 (rChi3l1) (Revised Figure 6G, H). These findings demonstrate that Chi3l1 regulates glucose uptake by KCs in vivo, supporting our proposed mechanism that Chi3l1 controls KC metabolic homeostasis through modulation of glucose availability.

      Minor points:

      (1) Some key references of macrophage heterogeneity in MASLD are not cited: PMID: 32362324 and PMID: 32888418.

      We thank the reviewer for highlighting these critical references and have included them in the introduction (Revised manuscript, page 2, lines 64-73).

      (2) In the discussion, Figure 3H is referenced (Serum data), but there is no Figure 3H. If the authors have this data (increased Chi3l1 in serum of mice fed HFHC diet), what happens in CLEC4F-Cre mice fed the diet? Is this lost completely? This comes back to the point regarding the specificity of expression.

      We apologize for the mistake. It should be Figure 5F in the revised version, in which serum Chi3l1 was significantly upregulated after HFHC diet. Moreover, under a normal chow diet (NCD), serum CHI3L1 is significantly lower in Clec4f<sup>ΔChil1</sup> mice compared to controls (Chil1<sup>fl/fl</sup>). Following an HFHC diet, levels increase in both genotypes but remain relatively lower in the KC-KO mice (please see Author response image 2A above). These data strongly suggest that Kupffer cells (KCs) are the primary source of serum CHI3L1 under basal conditions and a major contributor during MASLD progression.

      Reviewer #3 (Public review):

      This paper investigates the role of Chi3l1 in regulating the fate of liver macrophages in the context of metabolic dysfunction leading to the development of MASLD. I do see value in this work, but some issues exist that should be addressed as well as possible.

      (1) Chi3l1 has been linked to macrophage functions in MASLD/MASH, acute liver injury, and fibrosis models before (e.g., PMID: 37166517), which limits the novelty of the current work. It has even been linked to macrophage cell death/survival (PMID: 31250532) in the context of fibrosis, which is a main observation from the current study.

      We thank the reviewer for this insightful comment regarding the novelty of our findings. We agree that Chi3l1 has previously been linked to macrophage survival and function in models of liver injury and fibrosis (e.g., PMID: 37166517, 31250532). However, our study focuses specifically on the early stage of MASLD, prior to the onset of fibrosis, revealing a distinct mechanistic role for CHI3L1 in this context.

      We demonstrate that CHI3L1 directly interacts with extracellular glucose to regulate its cellular uptake—a previously unrecognized biochemical function. Furthermore, we show that CHI3L1’s protective role is metabolically dependent, safeguarding glucose-dependent Kupffer cells (KCs) but not monocyte-derived macrophages (MoMFs). This metabolic dichotomy and the direct link between CHI3L1 and glucose sensing represent conceptual advances beyond previous studies of CHI3L1 in fibrotic or injury models.

      (2) The LysCre-experiments differ from experiments conducted by Ariel Feldstein's team (PMID: 37166517). What is the explanation for this difference? - The LysCre system is neither specific to macrophages (it also depletes in neutrophils, etc), nor is this system necessarily efficient in all myeloid cells (e.g., Kupffer cells vs other macrophages). The authors need to show the efficacy and specificity of the conditional KO regarding Chi3l1 in the different myeloid populations in the liver and the circulation.

      We thank the reviewer for this important comment and the opportunity to clarify both the efficiency and specificity of our conditional knockouts, as well as the differences from the study by Feldstein’s group (PMID: 37166517).

      (1) Chil1 deletion efficiency in Clec4f-Cre and Lyz2-Cre lines:

      We have assessed the efficiency of Chil1 deletion in both Lyz2<sup>∆Chil1</sup> and Clec4f<sup>∆Chil1</sup> mice by evaluating mRNA and protein levels of Chi3l1. For the Lyz2<sup>∆Chil1</sup> mice, we measured Chi3l1 expression in bone marrow-derived macrophages (BMDMs) and primary Kupffer cells (KCs). Both qPCR (for mRNA) and Western blotting (for protein) reveal that Chi3l1 is almost undetectable in BMDMs from Lyz2<sup>∆Chil1</sup> mice when compared to Chil1<sup>fl/fl</sup> controls. In contrast, we observe no significant reduction in Chi3l1 expression in KCs from these animals (Revised Figure S7B, C), suggesting that Chil1 is deleted in BMDMs but not in KCs in Lyz2-Cre line.

      For the Clec4f<sup>∆Chil1</sup> mice, both mRNA and protein levels of Chi3l1 are barely detectable in BMDMs and primary KCs when compared to Chil1<sup>fl/fl</sup> controls (Revised Figure S4B, C). However, we did observe a faint Chi3l1 band in KCs of Clec4f<sup>∆Chil1</sup> mice, which we suspect is due to contamination from LSECs during the KC isolation process, given that the TIM4 staining for KCs was approximately 90%. Overall, Chil1 is deleted in both KCs and BMDMs in the Clec4f-Cre line.

      Notably, since we observed a pronounced MASLD phenotype in Clec4f-Cre mice but not in Lyz2-Cre mice, these findings further underscore the critical role of Kupffer cells in the progression of MASLD.

      (2) Explanation for Differences from Feldstein et al. (PMID: 37166517):

      Our findings differ from those reported by Feldstein’s group primarily due to differences in disease stage and model. We used a high-fat, high-cholesterol (HFHC) diet to model early-stage MASLD characterized by steatosis and inflammation without fibrosis (Revised Figure S1A, B). In this context, we observed KC death but minimal MoMF infiltration (Revised Figure 4D). Accordingly, deletion of Chi3l1 in MoMFs (Lyz2<sup>∆Chil1</sup>) had no measurable effect on insulin resistance or steatosis, consistent with limited MoMF involvement at this stage. In contrast, the Feldstein study employed a CDAA-HFAT diet that models later-stage MASH with fibrosis. In that setting, Lyz2<sup>∆Chil1</sup> mice showed reduced recruitment of neutrophils and MoMFs, which likely underlies the attenuation of fibrosis and disease severity reported. Together, these data support a model in which KCs and MoMFs play temporally distinct roles during MASLD progression: KCs primarily drive early lipid accumulation and metabolic dysfunction, whereas MoMFs contribute more substantially to inflammation and fibrosis at later stages.

      (3) The conclusions are exclusively based on one MASLD model. I recommend confirming the key findings in a second, ideally a more fibrotic, MASH model.

      We thank the reviewer for this valuable suggestion to validate our findings in an additional MASH model. We have now included data from a methionine- and choline-deficient (MCD) diet–induced MASH model, which exhibits pronounced hepatic lipid accumulation and fibrosis (Revised Figure S6A, B). Consistent with our HFHC results, Clec4f<sup>∆Chil1</sup> mice displayed exacerbated MASH progression in this model, including increased lipid deposition, inflammation, and fibrosis (Revised Figure S6E-G). These findings confirm that CHI3L1 deficiency in Kupffer cells promotes hepatic lipid accumulation and disease progression across distinct MASLD/MASH models.

      (4) Very few human data are being provided (e.g., no work with own human liver samples, work with primary human cells). Thus, the translational relevance of the observations remains unclear.

      We thank the reviewer for this important comment regarding translational relevance. We fully agree that validation in human liver samples would further strengthen our study. However, obtaining tissue from early-stage steatotic livers is challenging due to the asymptomatic nature of this disease stage. Nonetheless, multiple studies have consistently reported Chi3l1 upregulation in human fibrotic and steatotic liver disease (PMID: 31250532, 40352927, 35360517), supporting the clinical significance of our mechanistic findings. We have now expanded the Discussion to highlight these human data and better contextualize our results within the spectrum of human MASLD/MASH progression (Revised manuscript, page 9, lines 390-394).

      Minor points:

      The authors need to follow the new nomenclature (e.g., MASLD instead of MAFLD, e.g., in Figure 1).

      "MASLD" used throughout.

      We again thank the reviewers for their rigorous critique. We also thank eLife for fostering an environment of fairness and transparency that enables authors to communicate openly and present their data honestly.

      Reference

      (1) Tran S, Baba I, Poupel L, et al. (2020) Impaired Kupffer Cell Self-Renewal Alters the Liver Response to Lipid Overload during Non-alcoholic Steatohepatitis. Immunity 53, 627-640.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      In this manuscript, Chengjian Zhao et al. focused on the interactions between vascular, biliary, and neural networks in the liver microenvironment, addressing the critical bottleneck that the lack of high-resolution 3D visualization has hindered understanding of these interactions in liver disease.

      Strengths:

      This study developed a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized CUBIC tissue clearing. This method enables the simultaneous 3D visualization of spatial networks of the portal vein, hepatic artery, bile ducts, and central vein in the mouse liver. The authors reported a perivascular structure termed the Periportal Lamellar Complex (PLC), which is identified along the portal vein axis. This study clarifies that the PLC comprises CD34⁺Sca-1⁺ dual-positive endothelial cells with a distinct gene expression profile, and reveals its colocalization with terminal bile duct branches and sympathetic nerve fibers under physiological conditions.

      Weaknesses:

      This manuscript is well-written, organized, and informative. However, there are some points that need to be clarified.

      (1) After MCNP-dye injection, does it remain in the blood vessels, adsorb onto the cell surface, or permeate into the cells? Does the MCNP-dye have cell selectivity?

      The experimental results showed that after injection, the MCNP series nanoparticles predominantly remained within the lumens of blood vessels and bile ducts, with their tissue distribution determined by physical perfusion. No diffusion of the dye signal into the surrounding parenchymal tissue was observed, nor was there any evidence of adsorption onto the cell surface or entry into cells. The newly added Supplementary Figure S2A–H further confirmed this feature, demonstrating that the dye signals were strictly confined to the luminal space, clearly delineating the continuous course of blood vessels and the branching morphology of bile ducts. These findings strongly support the conclusion that “MCNP dyes are distributed exclusively within the luminal compartments.”

      Therefore, the MCNP dyes primarily serve as intraluminal tracers within the tissue rather than as labels for specific cell types.

      (2) All MCNP-dyes were injected after the mice were sacrificed, and the mice's livers were fixed with PFA. After the blood flow had ceased, how did the authors ensure that the MCNP-dyes were fully and uniformly perfused into the microcirculation of the liver?

      We thank the reviewer for these valuable comments. Indeed, since all MCNP dyes were perfused after the mice were euthanized and blood circulation had ceased, we cannot fully ensure a homogeneous distribution of the dye within the hepatic microcirculation. The vascular labeling technique based on metallic nanoparticle dyes used in this study offers clear imaging, stable fluorescence intensity, and multiplexing advantages; however, it also has certain limitations. The main issue is that the dye distribution within the hepatic parenchyma can be affected by factors such as lobular overlap, local tissue compression, and variations in vascular pathways, resulting in regional inhomogeneity of dye perfusion. This is particularly evident in areas where multiple lobes converge or where anatomical structures are complex, leading to local dye accumulation or over-perfusion.

      In our experiments, we attempted to minimize local blockage or over-perfusion by performing PBS pre-flushing and low-pressure, constant-speed perfusion. Nevertheless, localized dye accumulation or uneven distribution may still occur in lobe junctions or structurally complex regions. Such variation represents one of the methodological limitations. Overall, the dye signals in most samples remained confined to the vascular and biliary lumens, and the distribution pattern was highly reproducible.

      We have addressed this issue in the Discussion section but would like to emphasize here that, although this system has clear advantages, it remains sensitive to anatomical variability in the liver—such as lobular overlap and vascular heterogeneity. At vascular junctions, local perfusion inhomogeneity or dye accumulation may occur; therefore, injection strategies and perfusion parameters should be adjusted according to liver size and vascular condition to improve reproducibility and imaging quality. It should also be noted that the results obtained using this method primarily aim to visualize the overall and fine anatomical structures of the hepatic vascular system rather than to quantitatively reflect hemodynamic processes. In the future, we plan to combine in vivo perfusion or dynamic fluid modeling to further validate the diffusion characteristics of the dyes within the hepatic microcirculation.

      (3) It is advisable to present additional 3D perspective views in the article, as the current images exhibit very weak 3D effects. Furthermore, it would be better to supplement with some videos to demonstrate the 3D effects of the stained blood vessels.

      We thank the reviewer for these valuable comments. In response to the suggestion, we have added perspective-rendered images generated from the 3D staining datasets to provide a more intuitive visualization of the spatial morphology of the hepatic vasculature. These images have been included in Figure S2A–J. In addition, we have prepared supplementary videos (available upon request) that dynamically display the three-dimensional distribution of the stained vessels, further enhancing the spatial perception and visualization of the results.

      (4) In Figure 1-I, the authors used MCNP-Black to stain the central veins; however, in addition to black, there are also yellow and red stains in the image. The authors need to explain what these stains are in the legend.

      We thank the reviewer for this constructive comment. In Figure 1I, MCNP-Black labels the central vein (black), MCNP-Yellow labels the portal vein (yellow), MCNP-Pink labels the hepatic artery (pink), and MCNP-Green labels the bile duct (green). We have revised the Figure 1 legend to include detailed descriptions of the color signals and their corresponding structures to avoid any potential confusion.

      (5) There is a typo in the title of Figure 4F; it should be "stem cell".

      We thank the reviewer for this careful correction. We have corrected the spelling error in the title of Figure 4F to “stem cell” and updated it in the revised manuscript.

      (6) Nuclear staining is necessary in immunofluorescence staining, especially for Figure 5e. This will help readers distinguish whether the green color in the image corresponds to cells or dye deposits.

      We thank the reviewer for the valuable suggestion. We understand that nuclear staining can help determine the origin of fluorescence signals. However, in our three-dimensional imaging system, the deep signal acquisition range after tissue clearing often causes nuclear dyes such as DAPI to generate highly dense and widespread fluorescence, especially in regions rich in vascular structures, which can obscure the fine vascular and perivascular details of interest. Therefore, this study primarily focuses on high-resolution visualization of the spatial architecture of the vascular and biliary systems. We have added an explanation regarding this point in Figures S2I–J.

      Reviewer #2 (Public review):

      Summary:

      The present manuscript of Xu et al. reports a novel clearing and imaging method focusing on the liver. The authors simultaneously visualized the portal vein, hepatic artery, central vein, and bile duct systems by injecting metal compound nanoparticles (MCNPs) with different colors into the portal vein, heart left ventricle, inferior vena cava, and the extrahepatic bile duct, respectively. The method involves: trans-cardiac perfusion with 4% PFA, the injection of MCNPs with different colors, clearing with the modified CUBIC method, cutting 200 micrometer thick slices by vibratome, and then microscopic imaging. The authors also perform various immunostaining (DAB or TSA signal amplification methods) on the tissue slices from MCNP-perfused tissue blocks. With the application of this methodical approach, the authors report dense and very fine vascular branches along the portal vein. The authors name them as 'periportal lamellar complex (PLC)' and report that PLC fine branches are directly connected to the sinusoids. The authors also claim that these structures co-localize with terminal bile duct branches and sympathetic nerve fibers, and contain endothelial cells with a distinct gene expression profile. Finally, the authors claim that PLC-s proliferate in liver fibrosis (CCl4 model) and act as a scaffold for proliferating bile ducts in ductular reaction and for ectopic parenchymal sympathetic nerve sprouting.

      Strengths:

      The simultaneous visualization of different hepatic vascular compartments and their combination with immunostaining is a potentially interesting novel methodological approach.

      Weaknesses:

      This reviewer has several concerns about the validity of the microscopic/morphological findings as well as the transcriptomics results. In this reviewer's opinion, the introduction contains overstatements regarding the potential of the method, there are severe caveats in the method descriptions, and several parts of the Results are not fully supported by the documentation. Thus, the conclusions of the paper may be critically viewed in their present form and may need reconsideration by the authors.

      We sincerely thank the reviewer for the thorough evaluation and constructive comments on our study. We fully understand and appreciate the reviewer’s concerns regarding the methodological validity and interpretation of the results. In response, we have made comprehensive revisions and additions to the manuscript as follows:

      First, we have carefully revised the Introduction and Discussion sections to provide a more balanced description of the methodological potential, removing statements that might be considered overstated, and clarifying the applicable scope and limitations of our approach (see the revised Introduction and Discussion).

      Second, we have substantially expanded the Methods section with detailed information on model construction, imaging parameters, data processing workflow, and technical aspects of the single-cell transcriptomic reanalysis, to enhance the transparency and reproducibility of the study.

      Third, we have added additional references and explanatory notes in the Results section to better support the main conclusions (see Section 6 of the Results).

      Finally, we have rechecked and validated all experimental data, and conducted a verification analysis using an independent single-cell RNA-seq dataset (Figure S6). The results confirm that the morphological observations and transcriptomic findings are consistent and reproducible across independent experiments.

      We believe these revisions have greatly strengthened the reliability of our conclusions and the overall scientific rigor of the manuscript. Once again, we sincerely appreciate the reviewer’s valuable comments, which have been very helpful in improving the logic and clarity of our work.

      Reviewer #3 (Public review):

      Summary:

      In the reviewed manuscript, researchers aimed to overcome the obstacles of high-resolution imaging of intact liver tissue. They report successful modification of the existing CUBIC protocol into Liver-CUBIC, a high-resolution multiplex 3D imaging method that integrates multicolor metallic compound nanoparticle (MCNP) perfusion with optimized liver tissue clearing, significantly reducing clearing time and enabling simultaneous 3D visualization of the portal vein, hepatic artery, bile ducts, and central vein spatial networks in the mouse liver. Using this novel platform, the researchers describe a previously unrecognized perivascular structure they termed Periportal Lamellar Complex (PLC), regularly distributed along the portal vein axis. The PLC originates from the portal vein and is characterized by a unique population of CD34⁺Sca-1⁺ dual-positive endothelial cells. Using available scRNAseq data, the authors assessed the CD34⁺Sca-1⁺ cells' expression profile, highlighting the mRNA presence of genes linked to neurodevelopment, biliary function, and hematopoietic niche potential. Different aspects of this analysis were then addressed by protein staining of selected marker proteins in the mouse liver tissue. Next, the authors addressed how the PLC and biliary system react to CCl4-induced liver fibrosis, implying PLC dynamically extends, acting as a scaffold that guides the migration and expansion of terminal bile ducts and sympathetic nerve fibers into the hepatic parenchyma upon injury.

      The work clearly demonstrates the usefulness of the Liver-CUBIC technique and the improvement of both resolution and complexity of the information, gained by simultaneous visualization of multiple vascular and biliary systems of the liver at the same time. The identification of PLC and the interpretation of its function represent an intriguing set of observations that will surely attract the attention of liver biologists as well as hepatologists; however, some claims need more thorough assessment by functional experimental approaches to decipher the functional molecules and the sequence of events before establishing the PLC as the key hub governing the activity of biliary, arterial, and neuronal liver systems. Similarly, the level of detail of the methods section does not appear to be sufficient to exactly recapitulate the performed experiments, which is of concern, given that the new technique is a cornerstone of the manuscript.

      Nevertheless, the work does bring a clear new insight into the liver structure and functional units and greatly improves the methodological toolbox to study it even further, and thus fully deserves the attention of readers.

      Strengths:

      The authors clearly demonstrate an improved technique tailored to the visualization of the liver vasculo-biliary architecture in unprecedented resolution.

      This work proposes a new biological framework between the portal vein, hepatic arteries, biliary tree, and intrahepatic innervation, centered at previously underappreciated protrusions of the portal veins - the Periportal Lamellar Complexes (PLCs).

      Weaknesses:

      Possible overinterpretation of the CD34+Sca1+ findings was built on re-analysis of one scRNAseq dataset.

      Lack of detail in the materials and methods section greatly limits the usefulness of the new technique to other researchers.

      We thank the reviewer for this important comment. We agree that when conclusions are mainly based on a single dataset, overinterpretation should be avoided. In response to this concern, we have carefully re-evaluated and clearly limited the scope of our interpretation of the scRNA-seq analysis. In addition, we performed a validation analysis using an independent single-cell RNA-seq dataset (see new Figure S6), which consistently confirmed the presence and characteristic transcriptional profile of the periportal CD34⁺Sca1⁺ endothelial cell population. These supplementary analyses strengthen the robustness of our findings and address the reviewer’s concern regarding potential overinterpretation.

      In the revised manuscript, we have also greatly expanded the Materials and Methods section by providing detailed information on sample preparation, imaging parameters, data processing workflow, and single-cell reanalysis procedures. These revisions substantially improve the transparency and reproducibility of our methodology, thereby enhancing the usability and reference value of this technique for other researchers.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Introduction

      (1) In general, the Introduction is very lengthy and repetitive. It needs extensive shortening to a maximum of 2 A4 pages.

      We thank the reviewer for the valuable suggestions. We have thoroughly condensed and restructured the Introduction, removing redundant content and merging related paragraphs to make the theme more focused and the logic clearer. The revised Introduction has been shortened to within two A4 pages, emphasizing the scientific question, innovation, and technical approach of the study.

      (2) Please correct this erroneous sentence:

      '...the liver has evolved the most complex and densely n organized vascular network in the body, consisting primarily of the portal vein system, central vein system, hepatic artery system, biliary system, and intrahepatic autonomic nerve network [6, 7].'

      We thank the reviewer for pointing out this spelling error. The revised sentence is as follows:

      “…the liver has evolved the most complex and densely organized ductal-vascular network in the body, consisting primarily of the portal vein system, central vein system, hepatic artery system, biliary system, and intrahepatic autonomic nerve network [6, 7].”

      (3) '...we achieved a 63.89% improvement in clearing efficiency and a 20.12% increase in tissue transparency'

      Please clarify what you exactly mean by 'clearing efficiency' and 'increased tissue transparency'.

      We thank the reviewer for the valuable comments and have clarified the relevant terminology in the revised manuscript.

      “Clearing efficiency” refers to the shortening of the time required for the liver tissue to become completely transparent when treated with the optimized Liver-CUBIC protocol (40% urea + H₂O₂), compared with the conventional CUBIC method. In this study, the clearing time was reduced from 9 days to 3.25 days, representing a 63.89% reduction in clearing time.
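
      The quoted percentage follows directly from the two clearing times. A minimal sketch of the arithmetic, assuming the figure is computed as the relative reduction in clearing time (the function name `percent_reduction` is ours, not from the manuscript):

```python
def percent_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of `optimized` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


# Clearing times stated in the response, in days.
conventional_cubic_days = 9.0
liver_cubic_days = 3.25

reduction = percent_reduction(conventional_cubic_days, liver_cubic_days)
print(f"{reduction:.2f}%")  # prints 63.89%
```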

      “Tissue transparency” refers to the ability of the cleared tissue to transmit visible light. We quantified the optical transparency by measuring light transmittance across the 400–900 nm wavelength range using a microplate reader. The results showed that the average transmittance increased by 20.12%, indicating that Liver-CUBIC treatment markedly enhanced the optical clarity of the liver tissue.
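
      One way such a transmittance comparison could be computed is sketched below. The readings are hypothetical placeholders, not data from the study, and the response does not state whether the 20.12% figure is an absolute difference (percentage points) or a relative change, so the sketch reports both.

```python
from statistics import mean
from typing import Mapping


def mean_transmittance(spectrum: Mapping[int, float],
                       lo_nm: int = 400, hi_nm: int = 900) -> float:
    """Average transmittance (%) over readings with lo_nm <= wavelength <= hi_nm."""
    values = [t for wl, t in spectrum.items() if lo_nm <= wl <= hi_nm]
    if not values:
        raise ValueError("no readings in the requested wavelength range")
    return mean(values)


# Hypothetical plate-reader output: wavelength (nm) -> transmittance (%).
cubic = {400: 30.0, 500: 40.0, 600: 50.0, 700: 60.0, 800: 70.0, 900: 80.0}
liver_cubic = {400: 41.0, 500: 51.0, 600: 61.0, 700: 71.0, 800: 81.0, 900: 91.0}

before = mean_transmittance(cubic)
after = mean_transmittance(liver_cubic)

absolute_gain_points = after - before                  # percentage points
relative_gain_percent = (after - before) / before * 100.0

print(f"absolute: {absolute_gain_points:.2f} points, "
      f"relative: {relative_gain_percent:.2f}%")
```

      Reporting which of the two definitions was used would remove the ambiguity for readers comparing protocols.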

      (4) I am concerned about claiming this imaging method as real '3D imaging'. Namely, while the authors clear full lobes, they actually cut the cleared lobes into 200-micrometer-thick slices and perform further microscopy imaging on these slices. Considering that they focus on ductular structures of the liver (such as vasculature, bile duct system, and innervations), 200 micrometer allows a very limited 3D overview, particularly in comparison with the whole-mount immuno-imaging methods combined with light sheet microscopy (such as Adori 2021, Liu 2021, etc). In this context, I feel several parts of the Introduction to be an overstatement: besides emphasizing the advantages of the technique (such as simultaneous visualization of different hepatic vascular compartments and the bile duct system by MCNPs, the combination with immunostainings), the authors must honestly discuss the limitations (such as limited tissue overview, potential dye perfusion problems - uneven distribution of the dye etc).

      We appreciate the reviewer’s insightful comments. It is true that most of the imaging depth in this study was limited to approximately 200 μm, and thus it could not achieve whole-liver three-dimensional imaging comparable to light-sheet microscopy. However, the primary focus of our study was to resolve the microscopic intrahepatic architecture, particularly the spatial relationships among blood vessels, bile ducts, and nerve fibers. Through high-resolution imaging of thick tissue sections, combined with MCNP-based multichannel labeling and immunofluorescence co-staining, we were able to accurately delineate the three-dimensional distribution of these microstructures within localized regions.

      In addition to thick-section imaging, we also obtained whole-lobe dye perfusion data (as shown in Figure S1F), which comprehensively depict the three-dimensional branching patterns and distribution of the vascular systems within the liver lobe. These images were acquired from intact liver lobes perfused with MCNP dyes, revealing a continuous vascular network extending from major trunks to peripheral branches, thereby demonstrating that our approach is also capable of achieving organ-level visualization.

      We have added this image and a corresponding description in the revised manuscript to more comprehensively present the coverage of our imaging system, and we have incorporated this clarification into the Discussion section.

      Method

      (5) More information may be needed about MCNPs:

      a) As reported, there are nanoparticles with different colors in brightfield microscopy, but the particles are also excitable in fluorescence microscopy. Would you please provide a summary about excitation/emission wavelengths of the different MCNPs? This is crucial to understand to what extent the method is compatible with fluorescence immunohistochemistry.

      We thank the reviewer for the careful attention and professional suggestion. We fully agree that this issue is critical for evaluating the compatibility of our method with fluorescent immunohistochemistry. Different types of metal compound nanoparticles (MCNPs) have clearly distinguishable spectral properties:

      - MCNP-Green and MCNP-Yellow: AF488-matched spectra, with excitation/emission wavelengths of 495/519 nm.

      - MCNP-Pink: Designed for far-red spectra, with excitation/emission wavelengths of 561/640 nm.

      - MCNP-Black: Non-fluorescent, appearing black under bright-field microscopy only.
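
      To judge compatibility with a planned immunofluorescence panel, the values listed above can be held in a small lookup and screened against a candidate fluorophore. This is only a sketch: the 30 nm peak-separation threshold and the helper name `conflicting_tracers` are illustrative assumptions, and a real assessment would compare full spectra and filter sets rather than emission peaks. Note that, as listed, MCNP-Green and MCNP-Yellow share the same excitation/emission values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tracer:
    name: str
    excitation_nm: Optional[int]  # None for non-fluorescent tracers
    emission_nm: Optional[int]


# Values as stated in the response.
MCNPS = [
    Tracer("MCNP-Green", 495, 519),
    Tracer("MCNP-Yellow", 495, 519),
    Tracer("MCNP-Pink", 561, 640),
    Tracer("MCNP-Black", None, None),
]


def conflicting_tracers(candidate_emission_nm: int,
                        min_separation_nm: int = 30) -> List[str]:
    """Names of MCNPs whose emission peak lies closer than
    min_separation_nm (an assumed threshold) to the candidate's peak."""
    return [
        t.name
        for t in MCNPS
        if t.emission_nm is not None
        and abs(t.emission_nm - candidate_emission_nm) < min_separation_nm
    ]


print(conflicting_tracers(519))  # a fluorophore with an AF488-like emission peak
print(conflicting_tracers(673))  # a fluorophore with a Cy5-like emission peak
```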

      The above information has been added to the Materials and Methods section.

      b) Also, is there more systematic information available concerning the advantage of these particles compared to 'traditional' fluorescence dyes, such as Alexa fluor or Cy-dyes, in fluorescence microscopy and concerning their compatibility with various tissue clearing methods (e.g., with the frequently used organic-solvent-based methods)?

      We thank the reviewer for the detailed question. Compared with conventional organic fluorescent dyes, MCNP offers the following advantages:

      - Enhanced photostability: Its inorganic core-shell structure resists fading even after hydrogen peroxide bleaching.

      - High signal stability: Fluorescence is maintained during aqueous-based clearing (e.g., CUBIC) and multiple rounds of staining without quenching.

      We appreciate the reviewer’s suggestion. In our Liver-CUBIC system, MCNP nanoparticles exhibited excellent multi-channel labeling stability and fluorescence signal retention. Regarding compatibility with other clearing methods (e.g., SCAFE, SeeDB, CUBIC), since these methods have limited effectiveness for whole-liver clearing (see Figure 2 of Tainaka, et al. 2014) and cannot meet the requirements for high-resolution microstructural imaging in this study, we consider further testing of their compatibility unnecessary.

      In summary, MCNP dye demonstrates superior signal stability and spectral separation compared with conventional organic fluorescent dyes in multi-channel, long-term, high-transparency three-dimensional tissue imaging.

      c) When you perfuse these particles, to which structures do they bind inside the ducts (vessels, bile ducts)? Is the 48h post-fixation enough to keep them inside the tubes/bind them to the vessel walls? Is there any 'wash-out' during the complex cutting/staining procedure? E.g., in Figure 2D: the 'classical' hepatic artery in the portal triad is not visible - but the MCNP apparently penetrated to the adjacent sinusoids at the edge of the lobulus. Also, in Figure 3B, there is a significant mismatch between the MCNP-green (bile duct) signal and the CK19 (epithelium marker) immunostaining. Please discuss these.

      The experimental results showed that following injection, MCNP nanoparticles primarily remained within the vascular and biliary lumens, and their tissue distribution depended on physical perfusion. No dye signal was observed to diffuse into the surrounding parenchyma, nor did the particles adhere to cell surfaces or enter cells. The newly added Supplementary Figures S2A–H further confirm this feature: the dye signal is strictly confined within the lumens, clearly delineating continuous vascular paths and biliary branching patterns, strongly supporting the conclusion that “MCNP dye is distributed only within luminal spaces.”

      Thus, MCNP dye mainly serves as an intraluminal tracer rather than a label for specific cell types.

      We provide the following explanations and analyses regarding MCNP distribution in the hepatic vascular and biliary systems and its post-fixation stability:

      - Potential signal displacement during sectioning/immunostaining: During slicing and immunostaining, a small number of particles may be washed away due to mechanical cutting or washing steps; however, the overall three-dimensional structure retains high spatial fidelity.

      - Observation in Figure 2D: MCNP was seen entering the sinusoidal spaces at the lobule periphery, but hepatic arteries were not visible, likely due to limitations in section thickness. Although arteries were not apparent in this slice, arterial distribution around the portal vein is visible in Figure 2C. It should be noted that Figures 2C, D, and E do not represent whole-liver imaging, so not all regions necessarily contain visible hepatic arteries. For easier identification, the main hepatic artery trunk is highlighted in cyan in Figure 2E.

      - Incomplete biliary signal in Figure 3B: This may be because CK19 labeling only covers biliary epithelial cells, whereas MCNP-green distributes throughout the biliary lumen. In Figure 3B, the terminal MCNP-green signal exhibits irregular polygonal structures, which we interpret as the canalicular regions.

      (6) Which fixative was used for 48h of postfixation (step 6) after MCNP injections?

      After MCNP injection, mouse livers were post-fixed in 4% paraformaldehyde (PFA) for 48 hours. This fixation condition effectively “locks” the MCNP particles within the vascular and biliary lumens, maintaining their spatial positions, while also being compatible with subsequent sectioning and multi-channel immunostaining analyses.

      The above information has been added to the Materials and Methods section.

      (7) What is the 'desired thickness' in step 7? In the case of immunostained tissue, a 200-micrometer slice thickness is mentioned. However, based on the Methods, it is not completely clear what the actual thickness of the tissue was that was examined ultimately in the microscopes, and whether or not the clearing preceded the cutting or vice versa.

      We appreciate the reviewer’s question. The “desired thickness” referred to in step 7 of the manuscript corresponds to the thickness of tissue sections used for immunostaining and high-resolution microscopic imaging, which is typically around 200 µm. We selected 200 µm because this thickness is sufficient to observe the PLC structure in its entirety, allows efficient staining, and preserves tissue architecture well. Other researchers may choose different section thicknesses according to their experimental needs.

      In this study, the processing order for immunostained tissue samples was sectioning followed by clearing, as detailed below:

      Section Thickness

      To ensure antibody penetration and preservation of three-dimensional structure, tissue sections were typically cut to ~200 µm. Thicker sections can be used if more complete three-dimensional structures are required, but adjustments may be needed based on antibody penetration and fluorescence detection conditions.

      Clearing Sequence

      After sectioning, slices were processed using the Liver-CUBIC aqueous-based clearing system.

      (8) More information is needed concerning the 'deep-focus microscopy' (Keyence), the applied confocal system, and the THUNDER 'high resolution imaging system': basic technical information, resolutions, objectives (N.A., working distance), lasers/illumination, filters, etc.

      Imaging Systems and Settings

      VHX-6000 Extended Depth-of-Field Microscope: Objective: VH-Z100R, 100×–1000×; resolution: 1 µm (typical); illumination: coaxial reflected; transmitted illumination on platform: ON.

      Zeiss Confocal Microscope (980): Objectives: 20× or 40×; image size: 1024 × 1024. Fluorescence detection was set up in three channels:

      - Channel 1: 639 nm laser, excitation 650 nm, emission 673 nm, detection range 673–758 nm, corresponding to Cy5-T1 (red).

      - Channel 2: 561 nm laser, excitation 548 nm, emission 561 nm, detection range 547–637 nm, corresponding to Cy3-T2 (orange).

      - Channel 3: 488 nm laser, excitation 493 nm, emission 517 nm, detection range 490–529 nm, corresponding to AF488-T3 (green).

      Leica THUNDER Imager 3D Tissue: Fluorescence detection in two channels:

      - Channel 1: FITC channel (excitation 488 nm, emission ~520 nm).

      - Channel 2: Orange-red channel (excitation/emission 561/640 nm).

      Equipped with matching filter sets to ensure signal separation.

      The above information has been added to the Materials and Methods section.

      (9) Liver-CUBIC, step 2: which lobe(s) did you clear (...whole liver lobes...).

      In this study, all liver lobes (left, right, caudate, and quadrate lobes) were subjected to Liver-CUBIC aqueous-based clearing to ensure uniform visualization of MCNP fluorescence and immunolabeling throughout the three-dimensional imaging of the entire liver.

      The above information has been added to the Materials and Methods section.

      (10) For the DAB and TSA IHC stainings, did you use free-floating slices, or did you mount the vibratome sections and do the staining on mounted sections?

      In this study, fixed livers were first sectioned into thick slices (~200 µm) using a vibratome. Subsequently, DAB and TSA immunohistochemical (IHC) staining were performed on free-floating sections. During the entire staining process, the slices were kept floating in the solutions, ensuring thorough antibody penetration in the thick sections while preserving the three-dimensional tissue architecture, thereby facilitating multiple rounds of staining and three-dimensional imaging.

      (11) Regarding the 'transmission quantification': this was measured on 1 mm thick slices. While it is interesting to make a comparison between different clearing methods in general, one must note that it is relatively easy to clear 1mm thick tissue slices with almost any kind of clearing technique and in any tissues. The 'real' differences come with thicker blocks, such as >5mm in the thinnest dimension. Do you have such experiences (e.g., comparison in whole 'left lateral liver lobes')?

      In this study, we performed three-dimensional visualization of entire liver lobes to depict the distribution of MCNPs and the overall spatial architecture of the vascular and biliary systems (Figure S1F). However, due to the limitations of the plate reader and fluorescence imaging systems in terms of spatial resolution and light penetration depth, quantitative analyses were conducted only on tissue sections approximately 1 mm thick.

      Regarding the comparative quantification of different clearing methods, as the reviewer noted, nearly all aqueous- or organic solvent–based clearing techniques can achieve relatively uniform transparency in 1 mm-thick tissue sections, so differences at this thickness are limited. We have not yet conducted systematic comparisons on whole-lobe sections thicker than 5 mm and therefore cannot provide “true” difference data for thicker tissues.

      (12) There is no method description for the ELMI studies in the Methods.

      Transmission Electron Microscopy (TEM) Analysis of MCNPs

      Before imaging, the MCNP dye solution was centrifuged at 14,000 × g for 10 minutes at 4 °C to remove aggregates and impurities. The supernatant was collected, diluted 50-fold, and 3–4 μL of the sample was applied onto freshly glow-discharged Quantifoil R1.2/1.3 copper grids (Electron Microscopy Sciences, 300 mesh). The sample was allowed to sit for 30 seconds to enable particle adsorption, after which excess liquid was gently wicked away with filter paper and the grid was air-dried at room temperature. The sample was then negatively stained with 1% uranyl acetate for 30 seconds and air-dried again before imaging.

      Negative-stain TEM images were acquired using a JEOL JEM-1400 transmission electron microscope operating at 120 kV and equipped with a CCD camera. Data acquisition followed standard imaging conditions.

      The above information has been added to the Materials and Methods section.

      (13) Please, provide a method description for the applied CCl4 cirrhosis model. This is completely missing.

      (1) Under a fume hood, carbon tetrachloride (CCl₄) was dissolved in corn oil at a 1:3 volume ratio to prepare a working solution, which was filtered through a 0.2 μm filter into a 30 mL glass vial. In our laboratory, to mimic chronic injury, mice in the experimental group were intraperitoneally injected at a dose of 1 mL/kg body weight per administration.

      (2) Mice were carefully removed from the cage and placed on a scale to record body weight for calculation of the injection volume.

      (3) The needle cap was carefully removed, and the required volume of the pre-prepared CCl₄ solution was drawn into the syringe. The syringe was gently flicked to remove any air bubbles.

      (4) Mice were placed on a textured surface (e.g., wire cage) and restrained. When the mouse was properly positioned, ideally with the head lowered about 30°, the left lower or right lower abdominal quadrant was identified.

      (5) Holding the syringe at a 45° angle, with the bevel facing up, the needle was inserted approximately 4–5 mm into the abdominal wall, and the calculated volume of CCl₄ was injected.

      (6) Mice were returned to their cage and observed for any signs of discomfort.

      (7) Needles and syringes were disposed of in a sharps container without recapping. A new syringe or needle was used for each mouse.

      (8) To establish a progressive liver fibrosis model, injections were administered twice per week (e.g., Monday and Thursday) for 3 or 6 consecutive weeks (n=3 per group). Control mice were injected with an equal volume of corn oil for 3 or 6 weeks (n=3 per group).

      (9) Forty-eight hours after the last injection, mice were euthanized by cervical dislocation, and livers were rapidly harvested. Portions of the liver were processed for paraffin embedding and histological sectioning, while the remaining tissue was either immediately frozen or used for subsequent molecular biology analyses.

      The above information has been added to the Materials and Methods section.
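
      The dosing arithmetic in steps (1) and (2) can be sketched as follows. This is only an illustration, not the authors' procedure: the 25 g body weight is a hypothetical value, and the sketch assumes that the stated 1 mL/kg is the volume injected per administration. The text does not say whether that figure refers to the working solution or to neat CCl₄, so the last line shows the CCl₄ content only under the first reading.

```python
def injection_volume_ul(body_weight_g: float, dose_ml_per_kg: float = 1.0) -> float:
    """Volume to inject (in µL) for a mouse of the given body weight (in g)."""
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    # 1 mL/kg is numerically equal to 1 µL/g, so µL = grams * (mL/kg).
    return body_weight_g * dose_ml_per_kg


def ccl4_fraction(ccl4_parts: float = 1.0, oil_parts: float = 3.0) -> float:
    """Volume fraction of CCl4 in a CCl4:corn oil mixture (1:3 gives 0.25)."""
    return ccl4_parts / (ccl4_parts + oil_parts)


weight_g = 25.0  # hypothetical body weight, not a value from the study
volume_ul = injection_volume_ul(weight_g)
print(f"{weight_g:.1f} g mouse: inject {volume_ul:.1f} µL")
# Assumption: the 1 mL/kg dose refers to the 1:3 working solution.
print(f"CCl4 in that volume: {volume_ul * ccl4_fraction():.2f} µL")
```

      Under these assumptions, a heavier mouse simply scales the volume linearly, which is why each animal is weighed before every administration.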

      (14) Please provide a method description for the quantifications reported in Figures 5D, 5F, and 6E.

      ImageJ software was used to analyze 3D stained images (Figs. 5F, 6E), and the ultra-depth-of-field 3D analysis module was used to analyze 3D DAB images (Fig. 5D). The specific steps are as follows:

      Figure 5D: DAB-stained 3D images from the control group and the CCl<sub>4</sub> 6-week (CCl<sub>4</sub>-6W) group were analyzed. For each group, 20 terminal bile duct branch nodes were randomly selected, and the actual path distance along the branch to the nearest portal vein surface was measured. All measurements were plotted as scatter plots to reflect the spatial extension of bile ducts relative to the portal vein under different conditions.

      Figure 5F: TSA 3D multiplex-stained images from the control group, CCl<sub>4</sub> 3-week (CCl<sub>4</sub>-3W), and CCl<sub>4</sub> 6-week (CCl<sub>4</sub>-6W) groups were analyzed. For each group, 5 terminal bile duct branch nodes were randomly selected, and the actual path distance along the branch to the nearest portal vein surface was measured. Measurements were plotted as scatter plots to illustrate bile duct spatial extension.

      Figure 6E: TSA 3D multiplex-stained images from the control, CCl<sub>4</sub>-3W, and CCl<sub>4</sub>-6W groups were analyzed. For each group, 5 terminal nerve branch nodes were randomly selected, and the actual path distance along the branch to the nearest portal vein surface was measured. Scatter plots were generated to depict the spatial distribution of nerves under different treatment conditions.
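
      The measurement described for Figures 5D, 5F, and 6E (random selection of terminal nodes, then the path distance along the branch to the portal vein surface) can be illustrated with a short sketch. The authors report using ImageJ and the microscope's 3D analysis module; the code below is not their pipeline. The function names, the seed, and the example trace coordinates are all hypothetical, and a branch is assumed to be available as an ordered list of 3D points in µm.

```python
import math
import random


def path_length(points):
    """Sum of segment lengths along an ordered 3D trace (same units as input)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def sample_nodes(node_ids, k, seed=0):
    """Pick k terminal nodes at random without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(list(node_ids), k)


# Hypothetical trace (µm) from a terminal node back to the portal vein surface.
trace = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]

along_branch = path_length(trace)             # follows the branch: 5 + 12
straight_line = math.dist(trace[0], trace[-1])  # end-to-end shortcut
print(along_branch, straight_line)
```

      The path distance is never shorter than the straight-line distance between the two end points, which is why the two should not be used interchangeably when comparing groups.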

      (15) Please provide a method description for the human liver samples you used in Figure S6. Patient data, fixation, etc...

      The human liver tissue samples shown in Figure S6 were obtained from adjacent non-tumor liver tissues resected during surgical operations at West China Hospital, Sichuan University. All samples used were anonymized archived tissues, which were applied for scientific research in accordance with institutional ethical guidelines and did not involve any identifiable patient information. After being fixed in 10% neutral formalin for 24 hours, the tissues were routinely processed for paraffin embedding (FFPE), and sectioned into 4 μm-thick slices for immunostaining and fluorescence imaging.

      Results

      (16) While it is stated in the Methods that certain color MCNPs were used for labelling different structures (i.e., yellow: hepatic artery; green: bile duct; portal vein: pink; central veins: black), in some figures, apparently different color MCNPs are used for the respective structures. E.g., in Figure 1J, the artery is pink and the portal vein is green. Please clarify this.

      The color assignment of MCNP dyes is not fixed across different experiments or schematic illustrations. MCNP dyes of different colors are fundamentally identical in their physical and chemical properties and do not exhibit specific binding or affinity for particular vascular structures. We select different colors based on experimental design and imaging presentation needs to facilitate distinction and visualization, thereby enhancing recognition in 3D reconstruction and image display. Therefore, the color labeling in Figure 1F is primarily intended to illustrate the distribution of different vascular systems, rather than indicating a fixed correspondence to a specific dye or injection color.

      (17) In Figure 1J, the hepatic artery is extremely shrunk, while the portal vein is extremely dilated - compared to the physiological situation. Does it relate to the perfusion conditions?

      We appreciate the reviewer’s attention. In fact, under normal physiological conditions, the hepatic arteries labeled by CD31 are naturally narrow. Therefore, the relatively thin hepatic arteries and thicker portal veins shown in Figure 1J are normal and unrelated to the perfusion conditions. See figure 1E of Adori et al., 2021.

      (18) Re: MCNP-black labelled 'oval fenestrae': the Results state 50-100 nm, while they are apparently 5-10-micron diameter in Figure 1I. Accordingly, the comparison with the ELMI studies in the subsequent paragraph is inappropriate.

      We thank the reviewer for the correction. The previous statement was a typographical error. The “elliptical windows” marked by MCNP-black are 5–10 μm in diameter, which is consistent with Figure 1I.

      (19) Please, correct this erroneous sentence: 'Pink marked the hepatic arterial system by injection extrahepatic duct (Figure 2B).'

      Original sentence: “The hepatic arterial system was labeled in pink by injection through the extrahepatic duct (Figure 2B).”

      Revised sentence: “The hepatic arterial system was labeled in pink by injection through the left ventricle (Figure 2B).”

      (20) How do you define the 'primary portal vein tract'?

      We thank the reviewer for the question. The term “primary portal vein tract” refers to the first-order branches of the portal vein that enter the liver from the hepatic hilum. These are the major branches arising directly from the main portal vein trunk and are responsible for supplying blood to the respective hepatic lobes. This definition corresponds to the concept of the first-order portal vein in hepatic anatomy.

      (21) I am concerned whether the 'periportal lamellar complex (PLC)' that the Authors describe really exists as a distinct anatomical or functional unit. I also see these in 3D scans - in my opinion, these are fine, lower-order portal vein branches that connect the portal veins to the adjacent sinusoid. The strong MCNP-labelling of these structures may be caused by the 'sticking' of the perfused MCNP solutions in these 'pockets' during the perfusion process. What do these structures look like with SMA or CD31 immunostaining? Also, one may consider that the anatomical evaluation of these structures may have limitations in tissue slices. Have you ever checked MCNP-perfused, cleared full liver lobes in light sheet microscope scans? I think this would be very useful to have a comprehensive morphological overview. Unfortunately, based on the presented documentation, I am also not convinced that PLCs 'co-localize' with fine terminal bile duct branches (Figure 3E, S3C), or with TH+ 'neuronal bead chain networks' (Fig 6C). More detailed and more convincing documentation is needed here.

      We thank the reviewer for the detailed comments. Regarding the existence and function of the periportal lamellar complex (PLC), our observations are based on MCNP-Pink labeling of the portal vein, through which we were able to identify the PLC structure surrounding the portal branches. It should be noted that the PLC represents a very small anatomical structure. Although we have not yet performed light-sheet microscopy scanning, we anticipate that such imaging would primarily visualize larger portal vein branches. Nevertheless, this does not affect our overall conclusions.

      We also appreciate the reviewer’s suggestion that the observed structures might result from MCNP adherence during perfusion. To verify the structural characteristics of the PLC, we performed immunostaining for SMA and CD31, which revealed a specific arrangement pattern of smooth muscle and endothelial markers rather than simple perfusion-induced deposition (Figures 4F and S6B).

      Regarding the apparent colocalization of the PLC with terminal bile duct branches (Figures 3E and S3C) and TH⁺ neuronal bead-like networks (Figure 6C), we acknowledge that current literature evidence remains limited. Therefore, we have carefully described these observations as possible spatial associations rather than definitive conclusions. Future studies integrating high-resolution three-dimensional imaging with functional analyses will help to further clarify the anatomical and physiological significance of the PLC.

      (22) 'Extended depth-of-field three-dimensional bright-field imaging revealed a strict 1:1 anatomical association between the primary portal vein trunk (diameter 280 ± 32 μm) and the first-order bile duct (diameter 69 ± 8 μm) (Figures 3A and S3A)'.

      How do you define '1:1 anatomical association'? How do you define and identify the 'order' (primary, secondary) of vessel and bile duct branches in 200-micrometer slices?

      We thank the reviewer for the question. In this study, the term “1:1 anatomical association” refers to the stable paired spatial relationship between the main portal vein trunk and its corresponding primary bile duct within the same portal territory. In other words, each main portal vein branch is accompanied by a primary bile duct of matching branching order and trajectory, together forming a “vascular–biliary bundle.”

      The definitions of “primary” and “secondary” branches were based on extended-depth 3D bright-field reconstructions, considering both branching hierarchy and vessel/duct diameters: primary branches arise directly from the main trunk at the hepatic hilum and exhibit the largest diameters (averaging 280 ± 32 μm for the portal vein and 69 ± 8 μm for the bile duct), whereas secondary branches extend from the primary branches toward the lobular interior with smaller calibers.

      (23) In my opinion, the applied methodical approach in the single cell transcriptomics part (data mining in the existing liver single cell database and performing Venn diagram intersection analysis in hepatic endothelial subpopulations) is largely inappropriate and thus, all the statements here are purely speculative. In my opinion, to identify the molecular characteristics of such small and spatially highly organized structures like those fine radial portal branches, the only way is to perform high-resolution spatial transcriptomic.

      We thank the reviewer for the comment. We fully acknowledge the importance of high-resolution spatial transcriptomics in identifying the fine structural characteristics of portal vein branches. Due to current funding and technical limitations, we were unable to perform such high-resolution spatial transcriptomic analyses. However, we validated the molecular features of the PLC using another publicly available liver single-cell RNA-sequencing dataset, which provided preliminary supporting evidence (Figures S6B and S6C). In the manuscript, we have carefully stated that this analysis is exploratory in nature and have avoided overinterpretation. In future studies, high-resolution spatial omics approaches will be invaluable for more precisely delineating the molecular characteristics of these fine structures.

      (24) 'How the autonomic nervous system regulates liver function in mice despite the apparent absence of substantive nerve fiber invasion into the parenchyma remains unclear.'

      Please consider the role of gap junctions between hepatocytes (e.g., Miyashita, 1991; Seseke, 1992).

      In this study, we analyzed the spatial distribution of hepatic nerves in mice using immunofluorescence staining and found that nerve fibers were almost exclusively confined to the portal vein region (Figure S6A). Notably, this distribution pattern differs markedly from that in humans. Previous studies have shown that, in human livers, nerves are not only located around the portal veins but also present along the central veins, interlobular septa, and within the parenchymal connective tissue (Miller et al., 2021; Yi, la Fleur, Fliers & Kalsbeek, 2010).

      Further research has provided a physiological explanation for this interspecies difference: even among species with distinct sympathetic innervation patterns in the parenchyma—i.e., with or without direct sympathetic input—the sympathetic efferent regulatory functions may remain comparable (Beckh, Fuchs, Ballé & Jungermann, 1990). This is because signals released from aminergic and peptidergic nerve terminals can be transmitted to hepatocytes through gap junctions as electrical signals (Hertzberg & Gilula, 1979; Jensen, Alpini & Glaser, 2013; Seseke, Gardemann & Jungermann, 1992; Taher, Farr & Adeli, 2017).

      However, the scarcity of nerve fibers within the mouse hepatic parenchyma suggests that the mechanisms by which the autonomic nervous system regulates liver function in mice may differ from those in humans. This observation prompted us to further investigate the potential role of PLC endothelial cells in this process.

      (25) Please, correct typos throughout the text.

      We thank the reviewer for this comment. We have carefully proofread the entire manuscript and corrected all typographical errors and minor language issues throughout the text.

      Reviewer #3 (Recommendations for the authors):

      (1) A strong recommendation - the authors ought to challenge their scRNA-seq re-analysis with another scRNA-seq dataset, namely a recently published atlas of adult liver endothelial, but also mesenchymal, immune, and parenchymal cell populations https://pubmed.ncbi.nlm.nih.gov/40954217/, performed with the Smart-seq2 approach, which is perfectly suitable as it brings higher-resolution data and extensive cluster identity validation with stainings. Pietilä et al. indicate a clear distinction of portal vein endothelial cells into two populations that express Adgrg6, Jag1 (e2c), from Vegfc double-positive populations (e5c and e2c). Moreover, the dataset also includes the arterial endothelial cells that were shown to be part of the PLC, but were not followed up with the scRNA-seq analysis. This distinction could help the authors to further validate their results, better controlling for cross-contaminations that may occur during scRNA-seq preparation.

      We thank the reviewer for the valuable suggestion. As noted, we have further validated the molecular characteristics of the PLC using a recently published atlas of adult liver endothelial cells (Pietilä et al., 2023, PMID: 40954217). This dataset, generated using the Smart-seq2 technique, provides high-resolution transcriptomic profiles. By analyzing this dataset, we identified a CD34⁺LY6A⁺ portal vein endothelial cell population within the e2 cluster, which is localized around the portal vein. We then examined pathways and gene expression patterns related to hematopoiesis, bile duct formation, and neural signaling within these cells. The results revealed gene enrichment patterns consistent with those observed in our primary dataset, further supporting the robustness of our analysis of the PLC’s molecular characteristics.
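
      The selection step described above (cells positive for both markers within one cluster) can be sketched in a few lines. This is a toy illustration only and not the authors' analysis code: the data structure, the cell IDs, the expression values, and the zero threshold for calling a gene "positive" are all hypothetical assumptions, and a real re-analysis would work on the published expression matrix with a dedicated single-cell toolkit.

```python
def double_positive(cells, gene_a, gene_b, cluster=None, threshold=0.0):
    """Return IDs of cells expressing both genes above threshold.

    If cluster is given, only cells annotated with that cluster are considered.
    """
    hits = []
    for cell_id, info in cells.items():
        if cluster is not None and info["cluster"] != cluster:
            continue
        expr = info["expr"]
        if expr.get(gene_a, 0.0) > threshold and expr.get(gene_b, 0.0) > threshold:
            hits.append(cell_id)
    return hits


# Hypothetical toy data: cluster label plus normalized expression per cell.
cells = {
    "c1": {"cluster": "e2", "expr": {"Cd34": 2.1, "Ly6a": 1.4}},
    "c2": {"cluster": "e2", "expr": {"Cd34": 0.0, "Ly6a": 3.0}},
    "c3": {"cluster": "e5", "expr": {"Cd34": 1.0, "Ly6a": 1.0}},
}

print(double_positive(cells, "Cd34", "Ly6a", cluster="e2"))  # ['c1']
```

      The choice of threshold strongly affects how many cells are called double-positive, so any such cutoff would need to be reported alongside the result.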

      (2) Improving the methods section is highly recommended, this includes more detailed information for material and protocols used - catalog numbers; protocol details of the usage - rocking platforms, timing, and tubes used for incubations; GitHub or similar page with code used for the scRNA seq re-analysis.

      We thank the reviewer for the valuable suggestion. We have added more detailed information regarding the materials and experimental procedures in the Methods section, including catalog numbers, incubation conditions (such as the type of shaker, incubation time, and tube specifications), and other relevant parameters.

      (3) In Figure 2A, the authors claim the size of the nanoparticle is 100nm, while based on the image, the size is ~150-180nm. A more thorough quantification of the particle size would help users estimate the usability of their method for further applications.

      We thank the reviewer for the comment. In the TEM image shown in Figure 2A, the nanoparticles indeed appear to be approximately 150–200 nm in size. We have re-verified the particle dimensions and will update the corresponding description in the Methods section to allow readers to more accurately assess the applicability of this approach.

      (4) In Figure 3E, it is not clear what is labeled by the pink signal. Please consider labeling the structures in the figure.

      We thank the reviewer for the valuable comment. The pink signal in Figure 3E was originally intended to label the hepatic artery. However, a slight spatial misalignment occurred during the labeling process, making its position appear closer to the central vein rather than the portal vein in the image. To avoid misunderstanding, we will add clear annotations to the image and clarify this deviation in the figure legend in the revised version. It should also be noted that this figure primarily aims to illustrate the spatial relationship between the bile duct and the portal vein, and this minor deviation does not affect the reliability of our experimental conclusions.

      (5) The following statement is not backed by quantification, as it ought to be: "Dual-channel three-dimensional confocal imaging combined with CK19 immunostaining revealed that the sites of dye leakage did not coincide with the CK19-positive terminal bile duct epithelium, but instead were predominantly localized within regions adjacent to the PLC structures".

      We thank the reviewer for the valuable comment. We have added the corresponding quantitative analysis to support this conclusion. Quantitative assessment of the extended-depth imaging data revealed that dye leakage predominantly occurred in regions adjacent to the PLC structure, rather than in the perivenous sinusoidal areas. The corresponding results have been presented in the revised Figure 3G.

      (6) Similarly, Figure 4F is central to the Sca1CD34 cell type identification but lacks any quantification; providing it would strengthen the key statement of the article. A possible way to approach this is also by FACS sorting the double-positive cells and bulk/qRT validation.

      We thank the reviewer for raising this point. We agree that quantitative validation of the Sca1⁺CD34⁺ population by FACS sorting could further support our conclusions. However, the primary focus of this study is on the spatial localization and transcriptional features of PLC endothelial cells. The identification of the Sca1⁺CD34⁺ subset is robustly supported by multiple complementary approaches, including three-dimensional imaging, co-staining with pan-endothelial markers, and projection mapping analyses. Collectively, these lines of evidence provide a solid basis for characterizing this unique endothelial population.

      (7) The images in Figure S4D are not comparable, as the Sca1-stained image shows a longitudinal section of the PV, but the other stainings are cross-sections of PVs.

      We thank the reviewer for the careful comment. We agree that the original Sca1-stained image, being a longitudinal section of the portal vein, was not optimal for direct comparison with other cross-sectional images. We have replaced it with a cross-sectional image of the portal vein to ensure comparability across all images. The updated image has been included in the revised Supplementary Figure S4D.

      (8) I might be wrong, but Figure 4J is entirely missing, and only a cartoon is provided. Either remove the results part or provide the data.

      We appreciate the reviewer’s careful observation. Figure 4J was intentionally designed as a schematic illustration to summarize the structural relationships and spatial organization of the portal vein, hepatic artery, and PLC identified in the previous panels (Figures 4A–4I). It does not represent newly acquired experimental data, but rather serves to provide a conceptual overview of the findings.

      To avoid misunderstanding, we have clarified this point in the figure legend and the main text, stating that Figure 4J is a schematic summary rather than an experimental image. Therefore, we respectfully prefer to retain the schematic figure to aid readers’ interpretation of the preceding results.

      (9) The methods section lacks information about the CCl4 concentration, and it is thus hard to estimate the dosage of CCl4 received (ml/kg). This is important for the interpretation of the severity of the fibrosis and presence of cirrhosis, as different doses may or may not lead to cirrhosis within the short regimen performed by the authors [PMID: 16015684 DOI: 10.3748/wjg.v11.i27.4167]. Validation of the fibrosis/cirrhosis severity is, in this case, crucial for the correct interpretation of the results. If the level of cirrhosis is not confirmed, only progressive fibrosis should be mentioned in the manuscript, as these two terms cannot be used interchangeably.

      We thank the reviewer for the comment. The concentration of carbon tetrachloride (CCl<sub>4</sub>) was indeed omitted from the Methods section. In our experiments, mice received intraperitoneal injections of CCl<sub>4</sub> at a dose of 1 mL/kg body weight, twice per week, for a total of six weeks. We have revised the manuscript accordingly, using the term “progressive fibrosis” to avoid confusion between fibrosis and cirrhosis.

      (10) The following statement is not backed by any correlation analysis: "Particularly during liver fibrosis progression, the PLC exhibits dynamic structural extension correlating with fibrosis severity,.. ".

      We thank the reviewer for the comment. The original statement that the “PLC correlates with fibrosis severity” lacked support from quantitative analysis. To ensure a precise description, we have revised the sentence as follows: “During liver fibrosis progression, the PLC exhibits dynamic structural extension.”

      (11) Similarly, the following statement is not followed by data that would address the impact of innervation on liver function: "How the autonomic nervous system regulates liver function in mice despite the apparent absence of substantive nerve fiber invasion into the parenchyma remains unclear.".

      This section has been revised. In this study, we analyzed the spatial distribution of nerves in the mouse liver using immunofluorescence staining. The results showed that nerve fibers were almost entirely confined to the portal vein region (Figure S6A). Notably, this distribution pattern differs significantly from that in humans. Previous studies have demonstrated that in the human liver, nerves are not only distributed around the portal vein but also present in the central vein, interlobular septa, and connective tissue of the hepatic parenchyma (Miller et al., 2021; Yi, la Fleur, Fliers & Kalsbeek, 2010).

      Previous studies have further explained the physiological basis for this difference: even among species with differences in parenchymal sympathetic innervation (i.e., species with or without direct sympathetic input), their sympathetic efferent regulatory functions may still be similar (Beckh, Fuchs, Ballé & Jungermann, 1990). This is because signals released by adrenergic and peptidergic nerve terminals can be transmitted to hepatocytes as electrical signals through intercellular gap junctions (Hertzberg & Gilula, 1979; Jensen, Alpini & Glaser, 2013; Seseke, Gardemann & Jungermann, 1992; Taher, Farr & Adeli, 2017). However, the scarcity of nerve fibers in the mouse hepatic parenchyma suggests that the mechanism by which the autonomic nervous system regulates liver function in mice may differ from that in humans. This finding also prompts us to further explore the potential role of PLC endothelial cells in this process.

      (12) Could the authors discuss their interpretation of the results in light of the fact that the innervation is lower in cirrhotic patients? https://pmc.ncbi.nlm.nih.gov/articles/PMC2871629/. Also, while ADGRG6 (Gpr126) may play important roles in liver Schwann cells, it is likely not through affecting myelination of the nerves, as the liver nerves are not myelinated https://pubmed.ncbi.nlm.nih.gov/2407769/ and https://www.pnas.org/doi/10.1073/pnas.93.23.13280.

      We have revised the text to state that although most hepatic nerves are unmyelinated, GPR126 (ADGRG6) may regulate hepatic nerve distribution via non-myelination-dependent mechanisms. Studies have shown that GPR126 exerts both Schwann cell–dependent and –independent functions during peripheral nerve repair, influencing axon guidance, mechanosensation, and ECM remodeling (Mogha et al., 2016; Monk et al., 2011; Paavola et al., 2014).

      (13) The manuscript would benefit from text curation that would:

      a) Unify the language describing the PLC, so it is clear that (if) it represents protrusions of the portal veins.

      We have standardized the description of the PLC throughout the manuscript, clearly specifying its anatomical relationship with the portal vein. Wherever appropriate, we indicate that the PLC represents protrusions associated with the portal vein, avoiding ambiguous or inconsistent statements.

      b) Increase the accuracy of the statements.

      Examples: "bile ducts, and the central vein in adult mouse livers."

      We have refined all statements for accuracy.

      c) Reduce the space given to discussion and results in the introduction, moving them to the respective parts. The same applies to the results section, where discussion occurs at more places than in the Discussion part itself.

      We have edited the Introduction, removing detailed results and functional explanations, and retaining only a concise overview.

      Examples: "The formation of PLC structures in the adventitial layer may participate in local blood flow regulation, maintenance of microenvironmental homeostasis, and vascular-stem cell interactions."

      "This finding suggests that PLC endothelial cells not only regulate the periportal microcirculatory blood flow, but also establish a specialized microenvironment that supports periportal hematopoietic regulation, contributing to stem cell recruitment, vascular homeostasis, and tissue repair. "

      "Together, these findings suggest the PLC endothelium may act as a key regulator of bile duct branching and fibrotic microenvironment remodeling in liver cirrhosis. " This one in particular would require further validation with protein stainings and similar, directly in your model.

      d) Provide a clear reference for the used scRNA seq so it's clear that the data were re-analyzed.

      Example: "single-cell transcriptomic analysis revealed significant upregulation of bile duct-related genes in the CD34<sup>+</sup>Sca-1<sup>+</sup> endothelium of PLC in cirrhotic liver, with notably high expression of Lgals1 (Galectin-1) and HGF(Figure 5G) "

      When describing the transcriptional analysis of PLC endothelial cells, we explicitly cited the original scRNA-seq dataset (Su et al., 2021), clarifying that these data were reanalyzed rather than newly generated.

      e) Introducing references for claims that, in places, are crucial for further interpretation of experiments.

      Examples: "It not only guides bile duct branching during development but also"; the authors show no data from liver development.

      Thank you for pointing this out. We have revised the relevant statement to ensure that the claim is accurate and well-supported.

      f) Results sentence "Instead, bile duct epithelial cells at the terminal ducts extended partially along the canalicular network without directly participating in the formation of the bile duct lumen." Lacks a callout to the respective Figure.

      We would like to thank the reviewers for pointing out this issue. In the revised manuscript, the relevant image (Figure 3D) has been clearly annotated with white arrows to indicate the phenomenon of terminal cholangiocytes extending along the bile canaliculi network. Additionally, the schematic diagram on the right side clearly shows the bile canaliculi, cholangiocytes, and bile flow direction using arrows and color coding, thus intuitively corresponding to the textual description.

      (14) Formal text suggestions: The manuscript text contains a lot of missed or excessive spaces and several typos that ought to be fixed. A few examples follow:

      a) "densely n organized vascular network "

      b) "analysis, while offering high spatial "

      c) "specific differences, In the human liver, "

      d) Figure 4F has a typo in the description.

      e) "generation of high signal-to-noise ratio, multi-target " SNR abbreviation was introduced earlier.

      f) Canals of Hering, CoH abbreviation comes much later than the first mention of the Canals of Hering.

      We thank the reviewer for the helpful comment regarding textual consistency. We have carefully reviewed and revised the entire manuscript to improve the accuracy, clarity, and consistency of the text.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Domínguez-Rodrigo and colleagues make a moderately convincing case for habitual elephant butchery by Early Pleistocene hominins at Olduvai Gorge (Tanzania), ca. 1.8-1.7 million years ago. They present this at the site scale (the EAK locality, which they excavated), as well as across the penecontemporaneous landscape, analyzing a series of findspots that contain stone tools and large-mammal bones. The latter are primarily elephants, but giraffids and bovids were also butchered in a few localities. The authors claim that this is the earliest well-documented evidence for elephant butchery; doing so requires debunking other purported cases of elephant butchery in the literature, or in one case, reinterpreting elephant bone manipulation as being nutritional (fracturing to obtain marrow) rather than technological (to make bone tools). The authors' critical discussion of these cases may not be consensual, but it surely advances the scientific discourse. The authors conclude by suggesting that an evolutionary threshold was achieved at ca. 1.8 ma, whereby regular elephant consumption rich in fats and perhaps food surplus, more advanced extractive technology (the Acheulian toolkit), and larger human group size had coincided.

      The fieldwork and spatial statistics methods are presented in detail and are solid and helpful, especially the excellent description (all too rare in zooarchaeology papers) of bone conservation and preservation procedures. However, the methods of the zooarchaeological and taphonomic analysis - the core of the study - are peculiarly missing. Some of these are explained along the manuscript, but not in a standard Methods paragraph with suitable references and an explicit account of how the authors recorded bone-surface modifications and the mode of bone fragmentation. This seems more of a technical omission that can be easily fixed than a true shortcoming of the study. The results are detailed and clearly presented.

      By and large, the authors achieved their aims, showcasing recurring elephant butchery in 1.8-1.7 million-year-old archaeological contexts. Nevertheless, some ambiguity surrounds the evolutionary significance part. The authors emphasize the temporal and spatial correlation of (1) elephant butchery, (2) Acheulian toolkits, and (3) larger sites, but do not actually discuss how these elements may be causally related. Is it not possible that larger group size or the adoption of Acheulian technology have nothing to do with megafaunal exploitation? Alternative hypotheses exist, and at least, the authors should try to defend the causation, not just put forward the correlation. The only exception is briefly mentioning food surplus as a "significant advantage", but how exactly, in the absence of food-preservation technologies? Moreover, in a landscape full of aggressive scavengers, such excess carcass parts may become a death trap for hominins, not an advantage. I do think that demonstrating habitual butchery bears very significant implications for human evolution, but more effort should be invested in explaining how this might have worked.

      Overall, this is an interesting manuscript of broad interest that presents original data and interpretations from the Early Pleistocene archaeology of Olduvai Gorge. These observations and the authors' critical review of previously published evidence are an important contribution that will form the basis for building models of Early Pleistocene hominin adaptation.

      This is a good example of the advantages of the eLife reviewing process. It has become much too common, among traditional peer-reviewing journals, to reject articles when the reviews do not agree, regardless of the heuristics (i.e., empirically-supported weight) of the arguments of both reviewers. Reviewers 1 and 2 provide contrasting evaluations, and the eLife dialogue between authors and reviewers enables us to address their comments differentially. Reviewer 1 (R1), whose evaluation is overall positive, remarks that the methods of the zooarchaeological and taphonomic analysis are missing. We have added them now in the revised version of our manuscript. R1 also remarks that our work highlights correlation of events, but not necessarily causation. We did not establish causation because such interpretations bear a considerable amount of speculation (and they might have fostered further criticism by R2); however, in the revised version, we expanded our discussion of these issues substantially. Establishing causation among the events described is impossible, but we certainly provide arguments to link them.

      Reviewer #2 (Public review):

      The authors argue that the Emiliano Aguirre Korongo (EAK) assemblage from the base of Bed II at Olduvai Gorge shows systematic exploitation of elephants by hominins about 1.78 million years ago. They describe it as the earliest clear case of proboscidean butchery at Olduvai and link it to a larger behavioral shift from the Oldowan to the Acheulean.

      The paper includes detailed faunal and spatial data. The excavation and mapping methods appear to be careful, and the figures and tables effectively document the assemblage. The data presentation is strong, but the behavioral interpretation is not supported by the evidence.

      The claim for butchery is based mainly on the presence of green-bone fractures and the proximity of bones and stone artifacts. These observations do not prove human activity. Fractures of this kind can form naturally when bones break while still fresh, and spatial overlap can result from post-depositional processes. The studies cited to support these points, including work by Haynes and colleagues, explain that such traces alone are not diagnostic of butchery, but this paper presents them as if they were.

      The spatial analyses are technically correct, but their interpretation extends beyond what they can demonstrate. Clustering indicates proximity, not behavior. The claim that statistical results demonstrate a functional link between bones and artifacts is not justified. Other studies that use these methods combine them with direct modification evidence, which is lacking in this case.

      The discussion treats different bodies of evidence unevenly. Well-documented cut-marked specimens from Nyayanga and other sites are described as uncertain, while less direct evidence at EAK is treated as decisive. This selective approach weakens the argument and creates inconsistency in how evidence is judged.

      The broader evolutionary conclusions are not supported by the data. The paper presents EAK as marking the start of systematic megafaunal exploitation, but the evidence does not show this. The assemblage is described well, but the behavioral and evolutionary interpretations extend far beyond what can be demonstrated.

      We disagree with the arguments provided by Reviewer 2 (R2). The arguments are based on two issues: bone breakage and spatial association. We will treat both separately here.

      Bone breakage

      R2 argues that:

      “The claim for butchery is based mainly on the presence of green-bone fractures and the proximity of bones and stone artifacts. These observations do not prove human activity. Fractures of this kind can form naturally when bones break while still fresh, and spatial overlap can result from post-depositional processes. The studies cited to support these points, including work by Haynes and colleagues, explain that such traces alone are not diagnostic of butchery, but this paper presents them as if they were.”

      In our manuscript, we argued that green breakage provides equally good (or even better) taphonomic evidence of butchery if documented following clear taphonomic indicators. Not all green breaks are equal and not all “cut marks” are unambiguously identifiable as such. First, “natural” elephant long limb breaks have been documented only in pre/peri-mortem stages when an elephant breaks a leg. As a matter of fact, they have only been reported in publications on femora, the thinnest long bone (Haynes et al., 2021). Unfortunately, they have been studied many months after the death of the individuals, and the published diagnosis is made under the assumption that no other process intervened in the modification of those bones during this long time span. Most of the breaks resulting from pre-mortem fractures produce long, smooth, oblique/helical outlines. Occasionally, some flake scarring may occur on the cortical surface. This scarring has been documented as uneven, small-sized, and spaced, and we are not sure whether it resulted from rubbing of broken fragments while the animal was alive and attempting to walk, or whether some of it resulted from desiccation of the bone after one year. When looked at in detail, such breaks sometimes contain step-microfractures and angular (butterfly-like) outlines. Sometimes, they may be accompanied by pseudo-notches, which are distinct and not comparable to the deep notches that hammerstone breaking generates on the same types of bones. Commonly, the edges of the breaks show some polishing, probably from separate break planes rubbing against each other. It should be emphasized that the experimental work on hammerstone breaking documented by Haynes et al. (2021) is based on the fracture properties of bones that are no longer completely green. The cracking documented in their hammerstone experimentation, with very irregular outlines, differs from the cracking that we have documented in the butchery of recently dead elephants.

      All this contrasts with the overlapping notches and flake scars (mostly occurring on the medullary side of the bone), both of them bigger in size, with clear smooth, spiral and longitudinal trajectories, with a more intensive modification on the medullary surface, and with sharp break edges resulting from hammerstone breaking of the green bone. No “natural” break has been documented replicating the same morphologies displayed in the Supplementary File to our paper. We display specimens with inflection points, hackle marks on the breaks, overlapping scarring on the medullary surface, with several specimens displaying percussion marks and pitting (also most likely percussion marks). Most importantly, we document this patterned modification on elements other than femora, for which no example has been documented of purported morphological equifinality caused by pre-mortem “natural” breaking. In contrast, such morphologies are documented in hammerstone-broken completely green bones (work in progress). We cited the works of Haynes to support this, because they do not show otherwise. As a matter of fact, Haynes himself had the courtesy of making a thorough reading of our manuscript and did not encounter any contradiction with his work. 

      Spatial association

      R2 argues in this regard:

      “The spatial analyses are technically correct, but their interpretation extends beyond what they can demonstrate. Clustering indicates proximity, not behavior. The claim that statistical results demonstrate a functional link between bones and artifacts is not justified. Other studies that use these methods combine them with direct modification evidence, which is lacking in this case.”

      We should emphasize that there is some confusion in the use and interpretation of clustering by R2 when applied to EAK. R2 appears to interpret clustering as the typical naked-eye perception of the spatial association of different items. In contrast, we rely on the statistical concept of clustering, more specifically on spatial interdependence or covariance, which is different. Items may appear visually clustered but still be statistically independent. This could, for example, result from two independent depositional episodes that happen to overlap spatially. In such cases, the item-to-item relationship does not necessarily show any spatial interdependence between classes other than simple clustering (i.e., spatial coincidence in intensity).

      Spatial statistical interdependence, on the other hand, reflects a spatial relationship or co-dependence between different items. This goes beyond the mere fact that classes appear clustered: items between classes may show specific spatial relationships — they may avoid each other or occupy distinct positions in space (regular co-dependence), or they may interact within the same spatial area (clustering co-dependence). Our tests indicate the latter for EAK.

      Such patterns are difficult to explain when depositional events are unrelated, since the probability that two independent events would generate identical spatial patterns in the same loci is very low. They are also difficult to reconcile when post-depositional processes intervene and resediment part of the assemblage (Domínguez-Rodrigo et al. 2018).
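
      The distinction drawn above between visual clustering and statistical interdependence can be illustrated with a minimal sketch. It is not the authors' analysis: it assumes a rectangular excavation window, uses a naive bivariate K statistic without edge correction, and tests independence by randomly shifting one point class on a torus. The function names (`cross_k`, `toroidal_shift`, `independence_test`) and the simulated "bones" and "tools" coordinates are hypothetical.

```python
import math
import random

def cross_k(a, b, r, area):
    """Naive bivariate K: scaled count of b-points within r of a-points (no edge correction)."""
    pairs = sum(1 for ax, ay in a for bx, by in b
                if math.hypot(ax - bx, ay - by) <= r)
    return area * pairs / (len(a) * len(b))

def toroidal_shift(points, dx, dy, width, height):
    """Shift a whole pattern rigidly, wrapping around the window edges."""
    return [((x + dx) % width, (y + dy) % height) for x, y in points]

def independence_test(a, b, r, width, height, n_sim=199, seed=0):
    """Monte Carlo test: is the cross-type K larger than expected if a and b were independent?"""
    rng = random.Random(seed)
    area = width * height
    observed = cross_k(a, b, r, area)
    exceed = 0
    for _ in range(n_sim):
        # A random rigid shift keeps the internal structure of b but breaks any link to a.
        shifted = toroidal_shift(b, rng.uniform(0, width), rng.uniform(0, height),
                                 width, height)
        if cross_k(a, shifted, r, area) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (n_sim + 1)

# Hypothetical data: each "tool" lies within half a unit of a "bone" (dependent classes).
rng = random.Random(42)
bones = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(30)]
tools = [(x + rng.uniform(-0.5, 0.5), y + rng.uniform(-0.5, 0.5)) for x, y in bones]

k_obs, p_value = independence_test(bones, tools, r=2.0, width=100, height=100)
print(f"cross-K at r=2: {k_obs:.1f} (independence benchmark: {math.pi * 2.0**2:.1f})")
print(f"Monte Carlo p-value: {p_value:.3f}")
```

      The random-shift null preserves each class's own clustering, so a small p-value points to dependence between classes rather than to two clustered patterns that merely overlap. A published analysis would normally add edge correction and evaluate a range of distances.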

      Finally, R2 concludes:

      “The discussion treats different bodies of evidence unevenly. Well-documented cut-marked specimens from Nyayanga and other sites are described as uncertain, while less direct evidence at EAK is treated as decisive. This selective approach weakens the argument and creates inconsistency in how evidence is judged.”

      The modifications on the Nyayanga hippo remains are not well-documented cut marks. Neither R2 nor we can differentiate those marks from those inflicted by natural abrasive processes in coarse-grained sedimentary contexts, where the carcasses are found. The fact that the observable microscopic features (as seen in the low-quality photographs of the original publication) differ between the cut marks documented on smaller animals and those inferred for the hippo remains makes them even more ambiguous. Nowhere in our manuscript do we treat the EAK evidence (or any other evidence) as decisive, but as the most likely given the methods used and the results reported.

      References

      Haynes G, Krasinski K, Wojtal P. 2021. A Study of Fractured Proboscidean Bones in Recent and Fossil Assemblages. Journal of Archaeological Method and Theory 28:956–1025.

      Domínguez-Rodrigo, M., Cobo-Sánchez, L., Yravedra, J., Uribelarrea, D., Arriaza, C., Organista, E., Baquedano, E. 2018. Fluvial spatial taphonomy: a new method for the study of post-depositional processes. Archaeological and Anthropological Sciences 10: 1769-1789.

      Recommendations for authors:

      Reviewer #1 (Recommendations for the authors):

      I have several recommendations that, in my opinion, could enhance the communication of this study to the readers. The first point is the only crucial one.

      (1) A detailed zooarchaeological methods section must be added, with explanations (or references to them) of precisely how the authors defined and recorded bone-surface modifications and mode of bone fragmentation.

      This appears in the revised version of the manuscript in the form of a new sub-section within the Methods section.

      (2) The title could be improved to better represent the contents of the paper. It contains two parts: the earliest evidence for elephant butchery (that's ok), and revealing the evolutionary impact of megafaunal exploitation. The latter point is not actually revealed in the manuscript, just alluded to here and there (see also below).

      We have elaborated on this in the revised version, linking megafaunal exploitation and anatomical changes (which appear discussed in much more detail in the references indicated).

      (3) The abstract does not make it clear whether the authors think that the megafaunal adaptation strongly correlates with the Acheulian technocomplex. It seems that they do, so please make this point apparent in the abstract.

      From a functional point of view, we document the correlation, but do not believe in the causation, since most butchering tools around these megafaunal carcasses are typologically non-Acheulian. We have indicated so in the abstract.

      (4) Please define what you mean by "megafauna". How large should an animal be to be considered as megafauna in this particular context?

      We have added this definition: we identify as “megafauna” those animals heavier than 800 kg.

      (5) In the literature survey, consider also this Middle Pleistocene case-study of elephant butchery, including a probable bone tool: Rabinovich, R., Ackermann, O., Aladjem, E., Barkai, R., Biton, R., Milevski, I., Solodenko, N., and Marder, O., 2012. Elephants at the middle Pleistocene Acheulian open-air site of Revadim Quarry, Israel. Quaternary International, 276, pp.183-197.

      Added to the revised version

      (6) The paragraph in lines 123-160 is unclear. Do the authors argue that the lack of evidence for processing elephant carcasses for marrow and grease is universal? They bring forth a single example of a much later (MIS 5) site in Germany. Then, the authors state the huge importance of fats for foragers (when? Where? Surely not in all latitudes and ecosystems). This left me confused - what exactly are you trying to claim here?

      We have explained this a little more in the revised text. What we pointed out was that most prehistoric (and modern) elephant butchery sites leave grease-containing long bones intact. Evidence of anthropogenic breakage of these elements is rather limited. The most probable reason is the overabundance of meat and fat from the rest of the carcass and the time-consuming effort needed to access the medullary cavity of elephant long bones.

      (7) The paragraph in lines 174-187 disrupts the flow of the text, contains previously mentioned information, ends with an unclear sentence, and could be cut.

      (8) Results: please provide the MNI for the EAK site (presumably 1, but this is never mentioned).

      Done in the revised version.

      (9) Lines 292 - 295: The authors found no traces of carnivoran activity (carnivoran remains, coprolites, or gnawing marks on the elephant bones), yet they attribute the absence of some non-dense skeletal elements to carnivore ravaging. I cannot understand this rationale, given that other density-mediated processes could have deleted the missing bones and epiphysis.

      This interpretation stems from our observations of several elephant carcasses in the Okavango delta in Botswana. Those that were monitored showed deletion of remains (i.e., disappearance of certain bones, like feet) without necessarily imprinting damage on the rest of the carcass. Carnivore intervention in an elephant death site can result in deletion of a few remains without much damage (if any), or if hyena clans access the carcass, much more conspicuous damage can be documented. There is a whole range of carnivore signatures in between. We are currently working on our study of several elephant carcasses subjected to these highly variable degrees of carnivore impact.

      (10) Lines 412 - 422: "The clustering of the elephant (and hippopotamus) carcasses in the areas containing the highest densities of landscape surface artifacts is suggestive of a hominin agency in at least part of their consumption and modification." - how so? It could equally suggest that both hominins and elephants were drawn to the same lush environments.

      We agree. Both hominins and megafauna must have been drawn to the same ecological loci for interaction to emerge. However, the fact that the highest-density clusters of artifacts coincide with the highest density of carcasses “showing evidence of having been broken” is suggestive of hominin use and consumption.

      (11) Discussion: I suggest starting the Discussion with a concise appraisal of the lines of evidence detailed in the Results and their interpretation, and only then, the critical reassessment of other studies. Similarly, a new topic starts in line 508, but without any subheading or an introductory sentence that could assist the readers.

      We added the introductory lines of the former Conclusion section to the revised Discussion section, as suggested by R1.

      (12) Line 607: Neumark-Nord are Late Pleistocene sites (MIS 5), not Middle Pleistocene.

      Corrected.

      (13) Regarding the ambiguity in how megafaunal exploitation may be causally related to the other features of the early Acheulian, the authors can develop the discussion. Alternatively, they should explicitly state that correlation is not causation, and that the present study adds the megafaunal exploitation element to be considered in future discussion of the shifts in lifestyles 1.8 million years ago.

      We have done so.

      Reviewer #2 (Recommendations for the authors):

      The following detailed comments are provided to help clarify arguments, ensure accurate representation of cited literature, and strengthen the logical and methodological framing of the paper. Line numbers refer to the version provided for review.

      (1) Line 55: Such concurrency (sometimes in conjunction with other variables)

      The term "other variables" is very vague. I would suggest expanding on this or taking it out altogether.

      (2) Line 146: Megafaunal long bone green breakage (linked to continuous spiral fractures on thick cortical bone) is probably a less ambiguous trace of butchery than "cut marks", since many of the latter could be equifinal and harder to identify, especially in contexts of high abrasion and trampling (Haynes et al., 2021, 2020).

      This reasoning is not supported by the evidence or the cited sources. Green-bone spiral fractures only show that a bone broke while it was fresh and do not reveal who or what caused it. Carnivore feeding, trampling, and natural sediment pressure can all create the same patterns, so these fractures are not clearer evidence of butchery than cut marks. Cut marks, when they are preserved and morphologically clear, remain the most reliable indicator of human activity. The Haynes papers actually show the opposite of what is claimed here. They warn that spiral fractures and surface marks can form naturally and that fracture patterns alone cannot be used to infer butchery. This section should be revised to reflect what those studies actually demonstrate.

      The reasoning referred to in line 146 is further explained below in the original text as follows:

      “Despite the occurrence of green fractures on naturally-broken bones, such as those trampled by elephants (Haynes et al., 2020), those occurring through traumatic fracturing or gnawed by carnivores (Haynes and Hutson, 2020), these fail to reproduce the elongated, extensive, or helicoidal spiral fractures (uninterrupted by stepped sections), accompanied by the overlapping conchoidal scars (both cortical and medullary), the reflected scarring, the inflection points, or the impact hackled break surfaces and flakes typical of dynamic percussive breakage. Evidence of this type of green breakage had not been documented earlier for the Early Pleistocene proboscidean or hippopotamid carcasses, beyond the documentation of flaked bone with the purpose of elaboration of bone tools (Backwell and d’Errico, 2004; Pante et al., 2020; Sano et al., 2020).”

      The problem with the way R2 uses Haynes et al.’s works is that R2 considers features separately. Natural breaks occurring while the bone is green can generate smooth spiral breaks, for example, but it is not the presence of a single feature that invalidates the diagnosis of agency or that is taphonomically relevant, but the concurrence of several of them. The best example of a naturally (pre-mortem) broken bone was published by Haynes et al.

      The natural break shows helical fractures, subjugated to linear (angular) fracture outlines. Notice how the crack displays a zig-zag. The break is smooth but most damage occurs on the cortical surface, with flaking adjacent to the break and step micro-fracturing on the edges. The cortical scarring is discontinuous (almost marginal) and very small, almost limited to the very edge of the break. No modification occurs on the medullary surface. No extensive conchoidal fractures are documented, and certainly none inside the medullary surface of the break.

      Compare with Figures S8, S10, S17 and S34 (all specimens are shown in their medullary surface):

      In these examples, we see clearly modified medullary surfaces with multiple green breaks and large-sized step fractures, accompanied in some examples by hackle marks. Some show large overlapping scars (of substantially bigger size than those documented in the natural break image). Not a single example of naturally-broken bones has been documented displaying these morphologies simultaneously. It is the comprehensive analysis of the co-occurrence of these features, and not their marginal and isolated occurrence in naturally-broken bones, that makes the difference in the attribution of agency. Likewise, no example of naturally-broken bone has been published that could mimic either of the two green-broken bones documented at EAK. In contrast, we do have bones from our ongoing experimentation with green elephant carcasses that jointly reproduce these features. See also Figure 6 of the article for another example without any modern referent among the naturally-broken bones documented.

      We should emphasize that R2 is inaccurately portraying what Haynes et al.’s results really document. Contrary to R2’s assertion, trampling does not reproduce any of the examples shown above. Neither do carnivores. It should be stressed that Haynes & Harrod only document similar overlapping scarring on the medullary surface of bones when using much smaller animals. In all the carnivore damage repertoire that they document for elephants, durophagous spotted hyenas can only inflict furrowing on the ends of the biggest long bones, especially if they are adults. Long bone midshafts remain inaccessible to them. The mid-shaft portions of bones that we document in our Supplementary File and at EAK cannot be the result of hyena (or other carnivore) damage for this reason, and also because intense hyena gnawing on elephant bones leaves tooth marks on most of the elements that they modify, and such marks are absent from our sample.

      (3) Line 176: other than hominins accessed them in different taphonomically-defined stages- stages - the "Stages" is repeated twice

      Defined in the revised version

      (4) Line 174: Regardless of the type of butchery evidence - and with the taphonomic caveat that no unambiguous evidence exists to confirm that megafaunal carcasses were hunted or scavenged other than hominins accessed them in different taphonomically-defined stages- stages - the principal reasons for exploring megafaunal consumption in early human evolution is its origin, its episodic or temporally-patterned occurrence, its impact on hominin adaptation to certain landscapes, and its reflection on hominin group size and site functionality.

      This sentence is confusing and needs to be rewritten for clarity. It tries to combine too many ideas at once, and the phrasing makes it hard to tell what the main point is. The taphonomic caveat in the middle interrupts the sentence and obscures the argument. It should be broken into separate, clearer statements that distinguish what evidence exists, what remains uncertain, and what the broader goals of the discussion are.

      We believe the ideas are displayed clearly

      (5) Line 179: landscapes, and its reflection on hominin group size and site functionality. If hominins actively sought the exploitation of megafauna, especially if targeting early stages of carcass consumption, the recovery of an apparent surplus of resources reflects a substantially different behavior from the small-group/small-site pattern documented at several earlier Oldowan anthropogenic sites (Domínguez-Rodrigo et al., 2019) -or some modern foragers, like the Hadza, who only exploit megafaunal carcasses very sporadically, mostly upon opportunistic encounters (Marlowe, 2010; O'Connell et al., 1992; Wood, 2010; Wood and Marlowe, 2013).

      This sentence makes a reasonable point, but is written in a confusing way. The idea that early, deliberate access to megafauna would represent a different behavioral pattern from smaller Oldowan or modern foraging contexts is valid, but the sentence is awkward and hard to follow. It should be rephrased to make the logic clearer and more direct.

      We believe the ideas are displayed clearly

      (6) Line 186: When the process started of becoming megafaunal commensal started has major implications for human evolution.

      This sentence is awkward and needs to be rewritten for clarity. The phrasing "when the process started of becoming megafaunal commensal started" is confusing and grammatically incorrect. It could be revised to something like "Determining when hominins first began to interact regularly with megafauna has major implications for human evolution," or another version that clearly identifies the process being discussed.

      Modified in the revised version

      (7) Line189: The multiple taphonomic biases intervening in the palimpsestic nature of most of these butchery sites often prevent the detection of the causal traces linking megafaunal carcasses and hominins. Functional links have commonly been assumed through the spatial concurrence of tools and carcass remains; however, this perception may be utterly unjustified as we argued above. Functional association of both archaeological elements can more securely be detected through objective spatial statistical methods. This has been argued to be foundational for heuristic interpretations of proboscidean butchery sites (Giusti, 2021). Such an approach removes ambiguity and solidifies spatial functional association, as demonstrated at sites like Marathousa 1 (Konidaris et al., 2018) or TK Sivatherium (Panera et al., 2019). This method will play a major role in the present study.

      This section overstates what spatial analysis can demonstrate and misrepresents the cited studies. The works by Giusti (2021), Konidaris et al. (2018), and Panera et al. (2019) do use spatial statistics to examine relationships between artifacts and faunal remains, but they explicitly caution that spatial overlap alone does not prove functional or behavioral association. These studies argue that clustering can support such interpretations only when combined with detailed taphonomic and stratigraphic evidence. None of them claims that spatial analysis "removes ambiguity" or "solidifies" functional links. The text should be revised to reflect the more qualified conclusions of those papers and to avoid implying that spatial statistics can establish behavioral causation on their own.

      We disagree. Both works (Giusti and Panera) use spatial statistical tools to create an inferential basis reinforcing a functional association of lithics and bones. In both cases, the anthropogenic agency inferred is based on that. We should stress that this only provides a basis for argumentation, not a definitive causation. Again, those analyses show much more than just apparent visual clustering.

      (8) Line 200: Here, we present the discovery of a new elephant butchery site (Emiliano Aguirre Korongo, EAK), dated to 1.78 Ma, from the base of Bed II at Olduvai Gorge. It is the oldest unambiguous proboscidean butchery site at Olduvai.

      It is fine to state the main finding in the introduction, but the phrasing here is too strong. Calling EAK "the oldest unambiguous proboscidean butchery site" asserts certainty before the evidence is presented. The claim should be stated more cautiously, for example, "a new site that provides early evidence for proboscidean butchery," so that the language reflects the strength of the data rather than pre-judging it.

      We understand the caution by R2, but in this case, EAK is the oldest taphonomically-supported evidence of elephant butchery at Olduvai (see discussion about FLK North in the text). Whether this is declared at the beginning or the end of the text is irrelevant.

      (9) Line 224: The drying that characterizes Bed II had not yet taken place during this moment.

      This sentence reads like a literal translation. It should be rewritten for clarity.

      Modified in the revised version

      (10) Line 233: During the recent Holocene, the EAK site was affected by a small landslide which displaced the...

      This section contains far more geological detail than is needed for the argument. The reader only needs to know that the site block was displaced by a small Holocene landslide but retains its stratigraphic integrity. The extended discussion of regional faults, seismicity, and slope processes goes well beyond what is necessary for context and distracts from the main focus of the paper.

      We disagree. The geological information is what is most commonly missing from most archaeological reports. Here, it is relevant because of the atypical process and because it has been documented only twice with elephant butchery sites. Explaining the dynamic geological process that shaped the site helps to understand its spatial properties.

      (11) Line 264: In June 2022, a partial elephant carcass was found at EAK on a fragmented stratigraphic block...

      This section reads like field notes rather than a formal site description. Most of the details about the discovery sequence, trench setup, and excavation process are unnecessary for the main text. Only the basic contextual information about the find location, stratigraphic position, and anatomical composition is needed. The rest could be condensed or moved to the methods or supplementary material.

      We disagree. See reply above.

      (12) Line 291: hominins or other carnivores. Ongoing restoration work will provide an accurate estimate of well-preserved and modified fractions of the assemblage.

      This sentence is unclear and needs to specify what kind of restoration work is being done and what is meant by well-preserved and modified fractions. It is not clear whether modified refers to surface marks, diagenetic alteration, or something else. If the bones are still being cleaned or prepared, the analysis is incomplete, and the counts cannot be considered final. If restoration only means conservation or stabilization, that should be stated clearly so the reader understands that it does not affect the results. As written, it is not clear whether the data presented here are preliminary or complete.

      We added: For this reason, until restoration is concluded, we cannot produce any assertion about the presence or absence of bone surface modifications.

      (13) Line 294: The tibiae were well preserved, but the epiphyseal portions of the femora were missing, probably removed by carnivores, which would also explain why a large portion of the rib cage and almost all vertebrae are missing.

      This explanation is not well supported. The missing elements could be the result of other forms of density-mediated destruction, such as sediment compaction or post-depositional fragmentation, especially since no tooth marks were found. Given the low density of ribs, vertebrae, and femoral epiphyses, these processes are more likely explanations than carnivore removal. The text should acknowledge these alternatives rather than attributing the pattern to carnivore activity without direct evidence.

      Sediment compaction and post-depositional fragmentation can break bones but cannot make them disappear. Our excavation process was careful enough to detect bone if present. Their absence indicates two possibilities: erosion through the years at the front of the excavation or carnivore intervention. Carnivores can take elephant bones without impacting the remaining assemblage (see our reply above to a similar comment).

      (14) Line 304: The fact that the carcass was moved while encased in its sedimentary context, along with the close association of stone tools with the elephant bones, is in agreement with the inference that the animal was butchered by hominins. A more objective way to assess this association is through spatial statistical analysis.

      The authors state that "the carcass was moved while encased in its sedimentary context, along with the close association of stone tools with the elephant bones, is in agreement with the inference that the animal was butchered by hominins." This does not logically follow. Movement of the block explains why the bones and tools remain together, not how that association was created. The preserved association alone does not demonstrate butchery, especially in the absence of cut marks or other direct evidence of hominin activity.

      Again, we are sorry that R2 is completely overlooking the strong signal detected by the spatial statistical analysis. The way the block moved preserved the original association of bones and tools. This statement is meant to clarify that, despite the allochthonous nature of the block, the original autochthonous depositional process of both types of archaeological materials has been preserved. The spatial association, as statistically demonstrated, indicates that the functional link is more likely than any other alternative process. The additional fact that nowhere else in that portion of the outcrop do we identify scatters of tools (all appear clustered at a landscape scale with the elephant) adds more support to this interpretation. This would no doubt have been further supported by the presence of cut marks, but their absence does not indicate a lack of functional association since, as Haynes' work has clearly shown, most bulk defleshing of modern elephants leaves no traces on most bones.

      (15) Line 370: This also shows that the functional connection between the elephant bones and the tools has been maintained despite the block post-sedimentary movement.

      The spatial analyses appear to have been carried out appropriately, and the interpretations of clustering and segregation are consistent with the reported results. However, the conclusion that the "functional connection" between bones and tools has been maintained goes beyond what spatial correlation alone can demonstrate. These analyses show spatial proximity and scale-dependent clustering but cannot, by themselves, confirm a behavioral or functional link.

      R2 is making this comment repeatedly and we have addressed it more than once above. We disagree and we refer to our replies above to sustain it.

      (16) Line 412: The clustering of the elephant (and hippopotamus) carcasses in the areas containing the highest densities of landscape surface artifacts is suggestive of a hominin agency in at least part of their consumption and modification. The presence of green broken elephant long bone elements in the area surveyed is only documented within such clusters, both for lower and upper Bed II. This constitutes inverse negative evidence for natural breaks occurring on those carcasses through natural (i.e., non-hominin) pre- and peri-mortem limb breaking (Haynes et al., 2021, 2020; Haynes and Hutson, 2020). In this latter case, it would be expected for green-broken bones to show a more random landscape distribution, and occur in similar frequencies in areas with intense hominin landscape use (as documented in high density artifact deposition) and those with marginal or non-hominin intervention (mostly devoid of anthropogenic lithic remains).

      The clustering of green-bone fractures with stone tools is intriguing but should be interpreted cautiously. The Haynes references are misrepresented here. Those studies address both cut marks and green-bone (spiral) fractures, emphasizing that each can arise through non-hominin processes such as trampling, carcass collapse, and sediment loading. They do not treat green fractures as clearer evidence of butchery; in fact, they caution that such breakage patterns can occur naturally and even form clustered distributions in areas of repeated animal activity. The claim that these studies support spiral fractures as unambiguous indicators of hominin activity, or that natural breaks would be randomly distributed, is not accurate.

      We would like to emphasize again that the Haynes references are not misrepresented here. See our extensive reply above. If R2 can provide evidence of natural breakage patterns, resulting from pre-mortem limb breaking or post-mortem trampling, in which all limb bones are affected and show smooth spiral breaks accompanied by extensive and overlapping scarring on the medullary surface, in conjunction with the other features described in our replies above, then we would be willing to reconsider. According to the evidence reported so far, these features do not occur simultaneously on specimens from studies of modern elephant bones.

      R2 seems to contradict him(her)self here by saying that Haynes' studies show that cut marks are not reliable because they can also be reproduced via trampling. Until this point, R2 had been saying that only cut marks could demonstrate a functional link and support butchery. Haynes' studies do not deal experimentally with sediment loading.

      (17) Line 424: This indicates that from lower Bed II (1.78 Ma) onwards, there is ample documented evidence of anthropogenic agency in the modification of proboscidean bones across the Olduvai paleolandscapes. The discovery of EAK constitutes, in this respect, the oldest evidence thereof at the gorge. The taphonomic evidence of dynamic proboscidean bone breaking across time and space supports, therefore, the inferences made by the spatial statistical analyses of bones and lithics at the site.

      This conclusion is overstated. The claim of "ample documented evidence of anthropogenic agency" is too strong, given that the main support comes from indirect indicators like green-bone fractures and spatial clustering rather than clear butchery marks. It would be more accurate to say that the evidence suggests or is consistent with possible hominin involvement. The final sentence also conflates association with causation; spatial and taphonomic data can indicate a relationship, but do not confirm that the carcasses were butchered by hominins.

      The evidence is based on the spatial clustering (at a landscape scale) of tools and elephant (and other megafaunal taxa) bones, in conjunction with a large number of green-broken elements. This interpretation is supported even more strongly when compared against modern referents. In the past few years, we have been conducting work on modern naturally dead elephant carcasses in Botswana and Zambia, and of the several carcasses that we have seen, we have not identified a single case of long bone shaft breaks like those described by Haynes as natural or like those we describe here as anthropogenic. This probably means that they are highly unlikely or marginal occurrences at a landscape scale. This seems to be supported by Haynes' work too. Out of the hundreds of elephant carcasses that he has monitored and studied over the years for different works, we have managed to identify only two instances where he described natural pre-mortem breaks. This certainly qualifies as extremely marginal.

      Most of the Results section is clearly descriptive, but beginning with "The clustering of the elephant (and hippopotamus) carcasses..." the text shifts from reporting observations to drawing behavioral conclusions. From this point on, it interprets the data as evidence of hominin activity rather than simply describing the patterns. This part would be more appropriate for the Discussion, or should be rewritten in a neutral, descriptive way if it is meant to stay in the Results.

      This is discussed extensively in the Discussion section, but the data presented in the Results are also interpreted in that section, following a clear chain of argument.

      (18) Line 433: A recent discovery of a couple of hippopotamus partial carcasses at the 3.0-2.6 Ma site of Nyayanga (Kenya), spatially concurrent with stone artifacts, has been argued to be causally linked by the presence of cut marks on some bones (Plummer et al., 2023). The only evidence published thereof is a series of bone surface modifications on a hippo rib and a tibial crest, which we suggest may be the result of byproduct of abiotic abrasive processes; the marks contrast noticeably with the well-defined cut marks found on smaller mammal bones (Plummer et al.'s 2023: Figure 3C, D) associated with the hippo remains (Plummer et al., 2023).

      The authors suggest that the Nyayanga marks could result from abiotic abrasion, but this claim does not engage with the detailed evidence presented by Plummer et al. (2023). Plummer and colleagues documented well-defined, morphologically consistent cut marks and considered the sedimentary context in their interpretation. Raising abrasion as a general possibility without addressing that analysis gives the impression of selective skepticism rather than an evaluation grounded in the published data.

      We disagree again on this matter. R2 does not clarify what he/she means by well-defined or morphologically consistent. We provide an alternative interpretation of those marks that fits their morphology and features and that Plummer et al. did not successfully exclude. We also emphasize that the interpretation of the Nyayanga marks was made descriptively, without any analytical approach and with a high degree of subjectivity by the researcher. All of this disqualifies the approach as well defined and reflects an outdated view of modern taphonomy. Descriptive taphonomy is a thing of the 1980s. Today there is a plethora of analytical methods, from multivariate statistics to geometric morphometrics to AI computer vision (so far the most reliable), which represent how taphonomy (and more specifically, the analysis of bone surface modifications) should be conducted in the 21st century. These approaches would reinforce interpretations such as those preliminarily published by Plummer et al., provided they reject alternative explanations like those that we have provided.

      (19) Line 459: It would have been essential to document that the FLK N6 tools associated with the elephant were either on the same depositional surface as the elephant bones and/or on the same vertical position. The ambiguity about the FLK N6 elephant renders EAK the oldest secure proboscidean butchery evidence at Olduvai, and also probably one of the oldest in the early Pleistocene elsewhere in Africa.

      The concern about vertical mixing is fair, but the tone makes it sound like the association is definitely not real. It would be more accurate to say that the evidence is ambiguous, not that it should be dismissed altogether.

      We have done precisely that. We do not dismiss it, but we cannot take it for anything solid, since we excavated the site and showed how easily one could make functional associations if one forgets about the third dimension. It is not a secure butchery site. This is what we said and we stick to this statement.

      (20) Line 479: In all cases, these wet environments must have been preferred places for water-dependent megafauna, like elephants and hippos, and their overlapping ecological niches are reflected in the spatial co-occurrence of their carcasses. Both types of megafauna show traces of hominin use through either cutmarked or percussed bones, green-broken bones, or both (Supplementary Information).

      The environmental part is good, but the behavioral interpretation is too strong. Saying elephants and hippos "must have been" drawn to these areas is too certain, and claiming that both "show traces of hominin use" makes it sound like every carcass was modified. It should be clearer that only some have possible evidence of this.

      The sentence only refers to both types of fauna taxonomically. No inference can therefore be drawn that all carcasses are modified.

      (21) Line 496: In most green-broken limb bones, we document the presence of a medullary cavity, despite the continuous presence of trabecular bone tissue on its walls.

      This sentence is confusing and doesn't seem to add anything meaningful. All limb bones naturally have a medullary cavity lined with trabecular bone, so it's unclear why this is noted as significant. The authors should clarify what they mean here or remove it if it's simply describing normal bone structure.

      No. Modern elephant long bones do not have a hollow medullary cavity. All the medullary volume is composed of trabecular tissue. Some elephants in the past had hollow medullary cavities, which probably contained larger amounts of marrow and fat. 

      (22) Line 518: We are not confident that the artefacts reported by de la Torre et al are indeed tools.

      While I generally agree with this statement, the paragraph reads as defensive rather than comparative. It would help if they briefly summarized what de la Torre et al. actually argued before explaining why they disagree.

      We devote two full pages of the Discussion section to doing precisely that.

      (23) Lines 518-574: They are similar to the green-broken specimens that we have reported here...

      This part is very detailed but inconsistent. They argue that the T69 marks could come from natural processes, but they use similar evidence (green fractures, overlapping scars) to argue for human activity at EAK. If equifinality applies to one, it applies to both.

      We are confused by this misinterpretation. Features like green fractures and overlapping scars (among others) can be used to detect anthropogenic agency in elephant bone breaking; that is, any given specimen can be determined to have been an “artifact” (in the sense of a human-created item), but there is a large distance between that and interpreting an artifact as a tool. Whereas an artifact (something made by a human) can be created indirectly through several processes (for example, demarrowing a bone, resulting in long bone fragments), a tool implies intentional manufacture, use, or both. That is the difference between de la Torre et al.'s interpretation and ours. We believe that they are showing anthropogenically made items, but they have provided no proof that they were tools.

      (24) Line 576: A final argument used by the authors to justify the intentional artifactual nature of their bone implements is that the bone tools were found in situ within a single stratigraphic horizon securely dated to 1.5 million years ago, indicating systematic production rather than episodic use. This is taphonomically unjustified.

      The reasoning here feels uneven in how clustering evidence is used. At EAK, clustering of bones and artifacts is taken as meaningful evidence of hominin activity, but here the same pattern at T69 is treated as a natural by-product of butchery or carnivore activity. If clustering alone cannot distinguish between intentional and incidental association, the authors should clarify why it is interpreted as diagnostic in one case but not in the other.

      Again, we are confused by this misinterpretation. It applies to two different scenarios/questions:

      a) Is there a functional link between tools and bones at EAK and T69? We have statistically demonstrated that at EAK, and we think de la Torre et al. are trying to do the same for T69, although using a different method.

      b) Are the purported tools at T69 tools? Are those that we report here tools? In this regard there is no evidence for either case, and given that several bones from T69 come from animals smaller than elephants, we do not rule out that carnivores might have been responsible for those, whereas hominin butchery might have been responsible for the intense long limb breaking at that site. It remains to be seen how many (if any) of those specimens were tools.

      (25) Line 600: If such a bone implement was a tool, it would be the oldest bone tool documented to date (>1.7 Ma).

      The comparison to prior studies is useful, and the point about missing use-wear traces is well taken. However, the last lines feel speculative. If no clear use evidence has been found, it's premature to suggest that one specimen "would be the oldest bone tool." That claim should be either removed or clearly stated as hypothetical.

      It clearly reads as hypothetical.

      (26) Line 606: Evidence documents that the oldest systematic anthropogenic exploitation of proboscidean carcasses are documented (at several paleolandscape scales) in the Middle Pleistocene sites of Neumark-Nord (Germany)(Gaudzinski-Windheuser et al., 2023a, 2023b).

      This is the first and only mention of Neumark-Nord in the paper, and it appears without any prior discussion or connection to the rest of the study. If this site is being used for comparison or as part of a broader temporal framework, it needs to be introduced and contextualized earlier. As written, it feels out of place and disconnected from the rest of the argument.

      This is a Late Pleistocene site and we do not see the need to present it earlier, given that the scope of this work is Early Pleistocene.

      (27) Line 608: Evidence of at least episodic access to proboscidean remains goes back in time (see review in Agam and Barkai, 2018; Ben-Dor et al., 2011; Haynes, 2022).

      The distinction between "systematic" and "episodic" exploitation is useful, but the authors should clarify what criteria define each. The phrase "episodic access...goes back in time" is vague and could be replaced with a clearer statement summarizing the nature of the earlier evidence.

      It is self-explanatory

      (28) Line 610: Redundant megafaunal exploitation is well documented at some early Pleistocene sites from Olduvai Gorge (Domínguez-Rodrigo et al., 2014a, 2014b; Organista et al., 2019, 2017, 2016).

      The phrase "redundant megafaunal exploitation" needs clarification. "Redundant" is not standard terminology in this context. Does this mean repeated, consistent, or overlapping behaviors? Also, while these same Olduvai sites are mentioned earlier, this phrasing also introduces new interpretive language not used before and implies a broader behavioral generalization than what the data actually show.

      Webster: Redundant means repetitive, occurring multiple times.

      (29) Line 612: At the very same sites, the stone artifactual assemblages, as well as the site dimensions, are substantially larger than those documented in the Bed I Oldowan sites (Diez-Martín et al., 2024, 2017, 2014, 2009).

      The placement and logic of this comparison are unclear. The discussion moves from Middle Pleistocene Neumark-Nord to early Pleistocene Olduvai sites, then to Bed I Oldowan contexts without clearly signaling the temporal or geographic transitions. If the intent is to contrast Acheulean vs. Oldowan site scale or organization, that connection needs to be made explicit. As written, it reads as a disjointed shift rather than a continuation of the argument.

      We disagree. Here, we finalize by bringing in some more recent assemblages where hominin agency is not in question.

      (30) Line 616: Here, we have reported a significant change in hominin foraging behaviors during Bed I and Bed II times, roughly coinciding with the replacement of Oldowan industries by Acheulian tool kits -although during Bed II, both industries co-existed for a substantial amount of time (Domínguez-Rodrigo et al., 2023; Uribelarrea et al., 2019, 2017).

      This section should be restructured for flow. The reference to behavioral change during Bed I-II and the overlap of Oldowan and Acheulean industries is important, but feels buried after a long detour. Consider moving this earlier or rephrasing so the main conclusion (behavioral change across Beds I-II) is clearly stated first, followed by supporting examples.

      It is not within the scope of this work and is properly described in the references mentioned.

      (31) Line 620: The evidence presented here, together with that documented by de la Torre et al. (2025), represents the most geographically extensive documentation of repeated access to proboscidean and other megafaunal remains at a single fossil locality.

      The phrase "most geographically extensive documentation of repeated access" overstates what has been demonstrated. The evidence presented is site-specific and does not justify such a broad superlative. This should be toned down or supported with comparative quantitative data.

      We disagree. There is no other example where such an abundant record of green-broken elements from megafauna is documented. Neumark-Nord is more similar because it shows extensive evidence of butchery, but not so much of degreasing.

      (32) Line 623: The transition from Oldowan sites, where lithic and archaeofaunal assemblages are typically concentrated within 30-40 m2 clusters, to Acheulean sites that span hundreds or even over 1000 m2 (as in BK), with distinct internal spatial organization and redundancy in space use across multiple archaeological layers spanning meters of stratigraphic sequence (Domínguez-Rodrigo et al., 2014a, 2009b; Organista et al., 2017), reflects significant behavioral and technological shifts.

      This sentence about site size and spatial organization repeats earlier claims without adding new insight. If it's meant as a synthesis, it should explicitly say how the spatial expansion relates to changes in behavior or mobility, not just describe the difference.

      In the Conclusion section these correlations have been explained in more detail to add some causation.

      (33) Line 628: This pattern likely signifies critical innovations in human evolution, coinciding with major anatomical and physiological transformations in early hominins (Dembitzer et al., 2022; Domínguez-Rodrigo et al., 2021, 2012).

      The conclusion that this "signifies critical innovations in human evolution" is too sweeping, given the data presented. It introduces physiological and anatomical transformation without connecting it to any evidence in this paper. Either cite the relevant findings or limit the claim to behavioral implications.

      The references cited elaborate on this extensively. The revised version of the Conclusion section also elaborates on this.

      Overall, the conclusions section reads as a loosely connected set of assertions rather than a focused synthesis. It introduces new interpretations and terminology not supported or developed earlier in the paper, and the argument jumps across temporal and geographic scales without clear transitions. The discussion should be restructured to summarize key results, clarify the scope of interpretation, and avoid speculative or overstated claims about evolutionary significance.

      We have done so, supported by the references used in addition to extending some of the arguments

      (34) Line 639: The systematic excavation of the stratigraphic layers involved a small crew.

      This sentence is not necessary.

      No comment

      (35) Line 643: The orientation and inclination of the artifacts were recorded using a compass and an inclinometer, respectively.

      What were these measurements used for (e.g., post-depositional movement analysis, spatial patterning)? A short note on the purpose would make this more meaningful.

      Fabric analysis has been added to the revised version.

      (36) Line 659: Restoration of the EAK elephant bones

      This section could be streamlined and clarified. It includes procedural detail that doesn't contribute to scientific replicability (e.g., the texture of gauze, number of consolidant applications), while omitting some key information (such as how restoration may have affected analytical results). It also contains interpretive comments ("most of the assemblage has been successfully studied") that don't belong in Methods.

      No comment

      (37) Line 689: In the field laboratory, cleaning of the bone remains was carried out, along with adhesion of fragments and their consolidation when necessary.

      Clarify whether cleaning or adhesion treatments might obscure or alter bone surface modifications, as this has analytical implications.

      These protocols do not impact bone like that anymore.

      (38) Line 711: (b) Percussion Tools - Includes hammerstones or cobbles exhibiting diagnostic battering, pitting, and/or impact scars consistent with percussive activities.

      Define how diagnostic features (battering, pitting) were identified - visual inspection, magnification, or quantitative criteria?

      Both macroscopically and microscopically.

      (39) Line 734: We conducted the analysis in three different ways after selecting the spatial window, i.e., the analysed excavated area (52.56 m²).

      Clarify why the 52.56 m² spatial window was chosen. Was this the total excavated area or a selected portion?

      It was what was left of the elephant accumulation after erosion.

      (40) Line 728: The spatial statistical analyses of EAK.

      Adding one or two sentences at the start explaining the analytical objective, such as testing spatial association between faunal and lithic materials, would help readers understand how each analysis relates to the broader research questions.

      This is well explained in the main text

      (41) Line 782: An intensive survey seeking stratigraphically-associated megafaunal bones was carried out in the months of June 2023 and 2024.

      It would help to specify whether the same areas were resurveyed in both field seasons or if different zones were covered each year. This information is important for understanding sampling consistency and potential spatial bias.

      Both areas were surveyed in both field seasons. We were very consistent.

      (42) Line 787: We focused on proboscidean bones and used hippopotamus bones, some of the most abundant in the megafaunal fossils, as a spatial control.

      Clarify how the hippopotamus remains function as a "spatial control." Are they used as a proxy for water-associated taxa to test habitat patterning, or as a baseline for comparing carcass distribution? The meaning of "control" in this context is ambiguous.

      As a proxy for megafaunal distribution given their greater abundance over any other megafaunal taxa.

      (43) Line 789: Stratigraphic association was carried out by direct observation of the geological context and with the presence of a Quaternary geologist during the whole survey.

      This is good methodological practice, but it would be helpful to describe how stratigraphic boundaries were identified in the field (for example, by reference to tuffs or marker beds). That information would make the geological framework more replicable.

      This is basic geological work. Of course, both tuffs and marker beds were followed.

      (44) Line 791: When fossils found were ambiguously associated with specific strata, these were excluded from the present analysis.

      You might specify what proportion of the total finds were excluded due to uncertain stratigraphic association. Reporting this would indicate the strength of the stratigraphic control.

      This was not quantified but it was a very small amount compared to those whose stratigraphic provenience was certain.

      (45) Line 799: The goals of this survey were: a) collect a spatial sample of proboscidean and megafaunal bones enabling us to understand if carcasses on the Olduvai paleolandscapes were randomly deposited or associated to specific habitats.

      You might clarify how randomness or habitat association was tested.

      Randomness was tested spatially and by comparing density according to ecotone. The same applies to habitat association.

      (46) The Methods section provides detailed information about excavation, restoration, and spatial analyses but omits critical details about the zooarchaeological and taphonomic procedures. There is no explanation of how faunal remains were analyzed once recovered, including how cut marks, percussion marks, or green bone fractures were identified or what magnification or diagnostic criteria were used. The authors also do not specify the analytical unit used for faunal quantification (e.g., NISP, MNI, MNE, or other), making it unclear how specimen counts were generated for spatial or taphonomic analyses. Even if these details are provided in the Supplementary Information, the main text should include at least a concise summary describing the analytical framework, the criteria for identifying surface modifications and fracture morphology, and the quantification system employed. This information is essential for transparency, replicability, and proper evaluation of the behavioral interpretations.

      See reply above. There is a new subsection on taphonomic methods now.

      Supplementary information:

      (47) The Supplementary Information includes a large number of green-broken proboscidean specimens from other Olduvai localities (BK, LAS, SC, FLK West), but it is never explained why these are shown or how they relate to the EAK study. The main analysis focuses entirely on the EAK elephant, including so much unrelated material without any stated purpose, which makes the supplement confusing. If these examples are meant only to illustrate the appearance of green fractures, that should be stated. Otherwise, the extensive inclusion of non-EAK material gives the impression that they were part of the analyzed assemblage when they were not.

      This is stated in the opening paragraph to the section.

      (48) Line 96: A small collection of green-broken elephant bones was retrieved from the lower and upper Bed II units.

      It would help to clarify whether these specimens are part of the EAK assemblage or derive from other Bed II localities. As written, it is not clear whether this description refers to material analyzed in the main text or to comparative examples shown only in the Supplementary Information.

      No, EAK only occupies the lower Bed II section. They belong in the Bed II paleolandscape units.

      (49) Line 97: One of them, a proximal femoral shaft found within the LAS unit, has all the traces of having been used as a tool (Figure 6).

      This says the bone tool in Figure 6 is from LAS, but the main text caption identifies it as from EAK. If I am not mistaken, EAK is a site at the base of Bed II, and LAS is a separate stratigraphic unit higher in the sequence, so the authors should clarify which is correct.

      Our mistake. Its provenience is LAS, in the vicinity of EAK.

      (50) Line 186: Figure S20. Example of other megafaunal long bone shafts showing green breaks.

      Not cited in text or SI narrative. No indication where these bones come from or why they are relevant.

      It appears justified in the revised version.

      (51) Line 474: Figure S28-S30. Hyena-ravaged giraffe bones from Chobe (Botswana).

      These figures are not discussed in the text or SI, and their relevance to the study is unclear. The authors should explain why these modern comparative examples were included and how they inform interpretations of the Olduvai assemblages.

      It appears justified in the revised version.

      (52) Line 498: Figure S31. Bos/Bison bone from Bois Roche (France).

      This figure is not mentioned in the text or Supplementary Information. The authors should specify why this specimen is shown and how it contributes to the study's taphonomic or behavioral comparisons.

      It appears justified in the revised version.

      (53) Line 504: Figure S32. Miocene Gomphotherium femur from Spain.

      This figure is never referenced in the paper. The authors should clarify the purpose of including a Miocene specimen from outside Africa and explain what it adds to the interpretation of Bed II material.

      It appears justified in the revised version.

      (54) Line 508: Figure S33. Elephant femoral shaft from BK (Olduvai).

      This figure appears to show comparative material but is not cited or discussed in the text. The authors should explain why the BK material is presented here and how it relates to EAK or the broader analysis.

      There are two figures labeled S33.

      It appears justified in the revised version.

      (55) Line 515: Figure S33. Tibia fragment from a large medium-sized bovid displaying multiple overlapping scars on both breakage planes inflicted by carnivore damage.

      Because this figure repeats the S33 label and is not cited or explained in the text, it is unclear why this specimen is included or how it contributes to the study. The authors should correct the duplicate numbering and clarify the purpose of this figure.

      It appears justified in the revised version.

      (56) Line 522: Same specimen as shown in Figure S30, viewed on its medial side.

      This is not the same bone as S30. This figure is not discussed in the text or Supplementary Information. The authors should clarify why it is included and how it relates to the rest of the analysis.

      It appears justified in the revised version.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews: 

      Reviewer #1 (Public review): 

      Summary: 

      This paper focuses on understanding how covalent inhibitors of peroxisome proliferator-activated receptor-gamma (PPARg) show improved inverse agonist activities. This work is important because PPARg plays essential roles in metabolic regulation, insulin sensitization, and adipogenesis. Like other nuclear receptors, PPARg is a ligand-responsive transcriptional regulator. Its important role, coupled with its ligand-sensitive transcriptional activities, makes it an attractive therapeutic target for diabetes, inflammation, fibrosis, and cancer. Traditional non-covalent ligands like thiazolidinediones (TZDs) show clinical benefit in metabolic diseases, but utility is limited by off-target effects and transient receptor engagement. In previous studies, the authors characterized and developed covalent PPARg inhibitors with improved inverse agonist activities. They also showed that these molecules engage unique PPARg ligand binding domain (LBD) conformations whereby the C-terminal helix 12 penetrates into the orthosteric binding pocket to stabilize a repressive state. In the nuclear receptor superclass of proteins, helix 12 is an allosteric switch that governs pharmacologic responses, and this new conformation was highly novel. In this study, the authors did a more thorough analysis of how two covalent inhibitors, SR33065 and SR36708, influence the structural dynamics of PPARg LBD.

      Strengths: 

      (1) The authors employed a compelling integrated biochemical and biophysical approach.  

      (2) The cobinding studies are unique for the field of nuclear receptor structural biology, and I'm not aware of any similar structural mechanism described for this class of proteins.  

      (3) Overall, the results support their conclusions.  

      (4) The results open up exciting possibilities for the development of new ligands that exploit the potential bidirectional relationship between the covalent versus non-covalent ligands studied here. 

      Weaknesses: 

      (1) The major weakness in this work is that it is hard to appreciate what these shifting allosteric ensembles actually look like on the protein structure. Additional graphical representations would really help convey the exciting results of this study. 

      We thank the reviewer for the comments. In response to the specific recommendations below, we added two new figures (Figure 1 and Figure 8 in this resubmission) that we hope address the weakness identified by the reviewer.

      Reviewer #2 (Public review): 

      Summary: 

      The authors use ligands (inverse agonists, partial agonists) for PPAR, and coactivators and corepressors, to investigate how ligands and cofactors interact in a complex manner to achieve functional outcomes (repressive vs. activating). 

      Strengths: 

      The data (mostly biophysical data) are compelling from well-designed experiments. Figures are clearly illustrated. The conclusions are supported by these compelling data. These results contribute to our fundamental understanding of the complex ligand-cofactor-receptor interactions. 

      Weaknesses: 

      This is not the weakness of this particular paper, but the general limitation in using simplified models to study a complex system. 

      We appreciate the reviewer’s comments. Breaking down a complex system into a simpler model system, when possible, provides a unique lens with which to probe systems with mechanistic insight. Simplified models may not always explain the complexity of systems in cells. However, our recently published work showed that a simplified model system, combining biochemical assays using reconstituted PPARγ ligand-binding domain (LBD) protein and peptides derived from coregulator proteins (similar to the assays in this current work) with protein NMR structural biology studies using PPARγ LBD, can explain the activity of ligand-induced PPARγ activation and repression to a high degree (Pearson/Spearman correlation coefficients ~0.7-0.9):

      MacTavish BS, Zhu D, Shang J, Shao Q, He Y, Yang ZJ, Kamenecka TM, Kojetin DJ. Ligand efficacy shifts a nuclear receptor conformational ensemble between transcriptionally active and repressive states. Nat Commun. 2025 Feb 28;16(1):2065. doi: 10.1038/s41467-025-57325-4. PMID: 40021712; PMCID: PMC11871303.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors): 

      (1) More set-up is needed in the results section. The first paragraph is unclear on what is new to this study versus what was done previously. Likewise, a brief description of the assays used and the meaning behind differences in signals would help the general reader along. 

      We modified the last paragraph of the introduction and first results section to hopefully better set the stage for what was done previously vs. what is new/recollected in this study. In our results section, we also include more description about what the assays measure.

      (2) Since this paper is building on previous work, additional figures are needed in the introduction and discussion. Graphical depictions of what was found in the first study on how these ligands uniquely influence PPARg LBD conformation. A new model/depiction in the discussion for what was learned and its context with the rest of the field. 

      Our revised manuscript includes a new Figure 1 describing the possible allosteric mechanism by which a covalent ligand inhibits binding of other non-covalent ligands that was inferred from our previous study; and a new Figure 8 with a model for what has been learned.

      (3) It is stated that the results shown are representative data for at least two biological replicates. However, I do not see the other replicates shown in the supplementary information. 

      We appreciate the Reviewer’s emphasis on data reproducibility and rigor. We confirm that the biochemical and cellular assay data presented are indeed representative of consistent findings observed across two or more biological replicates. Consistent with standard practice, we show representative data in our figures rather than the extensive replicate data in the supplementary information.

      (4) Figure 1a could benefit from labels of antagonists, inverse agonist, etc., next to each chemical structure. Likewise, if any co-crystal or other models are available it would be helpful to include those for comparison. 

      We added the pharmacological labels to Figure 2a (old Figure 1a).

      (5) The figure legends don't seem to match up completely with the figures. For example, the legend for Figure 2b states that fitted Ki values +/- standard deviation are shown, but the figure shows the log Ki.

      We revised the figure legends to ensure they display the appropriate errors as reported from the data fitting.

      (6) EC50, IC50, Ki, and Kd values alongside reported errors and R2 values for the fits should be reported in a table. 

      Our revised manuscript now includes a Source Data file (Figure 5—source data 1.xlsx) of the data (n=2) plotted in Figure 5 (old Figure 4) so that readers can regenerate the plots and calculate the errors and R2 values if desired. Otherwise, fitted values and errors are reported in figures when fitting in Prism permitted and reported errors; when Prism was unable to fit data or fit the error, n.d. (not determined) is specified.

      (7) Statistical analysis is missing in some places, for example, Figure 1b. 

      We revised Figure 2b (old Figure 1b) to include statistical testing.

      Reviewer #2 (Recommendations for the authors): 

      I suggest that the authors discuss the following points to broaden the significance of the results: 

      (1) The two partial agonists (MRL24 and nTZDpa) are "partial" in the coactivator and corepressor recruitment assays, but are "complete" in the TR-FRET ligand displacement assay (Figure 2). Please explain that a partial agonist is defined based on the functional outcome (cofactor recruitment in this study) but not binding affinity/efficacy.

      We added the following sentence to describe the partial agonist activity of these compounds: “These high affinity ligands are partial agonists as defined on their functional outcome in coregulator recruitment and cellular transcription; i.e., they are less efficacious than full agonists at recruiting peptides derived from coactivator proteins in biochemical assays (Chrisman et al., 2018; Shang et al., 2019; Shang and Kojetin, 2024) and increasing PPARγ-mediated transcription (Acton et al., 2005; Berger et al., 2003).”

      (2) Will the discovery reported here be broadly applicable? 

      (a) Applicable if other partial agonists and inhibitors are used? 

      (b) Applicable if different coactivators/corepressors, or different segments of the same cofactor, are used?

      (c) Applicable to other NRs (their AF-2 are similar but with sequence variation)?

      (d) The term "allosteric" might mean different things to different people - many readers might think that it means a "distal and unrelated" binding pocket. It might be helpful to point out that in this study, the allosteric site is actually "proximal and related". 

      We expanded our introduction and/or discussion sections to expand upon these concepts; specific answers as follows:

      (a) Orthosteric partial agonists? Yes, because helix 12 would clash with an orthosteric ligand. Other covalent inhibitors? It depends on whether the covalent inhibitor stabilizes helix 12 in the orthosteric pocket.

      (b) yes with some nuanced exceptions where certain segments of the same coregulator protein bind with high affinity and others apparently do not bind or bind with low affinity

      (c) it is not clear yet if other NRs share a similar ligand-induced conformational ensemble to PPARγ

      (d) we addressed this point in the 4th paragraph of the introduction “...the non-covalent ligand binding event we previously described at the alternate/allosteric site, which is proximal to the orthosteric ligand-binding pocket, …”

    1. Reviewer #1 (Public review):

      Summary:

      Matsen et al. describe an approach for training an antibody language model that explicitly tries to remove effects of "neutral mutation" from the language model training task, e.g. learning the codon table, which they claim results in biased functional predictions. They do so by modeling empirical sequence-derived likelihoods through a combination of a "mutation" model and a "selection" model; the mutation model is a non-neural Thrifty model previously developed by the authors, and the selection model is a small Transformer that is trained via gradient descent. The sequence likelihoods themselves are obtained from analyzing parent-child relationships in natural SHM datasets. The authors validate their method on several standard benchmark datasets and demonstrate its favorable computational cost. They discuss how deep learning models explicitly designed to capture selection and not mutation, trained on parent-child pairs, could potentially apply to other domains such as viral evolution or protein evolution at large.

      Strengths:

      Overall, we think the idea behind this manuscript is really clever and shows promising empirical results. Two aspects of the study are conceptually interesting: the first is factorizing the training likelihood objective to learn properties that are not explained by simple neutral mutation rules, and the second is training not on self-supervised sequence statistics but on the differences between sequences along an antibody evolutionary trajectory. If this approach generalizes to other domains of life, it could offer a new paradigm for training sequence-to-fitness models that is less biased by phylogeny or other aspects of the underlying mutation process.

      Weaknesses:

      Some claims made in the paper are weakly or indirectly supported by the data. In particular, the claim that learning the codon table contributes to biased functional effect predictions may be true, but requires more justification. Additionally, the paper could benefit from additional benchmarking and comparison to enhanced versions of existing methods, such as AbLang plus a multi-hit correction. Further descriptions of model components and validation metrics could help make the manuscript more readable.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1:

      In this well-written and timely manuscript, Rieger et al. introduce Squidly, a new deep learning framework for catalytic residue prediction. The novelty of the work lies in the aspect of integrating per-residue embeddings from large protein language models (ESM2) with a biology-informed contrastive learning scheme that leverages enzyme class information to rationally mine hard positive/negative pairs. Importantly, the method avoids reliance on the use of predicted 3D structures, enabling scalability, speed, and broad applicability. The authors show that Squidly outperforms existing ML-based tools and even BLAST in certain settings, while an ensemble with BLAST achieves state-of-the-art performance across multiple benchmarks. Additionally, the introduction of the CataloDB benchmark, designed to test generalization at low sequence and structural identity, represents another important contribution of this work.

      We thank the reviewer for their constructive and encouraging assessment of the manuscript. We appreciate the recognition of Squidly’s biology-informed contrastive learning framework with ESM2 embeddings, its scalability through the avoidance of predicted 3D structures, and the contribution of the CataloDB benchmark. We are pleased that the reviewer finds these aspects to be of value, and their comments will help us in further clarifying the strengths and scope of the work.

      The manuscript acknowledges biases in EC class representation, particularly the enrichment for hydrolases. While CataloDB addresses some of these issues, the strong imbalance across enzyme classes may still limit conclusions about generalization. Could the authors provide per-class performance metrics, especially for underrepresented EC classes?

      We thank the reviewer for raising this point. We agree that per-class performance metrics provide important insight into generalizability across underrepresented EC classes. In response, we have updated Figure 3 to include two additional panels: (i) per-EC F1, precision and recall scores, and (ii) a relative display of true positives against the total number of predictable catalytic residues. These additions allow the class imbalance to be more directly interpretable. We have also revised the text between lines 316-321 to better contextualize our generalizability claims in light of these results.

      An ablation analysis would be valuable to demonstrate how specific design choices in the algorithm contribute to capturing catalytic residue patterns in enzymes.

      We agree an ablation analysis is beneficial to show the benefits of a specific approach. We consider the main design choice in Squidly to be how we select the training pairs, hence we chose a standard design choice for the contrastive learning model. We tested the effect of different pair schemes on performance and report the results in Figure 2A and lines 244-258. These results are a targeted ablation in which we evaluate Squidly against AEGAN using the AEGAN training and test datasets, while systematically varying the ESM2 model size and pair-mining scheme. As a baseline, we included the LSTM trained directly on ESM2 embeddings and random pair selection. We showed that indeed the choice of pairs has a large impact on performance, which is significantly improved when compared to naïve pairing. This comparison suggests that performance gains are attributable to reaction-informed pair-mining strategies. We recognize that the way these results were originally presented made this ablation less clear. We have revised the wording in the Results section (lines 244-247) and updated the caption to Figure 2A to emphasize the purpose of this section of the paper.

      The statement that users can optionally use uncertainty to filter predictions is promising but underdeveloped. How should predictive entropy values be interpreted in practice? Is there an empirical threshold that separates high- from low-confidence predictions? A demonstration of how uncertainty filtering shifts the trade-off between false positives and false negatives would clarify the practical utility of this feature.

      Thank you for the suggestion. Your comment prompted us to consider the best way to represent the uncertainty and, additionally, the best metric to return to users and how to visualize the results. Based on this, we included several new figures (Figure 3H and Supplementary Figures S3-5). We used these figures to select the cutoffs (mean prediction of 0.6, and variance < 0.225), which were then set as the defaults in Squidly and used in all subsequent analyses. The effect of these cutoffs is most evident in the tradeoff of precision and recall. Hence users may opt to select their own filters based on the mean prediction and variance across the predictions, and these cutoffs can be passed as command line parameters to Squidly. The choice to use a consistent default cutoff selected using the Uni3175 benchmark has slightly improved the reported performance for the benchmarks seen in Table 1 and Figure 3C. However, our interpretation remains the same.
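      As a rough illustration of how a mean/variance filter of this kind could work, the sketch below keeps a residue only when its mean ensemble score reaches the mean cutoff and its variance stays below the variance cutoff. The function names, the input layout (residue index mapped to a list of per-model scores), the use of population variance, and the inclusive comparison on the mean are all assumptions made for illustration; this is not Squidly's actual code or command-line interface. Only the two cutoff values come from the response above.

```python
from statistics import mean, pvariance

# Default cutoffs quoted in the author response.
MEAN_CUTOFF = 0.6
VARIANCE_CUTOFF = 0.225


def filter_residues(ensemble_scores, mean_cutoff=MEAN_CUTOFF,
                    variance_cutoff=VARIANCE_CUTOFF):
    """Return residue indices that pass both cutoffs.

    ensemble_scores: dict mapping residue index -> list of per-model scores
    (hypothetical layout). Population variance and ">=" on the mean are
    assumptions, not details confirmed by the source.
    """
    kept = []
    for idx in sorted(ensemble_scores):
        scores = ensemble_scores[idx]
        if mean(scores) >= mean_cutoff and pvariance(scores) < variance_cutoff:
            kept.append(idx)
    return kept


def precision_recall(predicted, true_sites):
    """Precision and recall of predicted residue indices against true sites."""
    predicted, true_sites = set(predicted), set(true_sites)
    hits = len(predicted & true_sites)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(true_sites) if true_sites else 0.0
    return precision, recall


# Toy scores for three residues from a five-model ensemble (made-up numbers).
toy_scores = {
    10: [0.90, 0.95, 0.85, 0.90, 0.92],  # confident and consistent
    42: [0.10, 0.20, 0.15, 0.05, 0.10],  # consistently low
    77: [0.00, 1.00, 1.00, 0.00, 1.00],  # models disagree strongly
}
kept_default = filter_residues(toy_scores)
kept_strict = filter_residues(toy_scores, mean_cutoff=0.95)
```

      Raising the mean cutoff or lowering the variance cutoff removes more predictions, which in general trades recall for precision; `precision_recall` can be used to trace that trade-off on a labeled benchmark.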

      The excerpt highlights computational efficiency, reporting substantial runtime improvements (e.g., 108 s vs. 5757 s). However, the comparison lacks details on dataset size, hardware/software environment, and reproducibility conditions. Without these details, the speedup claim is difficult to evaluate. Furthermore, it remains unclear whether the reported efficiency gains come at the expense of predictive performance.

      Thank you for pointing out this limitation in how we presented the runtime results. We have rerun the tests and updated the table. An additional comment is added underneath, which details the hardware/software environment used to run both tools, as well as that the Squidly model is the ensemble version. As per the relationship between efficiency gains and predictive performance, both 3B and 15B models are benchmarked side by side across the paper.

      Compared to the tools we were able to comprehensively benchmark, it does not come at a cost. However, we note that the runtime benefit assumes that a structure must be folded, which is not the case for enzymes already present in the PDB. Such enzymes are likely already annotated, and in those cases we recommend using BLAST, which is faster than either Squidly or a structure-based tool and highly accurate for homologous or annotated sequences.

      Given the well-known biases in public enzyme databases, the dataset is likely enriched for model organisms (e.g., E. coli, yeast, human enzymes) and underrepresents enzymes from archaea, extremophiles, and diverse microbial taxa. Would this limit conclusions about Squidly's generalizability to less-studied lineages?

      The enrichment for model organisms in public enzyme databases may indeed affect both ESM2 and Squidly when applied to underrepresented lineages such as archaea, extremophiles, and diverse microbial taxa. We agree that this limitation is significant and have adjusted and expanded the previous discussion of benchmarking limitations accordingly (lines 358, 369). We thank the reviewer for highlighting this issue, which has helped us to improve the transparency and balance of the manuscript.

      Reviewer #2:

      The authors aim to develop Squidly, a sequence-only catalytic residue prediction method. By combining protein language model (ESM2) embedding with a biologically inspired contrastive learning pairing strategy, they achieve efficient and scalable predictions without relying on three-dimensional structure. Overall, the authors largely achieved their stated objectives, and the results generally support their conclusions. This research has the potential to advance the fields of enzyme functional annotation and protein design, particularly in the context of screening large-scale sequence databases and unstructured data. However, the data and methods are still limited by the biases of current public databases, so the interpretation of predictions requires specific biological context and experimental validation.

      Strengths:

      The strengths of this work include the innovative methodological incorporation of EC classification information for "reaction-informed" sample pairing, thereby enhancing the discriminative power of contrastive learning. Results demonstrate that Squidly outperforms existing machine learning methods on multiple benchmarks and is significantly faster than structure prediction tools, demonstrating its practicality.

      Weaknesses:

      Disadvantages include the lack of a systematic evaluation of the impact of each strategy on model performance. Furthermore, some analyses, such as PCA visualization, exhibit low explained variance, which undermines the strength of the conclusions.

      We thank the reviewer for their comments and feedback. 

      The authors state that "Notably, the multiclass classification objective and benchmarks used to evaluate EasIFA made it infeasible to compare performance for the binary catalytic residue prediction task." However, EasIFA has also released a model specifically for binary catalytic site classification. The authors should include EasIFA in their comparisons in order to provide a more comprehensive evaluation of Squidly's performance.

      We thank the reviewer for raising this point. EasIFA’s binary classification task includes catalytic, binding, and “other” residues, which differs from Squidly’s strict catalytic residue prediction. This makes direct comparison non-trivial, which is why we had originally opted not to benchmark against EasIFA and instead to highlight it in our discussion.

      Given your comment, we did our best to include a benchmark that could give an indication of a comparison between the two tools. To do this, we filtered EasIFA’s multiclass classification test dataset for a subset that does not overlap with the Squidly and AEGAN training data and has <40% sequence identity to all training sets. This left only 66 catalytic residue-containing sequences that we could use as a held-out test set for both tools. We note that the comparison is not strictly equal, as Squidly and AEGAN had lower average identity to this subset (8.2%) than EasIFA (23.8%), placing them at a relative disadvantage.

      We also identified a potential limitation in EasIFA’s original recall calculation, where sequences lacking catalytic residues were assigned a recall of 0. We adapted this to instead consider only the sequences which do have catalytic residues, which increased recall across all models. With the updated evaluation, EasIFA continues to show strong performance, consistent with it being SOTA if structural inputs are available. Squidly remains competitive given it operates solely from sequence and has a lower sequence identity to this specific test set.
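      The effect of the two recall conventions described above can be illustrated with a minimal sketch. The function names and the toy residue sets below are hypothetical and are not taken from the EasIFA or Squidly code; the sketch only assumes that recall is computed per sequence and then averaged.

```python
from typing import Iterable, Optional, Set, Tuple


def per_sequence_recall(true_sites: Set[int], predicted_sites: Set[int]) -> Optional[float]:
    """Recall for one sequence, or None when the sequence has no catalytic residues."""
    if not true_sites:
        return None  # recall is undefined: there is nothing to recover
    return len(true_sites & predicted_sites) / len(true_sites)


def mean_recall(pairs: Iterable[Tuple[Set[int], Set[int]]], skip_empty: bool) -> float:
    """Average per-sequence recall.

    skip_empty=False scores sequences without catalytic residues as 0.
    skip_empty=True leaves them out of the average.
    """
    recalls = []
    for true_sites, predicted_sites in pairs:
        recall = per_sequence_recall(true_sites, predicted_sites)
        if recall is None:
            if skip_empty:
                continue
            recall = 0.0
        recalls.append(recall)
    return sum(recalls) / len(recalls) if recalls else float("nan")


# Hypothetical toy data: (true catalytic positions, predicted positions).
pairs = [({10, 52}, {10, 52}), ({7}, {7, 9}), (set(), {3})]
recall_with_zeros = mean_recall(pairs, skip_empty=False)
recall_positives_only = mean_recall(pairs, skip_empty=True)
```

      In this toy case the single sequence without catalytic residues pulls the averaged recall down even though every true site was recovered, which is the behaviour the adapted calculation avoids.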

      Due to the small and imbalanced benchmark size, differences in training data overlap, and differences in our analysis compared with the original EasIFA analysis, we present this comparison in a new section (A.4) of the supplementary information rather than in the main text. References to this section have been added in the manuscript at lines 265-268. Additionally, we have updated the discussion to emphasize the potential benefits of using EasIFA at lines 353-356.

      The manuscript proposes three schemes for constructing positive and negative sample pairs to reduce dataset size and accelerate training, with Schemes 2 and 3 guided by reaction information (EC numbers) and residue identity. However, two issues remain:

      (a) The authors do not systematically evaluate the impact of each scheme on model performance.

      (b) In the benchmarking results, it is not explicitly stated which scheme was used for comparison with other models (e.g., Table 1, Figure 6, Figure 8). This lack of clarity makes it difficult to interpret the results and assess reproducibility.

      (c) Regarding the negative samples in Scheme 3 in Figure 1, no sampling patterns are shown for residue pairs with the same amino acid, different EC numbers, and both being catalytic residues.

      We thank the reviewer for these suggestions, which enabled us to improve the clarity and presentation of the manuscript. Please find our point by point response:

      (a) We thank the reviewer for highlighting the lack of clarity in the way we presented our evaluation in the section describing the Uni3175 benchmark. We aimed to systematically evaluate the impact of each scheme using the Uni3175 benchmark and refer to these results at lines 244-258. Additionally, in line with related comments from reviewer 1, we have adjusted the presentation of this section at lines 244-247 to make clear that the purpose of this section and its benchmark results is to allow a comparison of each scheme to baseline models and AEGAN. These results led us to use Scheme 3 in both models for the other benchmarks in Figures 2 and 3. Please let us know if there is anything we can do to further improve the interpretability of Squidly’s performance.

      (b) We thank the reviewer for highlighting this issue and improving the clarity of our manuscript. We agree that after the Uni3175 benchmark was used to evaluate the schemes, we did not clearly state in the other benchmarks that Scheme 3 was chosen for both the 3B and 15B models. We have made changes in Table 1 and the legends of Figures 2 and 3 to state that Scheme 3 was used. In addition, we integrated related results into panel figures (e.g. Figures 2 and 3 now show models trained and tested on consistent benchmark datasets) and standardized figure colors and legend formatting throughout. Furthermore, we suspect that the previous switching between the individual and ensembled Squidly models throughout the paper was not well indicated and was likely to confuse the reader. Therefore, we decided to consistently report the ensembled Squidly models for all benchmarks except in the ablation study (Figure 2A). In line with this, we altered the overview Figure 1A so that it is clearer that the default and intended version of Squidly is the ensemble.

      (c) We appreciate the reviewer pointing this out. The reviewer is correct: we deliberately did not sample the negatives described by the reviewer in Scheme 3, as our focus was on the hard negatives that relate most to the binary objective. We do think this is a great idea and would be worth exploring further in future versions of Squidly, where we will be expanding the label space used for hard-negative sampling and including binding sites in our prediction. We have updated the discussion at lines 395-396 to highlight this potential direction.

      The PCA visualization (Figure 3) explains very little variance (~5% + 1.8%), but its use to illustrate the separability of embedding and catalytic residues may overinterpret the meaning of the low-dimensional projection. We question whether this figure is appropriate for inclusion in the main text and suggest that it be moved to the Supporting Information.

      We thank the reviewer for this suggestion. We had discussed this as well, and in the end decided to include it in the main manuscript. We agree that the explained variance is low. However, when we first saw the PCA we were surprised that there was any separation at all. This then prompted us to investigate further, so we kept it in the manuscript to be true to the scientific story. However, we do agree that our interpretation could be read as overly conclusive given the minimal variance explained by the top 2 PCs. Therefore, we agree with the assessment that the figure, alongside the accompanying results section, is more appropriately placed in the supplementary information. We moved this section (A.1) to the appendix to still explain the exploratory data analysis process that we used to tackle this problem, so that the general thought process behind Squidly is available for further reading.

      Minor Comments:

      (1) Figure Quality and Legends

      (a) In Figure 4, the legend is confusing: "Schemes 2 and 3 (S1 and S2) ..." appears inconsistent, and the reference to Scheme 3 (S3) is not clearly indicated.

      (b) In Figure 6, the legend overlaps with the y-axis labels, reducing readability. The authors should revise the figures to improve clarity and ensure consistent notation.

      The reviewer correctly notes inconsistencies in figure presentation. We have revised the legend of Figure 4 (now 2A) to ensure schemes are referred to consistently and Scheme 3 (S3) is clearly indicated. We also adjusted Figure 6 (now 2c) to remove the overlap between the legend and y-axis labels.  

      Conclusion

      We thank the reviewers and editor again for their constructive input. We believe the revisions and clarifications have substantially strengthened the manuscript and the resource.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This study used explicit-solvent simulations and coarse-grained models to identify the mechanistic features that allow for the unidirectional motion of SMC on DNA. Shorter explicit-solvent models describe relevant hydrogen bond energetics, which were then encoded in a coarse-grained structure-based model. In the structure-based model, the authors mimic chemical reactions as signaling changes in the energy landscape of the assembly. By cycling through the chemical cycle repeatedly, the authors show how these time-dependent energetic shifts naturally lead SMC to undergo translocation steps along DNA that are on a length scale that has been identified.

      Strengths:

      Simulating large-scale conformational changes in complex assemblies is extremely challenging. This study utilizes highly detailed models to parameterize a coarse-grained model, thereby allowing the simulations to connect the dynamics of precise atomistic-level interactions with a large-scale conformational rearrangement. This study serves as an excellent example of this overall methodology, and future studies may further extend this approach to investigate any number of complex molecular assemblies.

      We thank the reviewer for careful reading of our manuscript and highlighting the value of our bottom-up multiscale simulation approach.

      Weaknesses:

      The only relative weakness is that the text does not always clearly communicate which aspects of the dynamics are expected to be robust. That is, which aspects of the dynamics/energetics are less precisely described by this model? Where are the limits of the models, and why should the results be considered within the range of applicability of the models?

      We appreciate this insightful comment and agree that it is important to more explicitly describe the robustness and limitations of the simulation model used in this study. In response to this comment, we have revised the Discussion section of our manuscript.

      First, to clarify the robust aspects of our model, we have added a new subsection titled “Parametric choices and robustness of simulation model” to the Discussion, which is as follows:

      “The switching Gō approach adopted in this study is a powerful tool for providing the relationship between known large-scale conformational changes and the resulting functional and mechanical dynamics of the molecular machine (Brandani and Takada, 2018b; Koga and Takada, 2006b; Nagae et al., 2025). In this study, we mimic conformational change induced by ATP binding and hydrolysis events by instantaneously switching the potential energy function from one that stabilized a given conformation to another that stabilized a different conformation. This drives the protein to undergo a conformational transition toward the minimum of the new energy landscape.

      This approach is particularly well suited to investigate whether a given conformational change in a subunit of a molecular machine can produce the overall motion observed, and whether this process is mechanically feasible. Therefore, the fundamental mechanisms identified in this study, i.e., DNA segment capture mechanism, the correlation between step size and loop length, and the unidirectional translocation mechanism originating from the asymmetric kleisin path, can be considered as robust, as they emerge directly from the structural and topological constraints of the SMC-kleisin architecture rather than from tuned parameters.”
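      The idea of instantaneously switching the potential can be illustrated with a deliberately simple one-dimensional toy, which is not the authors' coarse-grained model. The sketch below assumes a harmonic potential U(x) = 0.5·k·(x − c)² whose minimum c is switched between hypothetical values, overdamped Langevin dynamics, and arbitrary reduced units; all names and parameter values are illustrative.

```python
import math
import random


def simulate_switching(centers, steps_per_state, k=1.0, kT=0.1,
                       gamma=1.0, dt=0.01, x0=0.0, seed=0):
    """Overdamped Langevin dynamics on a harmonic well whose minimum is switched.

    For each entry c in `centers`, the potential U(x) = 0.5*k*(x - c)**2 is held
    fixed for `steps_per_state` steps, then replaced instantaneously by the next
    one. The coordinate x is carried over, so it relaxes toward the new minimum.
    """
    rng = random.Random(seed)
    noise = math.sqrt(2.0 * kT * dt / gamma)  # Euler-Maruyama noise amplitude
    x = x0
    trajectory = []
    for c in centers:
        for _ in range(steps_per_state):
            force = -k * (x - c)
            x += force * dt / gamma + noise * rng.gauss(0.0, 1.0)
            trajectory.append(x)
    return trajectory


# Hypothetical minima standing in for three states visited in one cycle.
centers = [0.0, 5.0, 2.0]
steps = 2000
traj = simulate_switching(centers, steps)

# Mean position over the second half of each state, after relaxation.
late_means = [
    sum(traj[i * steps + steps // 2:(i + 1) * steps]) / (steps - steps // 2)
    for i in range(len(centers))
]
```

      The switch itself carries no information about barriers or rates, which is the sense in which the approach is a "vertical excitation"; only the relaxation that follows is simulated.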

      Additionally, to more clearly define the limits of our model, we have expanded the "Limitations in current simulations" subsection. Specifically, we have added a detailed discussion regarding the energetics and transition pathways inherent to the switching Gō approach, which is as follows:

      “First, the use of switching potentials to trigger conformational changes imposes a limitation on predictive power for energetics and transition pathways. The switching of potentials is akin to a “vertical excitation” from one energy landscape to another, rather than a thermally activated crossing of an energy barrier. Consequently, the model cannot provide quantitative predictions of the transition rates or the free energy barriers associated with these changes. Furthermore, while the subsequent relaxation follows the new potential landscape, it is not guaranteed to reproduce the unique, physically correct transition pathway. Nevertheless, this simplification is justified because conformational changes within the protein are expected to occur on a much faster timescale than the large-scale motion of the DNA. Thus, this simplification has a limited impact on our main conclusions regarding the functional DNA dynamics driven by these large-scale conformational changes.”

      We have not made any additions regarding the timescale and dwell times for each ATP state, as these were already discussed in the original manuscript.

      Reviewer #2 (Public review):

      Summary:

      The authors perform coarse grained and all atom simulations to provide a mechanism for loop extrusion that is involved in genome compaction.

      Strengths:

      The simulations are very thoughtful. They provide insights into the translocation process, which is only one of the mechanisms. Much of the analysis is very good. Overall, the study advances the use of simulations in these complicated systems.

      We sincerely thank the reviewer for their thoughtful and encouraging comments.

      Weaknesses:

      Even the authors point out several limitations, which cannot be easily overcome in the paper because of the paucity of experimental data. Nevertheless, the authors could have done so to illustrate the main assertion that loop extrusion occurs by the motor translocating on DNA. They should mention more clearly that there are alternative theories that have accounted for a number of experimental data.

      We thank the reviewer for these constructive suggestions. As the reviewer pointed out, it is important to state more explicitly how the unidirectional DNA translocation revealed in this study relates to the widely recognized loop-extrusion hypothesis of genome organization and to situate our findings within the context of major alternative theories.

      To address this, we first clarify the relationship between the translocation mechanism we observed and the phenomenon of loop extrusion. We emphasize that our simulations were designed to elucidate the core motor activity of the SMC complex, and we explicitly state our view that loop extrusion is a functional consequence of this motor activity when the complex is anchored to DNA.

      Second, as the reviewer also suggested, we addressed in more detail alternative models of loop extrusion that also have experimental support. We have revised the Discussion accordingly to provide a more balanced and comprehensive context. Further details are provided in our separate response to the comment below.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, Yamauchi and colleagues combine all-atom and coarse-grained MD simulations to investigate the mechanism of DNA translocation by prokaryotic SMC complexes. Their multiscale approach is well-justified and supports a segment-capture model in which ATP-dependent conformational changes lead to the unidirectional translocation of DNA. A key insight from the study is that asymmetry in the kleisin path enforces directionality. The work introduces an innovative computational framework that captures key features of SMC motor action, including DNA binding, conformational switching, and translocation.

      This work is well executed and timely, and the methodology offers a promising route for probing other large molecular machines where ATP activity is essential.

      Strengths:

      This manuscript introduces an innovative yet simple method that merges all-atom and coarse-grained, purely equilibrium, MD simulations to investigate DNA translocation by SMC complexes, which is triggered by activated ATP processes. Investigating the impact of ATP on large molecular motors like SMC complexes is extremely challenging, as ATP catalyses a series of chemical reactions that take and keep the system out of equilibrium. The authors simulate the ATP cycle by cycling through distinct equilibrium simulations where the force field changes according to whether the system is assumed to be in the disengaged, engaged, and V-shaped states; this is very clever as it avoids attempting to model the non-equilibrium process of ATP hydrolysis explicitly. This equilibrium switching approach is shown to be an effective way to probe the mechanistic consequences of ATP binding and hydrolysis in the SMC complex system.

      The simulations reveal several important features of the translocation mechanism. These include identifying that a DNA segment of ~200 bp is captured in the engaged state and pumped forward via coordinated conformational transitions, yielding a translocation step size in good agreement with experimental estimates. Hydrogen bonding between DNA and the top of the ATPase heads is shown to be critical for segment capture, as without it, translocation is shown to fail. Finally, asymmetry in the kleisin subunit path is shown to be responsible for unidirectionality.

      This work highlights how molecular simulations are an excellent complement to experiments, as they can exploit experimental findings to provide high-resolution mechanistic views currently inaccessible to experiments. The findings of these simulations are plausible and expand our understanding of how ATP hydrolysis induces directional motion of the SMC complex.

      We thank the reviewer for the thoughtful and encouraging assessment of our work. We appreciate the reviewer’s summary of our key contributions, especially our switching Gō strategy, the segment-capture mechanism of SMC translocation, and the role of kleisin-path asymmetry in ensuring unidirectionality.

      Weaknesses:

      There are aspects of the methodology and modelling assumptions that are not clear and could be better justified. The major ones are listed below:

      (1) The all-atom MD simulations involve a 47-bp DNA duplex interacting with the ATPase heads, from which key residues involved in hydrogen bonding are identified. However, DNA mechanics-including flexibility and hydrogen bond formation-are known to be sequence-dependent. The manuscript uses a single arbitrary sequence but does not discuss potential biases. Could the authors comment on how sequence variability might affect binding geometry or the number of hydrogen bonds observed?

      We thank the reviewer for this insightful comment regarding the potential effects of DNA sequence.

      The primary biological role of the SMC complex is to organize genome architecture on a global scale; as such, its fundamental interaction with DNA is considered not to be sequence-specific. Our all-atom MD simulations and analysis pipeline were designed to probe the nature of this general interaction. Our approach confirms this rationale: the analysis exclusively identified hydrogen bonds formed between amino acid residues and the phosphate groups of the DNA's sugar-phosphate backbone. As shown in Figs. 1B and 1C, the results confirm that the key stabilizing interactions occur between basic residues on the SMC head surface and the DNA backbone. Since the backbone is chemically uniform, the stable binding mode we characterized is inherently sequence-independent.

      While the final bound state is likely sequence-independent, we agree that sequence-dependent properties such as local DNA flexibility or intrinsic curvature could influence the kinetics of the binding process. For example, the rate of initial recognition or the ease of DNA bending on the head surface might vary between AT-rich and GC-rich regions. However, once the DNA is bound, we expect the stable binding geometry and the identity of the key interacting residues to be conserved across different sequences.

      Therefore, we are confident that using a single, representative DNA sequence is a valid approach for elucidating the fundamental, non-sequence-specific aspects of SMC-DNA interaction and does not alter the general validity of the translocation mechanism proposed in this work.

      (2) A key feature of the coarse-grained model is the inclusion of a specific hydrogen-bonding potential between DNA and residues on the ATPase heads. The authors select the top 15 hydrogen-bond-forming residues from the all-atom simulations (with contact probability > 0.05), but the rationale for this cutoff is not explained. Also, the strength of hydrogen bonds in coarse-grained models can be sensitive to context. How did the authors calibrate the strength of this interaction relative to electrostatics, and did they test its robustness (e.g., by varying epsilon or residue set)? Could this interaction be too strong or too weak under certain ionic conditions? What happens when salt is changed?

      Thank you for these comments. We provide our rationale for the parameter choices below.

      The contact probability cutoff of 0.05 was chosen to create a comprehensive set of residues that form physically robust interactions with DNA. To establish this robustness, we performed a parallel set of all-atom simulations using a different force field (see Fig. S2). This cross-validation revealed two key points. First, the top six residues (Arg120, Arg123, Ile63, Arg111, Arg62, and Lys56), which include experimentally confirmed DNA-binding sites, consistently exhibited the highest contact probabilities in both force fields, confirming the reliability of our identification. Second, and just as importantly, many residues with lower contact probabilities (e.g., Trp115, Tyr107, Arg105, Ser124, and Ser54) were also consistently detected across both simulations. This reproducibility suggests that these interactions are physically robust and not artifacts of a specific force field. We therefore concluded that a 0.05 cutoff is a well-balanced threshold that ensures the inclusion of not only the primary anchor residues but also the secondary, moderately interacting residues that are crucial for cooperatively stabilizing the DNA. We discuss this point in the Methods section of the revised manuscript, as follows:

      “The rationale for this cutoff is the physical robustness of the identified interactions; all-atom simulations using a different force field confirmed that the same set of key interacting residues, including both strong and moderate binders, was consistently identified (Fig. S2).”

      The strength of the hydrogen bond potential was set to ϵ = 4.0 k_BT (≈2.4 kcal/mol), a physically plausible value corresponding to an ideal hydrogen bond. To test the robustness of this parameterization, we performed preliminary simulations where we varied these parameters by (i) reducing the value of ϵ and (ii) restricting the interaction to only the top six anchor residues. In both test cases, while a short DNA duplex (47 bp) could still bind to the ATPase heads, simulations with a long DNA (800 bp) failed to form a stable DNA loop after initial docking. These tests demonstrated that a larger set of cooperative interactions with a physically realistic strength was necessary for the full segment capture mechanism. Our final parameter set (15 residues at ϵ = 4.0 k_BT) was thus chosen as the parameter set required to capture both the initial anchoring of DNA and the subsequent cooperative stabilization of the captured loop.
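      The unit conversion behind the quoted value of ϵ can be checked with a short sketch. The temperature of 300 K is an assumption made here for illustration (the response does not state the temperature used), and the function name is hypothetical.

```python
# Molar gas constant in kcal/(mol*K); equal to the Boltzmann constant per mole.
R_KCAL_PER_MOL_K = 0.0019872041


def thermal_units_to_kcal_per_mol(eps_in_kt: float, temperature_k: float = 300.0) -> float:
    """Convert an energy given in units of the thermal energy to kcal/mol.

    temperature_k defaults to 300 K, which is an assumption of this sketch.
    """
    return eps_in_kt * R_KCAL_PER_MOL_K * temperature_k


eps_kcal = thermal_units_to_kcal_per_mol(4.0)  # close to the quoted 2.4 kcal/mol
```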

      As correctly pointed out, ionic conditions are a critical factor. Our simulations revealed that the salt concentration had a more pronounced effect on the kinetics of the DNA finding its correct binding site than on the thermodynamic stability of the final bound state. During our parameter tuning, we found that at physiological salt conditions (150 mM), long-range electrostatic interactions become dominant. This caused the DNA to be non-specifically captured by positively charged patches on the sides of the heads, which are not the functional binding sites. This off-pathway trapping kinetically prevented the DNA from reaching its proper location within the simulation timeframe. In contrast, the high-salt conditions (300 mM) used in this study screen these long-range interactions, suppressing non-specific trapping and allowing the DNA to efficiently explore the protein surface. This enables the correct binding to be established via the specific, short-range hydrogen bonds. Therefore, the ion concentration in our model acts primarily as a crucial kinetic control factor for reproducing the correct binding pathway within a realistic simulation timeframe. This point is discussed in the new subsection entitled “Parametric choices and robustness of simulation model”.
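      How strongly the two salt conditions screen electrostatics can be estimated from the Debye length of a 1:1 electrolyte. The sketch below is a textbook estimate rather than a calculation from the authors' model; the temperature (298.15 K) and the relative permittivity of water (78.4) are assumed values, and the function name is hypothetical.

```python
import math

EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
N_A = 6.02214076e23          # Avogadro constant, 1/mol
E_CHARGE = 1.602176634e-19   # elementary charge, C


def debye_length_nm(ionic_strength_molar: float,
                    temperature_k: float = 298.15,
                    rel_permittivity: float = 78.4) -> float:
    """Debye screening length (nm) for a monovalent salt solution.

    Temperature and permittivity defaults are assumptions of this sketch.
    """
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    numerator = EPS0 * rel_permittivity * K_B * temperature_k
    denominator = 2.0 * N_A * E_CHARGE ** 2 * ionic_strength_si
    return math.sqrt(numerator / denominator) * 1e9


lambda_150 = debye_length_nm(0.150)  # physiological salt
lambda_300 = debye_length_nm(0.300)  # salt used in the translocation runs
```

      Under these assumptions the screening length shrinks by a factor of √2 on doubling the salt concentration, from just under 0.8 nm to just under 0.6 nm, which is consistent with the weaker long-range attraction to off-pathway charged patches described above.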

      (3) To enhance sampling, the translocation simulations are run at 300 mM monovalent salt. While this is argued to be physiological for Pyrococcus yayanosii, such a concentration also significantly screens electrostatics, possibly altering the interaction landscape between DNA and protein or among protein domains. This may significantly impact the results of the simulations. Why did the authors not use enhanced sampling methods to sample rare events instead of relying on a high-salt regime to accelerate dynamics?

      We agree that enhanced sampling methods are powerful for exploring rare events. However, many of these techniques require the pre-definition of a suitable, low-dimensional reaction coordinate (RC) to guide the simulation. The primary goal of our study was to discover the DNA translocation mechanism as it emerges naturally from fundamental physical interactions, without imposing a priori assumptions about the specific pathway.

      The DNA segment capture process is complex, involving the coordinated motion of a long DNA polymer and multiple protein domains. Defining a simple RC in advance was not feasible and would have carried a significant risk of biasing the system toward an artificial pathway. Therefore, to avoid such bias, we chose to perform direct, unbiased molecular dynamics simulations. Using a physiologically relevant high-salt concentration (300 mM) for Pyrococcus yayanosii was a strategy to accelerate the system's natural dynamics, allowing us to observe these unbiased trajectories within a feasible computational timescale.

      Because our current work has elucidated the fundamental steps of this mechanism, we agree that this work provides a foundation for more quantitative analyses. As suggested, future studies using methods like Markov State Model analysis or enhanced sampling techniques, guided by more sophisticated RCs defined from the insights of this work, would be a valuable next step for characterizing the free-energy landscape of the process or longer time scale dynamics.

      (4) Only a small fraction of the simulated trajectories complete successful translocation (e.g., 45 of 770 in one set), and this is attributed to insufficient simulation time. While the authors are transparent about this, it raises questions about the reliability of inferred success rates and about possible artefacts (e.g., DNA trapping in coiled-coil arms). Could the authors explore or at least discuss whether alternative sampling strategies (e.g., Markov State Models, transition path sampling) might address this limitation more systematically?

      We thank the reviewer for raising this point that is crucial for considering limitations and future directions of our study.

      As we noted in a previous response, the primary reason we did not employ such enhanced sampling methods was the limited prior knowledge available to define the previously uncharacterized DNA translocation process. Therefore, we first sought to define the key conformational states and transitions without the potential bias of a predefined model or reaction coordinate. This approach was successful, as it allowed us to identify critical on-pathway states like “DNA segment capture” and significant off-pathway or kinetically trapped states such as “DNA trapping” between the coiled-coil arms.

      We fully agree that the low success rate observed is a key finding that points to significant kinetic bottlenecks, and that a more systematic analysis is required. Having identified the essential states, applying techniques such as Markov State Models (MSMs) or transition path sampling represents a powerful and logical next step. These methods, using a state-space definition based on our findings, will enable a quantitative characterization of the free-energy landscape and the transition rates between states. This will provide a rigorous understanding of the kinetic factors, such as the depth of the trapped-state energy well, that underlie the low translocation efficiency.
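      The statistical uncertainty on a success rate such as the 45 of 770 trajectories mentioned by the reviewer can be gauged with a binomial confidence interval. The sketch below uses the Wilson score interval; this is a choice made here for illustration and is not an analysis reported by the authors.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    )
    return center - half_width, center + half_width


# Counts quoted in the reviewer's comment for one set of simulations.
low, high = wilson_interval(45, 770)
```

      For these counts the interval spans roughly 4% to 8%, so the observed rate of about 6% is reasonably well determined as a property of the model on the simulated timescale, although it says nothing about trajectories that would succeed given longer runs.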

      In the revised manuscript, we discuss the application of these advanced sampling methods as a feasible and promising future direction, which is as follows:

      “Future studies can leverage the insights from this work to overcome the current timescale limitations. Techniques such as Markov state modeling (Husic and Pande, 2018; Prinz et al., 2011) or enhanced sampling methods (Hénin et al., 2022) may be employed to quantitatively characterize the free-energy landscape and transition rates. Such an approach would provide a rigorous understanding of the kinetic barriers, such as the stability of the trapped state, that govern the efficiency of SMC translocation.”
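      As a rough illustration of the Markov state modeling mentioned in the quoted passage, the sketch below estimates a transition matrix from a discretized trajectory by row-normalizing transition counts. The three state labels and the toy trajectory are hypothetical, and this simple estimator omits the reversibility constraints and lag-time validation that a real MSM analysis would include.

```python
from typing import List, Sequence


def estimate_transition_matrix(dtraj: Sequence[int], n_states: int, lag: int = 1) -> List[List[float]]:
    """Row-normalized count estimate of transition probabilities at a given lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for i in range(len(dtraj) - lag):
        counts[dtraj[i]][dtraj[i + lag]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # States never visited as a starting point get an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical state labels: 0 = unbound, 1 = segment captured, 2 = trapped.
dtraj = [0, 0, 1, 1, 0, 2, 2, 2, 0, 1]
T = estimate_transition_matrix(dtraj, n_states=3)
```

      In such a model, a large self-transition probability for the trapped state would be one quantitative signature of the kinetic bottleneck discussed above.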

      Reviewer #1 (Recommendations for the authors):

      As noted in the public review, there could be a more systematic description of the limits of the model. The model appears to be carefully crafted, though every model has limits. It could be helpful for the general readership to give some idea of which parametric choices are more critical, and which mechanistic features should be robust to minor changes in parameters.

      We sincerely thank the reviewer for this constructive comment. We agree that clarifying which aspects of our model are robust and which are sensitive to specific parameter choices is crucial for the reader's understanding.

      We have expanded the Discussion to clarify how specific simulation parameters affect the efficiency and success rate of DNA translocation in our coarse-grained simulations. In particular, we have added a description of the parametric choices for (i) selection and strength of hydrogen bonds, (ii) ionic strength, and (iii) interaction strength between the coiled-coil arms. The discussion can be found in the subsection entitled “Parametric choices and robustness of simulation model” in the Discussion, which is as follows:

      “On the other hand, the efficiency and success rate of DNA translocation in our simulations are more sensitive to certain parametric choices. For instance, the selection and strength of hydrogen bond-like interactions are a key factor. Our model incorporates specific hydrogen bonds between the upper surface of the ATPase heads and DNA, based on all-atom simulations. These interactions are essential for initiating segment capture; without them, DNA fails to migrate to the correct binding surface. While the identification of these key residues is a robust finding, persisting across different all-atom force fields (Fig. S2), their strength and number in the coarse-grained potential are critical parameters that directly influence the probability and kinetics of DNA capture. Another critical parameter is the ionic strength. We performed translocation simulations at an ionic strength of 300 mM to accelerate DNA dynamics. At lower concentrations, non-specific electrostatic interactions between DNA and positively charged patches on the sides of the ATPase heads or coiled-coil arms became dominant, hindering the efficient migration of DNA to its functional binding site. Using a higher-than-physiological ionic strength is a justified practice in coarse-grained simulations employing the Debye-Hückel approximation, as it serves as a first-order correction to mimic the strong local charge screening by condensed counterions that is not explicitly captured by the mean-field model (Brandani et al., 2021; Niina et al., 2017b). Finally, the interaction strength between the coiled-coil arms is also important. In our model, once the arms closed during the transition from the V-shaped to the disengaged state, they remained closed on the simulated timescale, frequently trapping DNA pushed from the hinge and thereby leading to failed translocation. This behavior suggests that the arm–arm interactions may be overestimated. A parameterization that allows for more frequent, transient opening of the arms could increase the success rate of DNA pumping.”

      Reviewer #2 (Recommendations for the authors):

      This paper reports simulations (all atom and coarse grained) to provide molecular details of loop extrusion. In general, it is a well done paper. There are a few issues that the authors should address.

      (1) The study supposes that loop extrusion occurs by translocation. Although they point out alternate models like scrunching (C Dekker; the theory by Takaki is also based on the scrunching model that the authors should mention), they should discuss this further. After all, the Takaki theory does predict several experimental outcomes very accurately. The precise mechanism has not been nailed down - The paper by Terakawa in Science suggests the extrusion is by translocation, but the evidence is not clear.

      We thank the reviewer for this insightful comment. We agree that our discussion should briefly acknowledge alternative models such as scrunching. We have therefore revised the manuscript to mention the theory by Takaki et al. (Nat. Commun., 2021), which reproduces several experimental outcomes.

      Because our present work specifically addresses the translocation mechanism based on DNA segment capture, we now state that scrunching and related models represent alternative proposals for loop extrusion.

      In this revision, we have added discussion to the end of the subsection titled "DNA segment capture as the mechanism of the DNA translocation by SMC complexes." in the Discussion section, which is as follows:

      “Turning to loop extrusion mechanisms, alternative mechanisms have been proposed in addition to the DNA-segment capture model. For example, Takaki et al. developed a scrunching-based theory that quantitatively accounts for several experimental observations, including force-velocity relationships and step-size distributions. While our present study focuses on the DNA translocation mechanism via segment capture, it is important to note that scrunching and other models remain plausible alternatives for loop extrusion. The precise mechanism may depend on the specific SMC complex and its subunits and remains to be fully resolved.”

      (2) It is unclear how one can say from Figure 4I and J that translocation has taken place. These panels show that the base pair length increases. This should be explained more clearly. They should also simultaneously plot the location of the heads (2D plot).

      Thank you for this valuable suggestion. In response to the comment on how translocation is presented in Fig. 4I and J, we have revised the text in the subsection entitled “DNA translocation via DNA-segment capture” to make it clear that the SMC complex moves along DNA, as follows:

      “Fig. 4I represents the one-dimensional contour coordinate of the DNA molecule, indexed by base pairs (1-800). In this plot, translocation is visualized as a discontinuous shift in the range of base-pair indices that the SMC complex contacts over one complete ATP cycle”

      “This translocation is recorded in Fig. 4I as the average coordinate of the kleisin contact region (red dots) jumps from ~400 bp before the cycle to ~600 bp after, which corresponds to a translocation event of ~200 bp”

      We believe that adding this explanation makes it clearer to readers that Fig. 4I and 4J provide direct evidence for unidirectional translocation of the SMC complex.
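The readout described above can be sketched as a difference of mean contact coordinates before and after one ATP cycle. The function name and the base-pair indices below are hypothetical illustration values, not data from the simulations.

```python
def translocation_step_bp(contact_bp_before, contact_bp_after):
    """Shift (in bp) of the mean DNA contact coordinate across one ATP cycle."""
    mean_before = sum(contact_bp_before) / len(contact_bp_before)
    mean_after = sum(contact_bp_after) / len(contact_bp_after)
    return mean_after - mean_before

# Hypothetical base-pair indices contacted by the kleisin region.
before_cycle = [390, 400, 410]
after_cycle = [590, 600, 610]
step = translocation_step_bp(before_cycle, after_cycle)
```

A positive value indicates movement toward higher base-pair indices, and a value near zero would correspond to a cycle without net translocation.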

      (3) The transitions between the states are very abrupt (see Figure 2). Please explain. Also, in which state does extrusion take place? What is the role of the V-shape - is it part of the ATPase cycle?

      We thank the reviewer for raising these questions.

      In our simulation, we implemented changes in the ATP-binding state by instantaneously switching the structure-based (Gō-type) potential between reference conformations for the disengaged (apo), engaged (ATP-bound), and V-shaped (ADP-bound) states at predetermined times. The system rapidly relaxes along the new funnel-shaped potential energy surface toward its minimum. This rapid relaxation is why the transition appears abrupt in metrics such as the Q-score in Fig. 2.
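The switching protocol described above can be sketched as a lookup of the active reference state as a function of the simulation step. The state order follows the cycle described in this response; the durations and names are hypothetical placeholders, not the values used in the study.

```python
# Hypothetical cycle: (reference state, duration in MD steps). Durations are assumptions.
CYCLE = [("disengaged", 1000), ("engaged", 1000), ("v_shaped", 1000)]

def active_reference(step, cycle=CYCLE):
    """Return the reference conformation whose Go-type potential is active at `step`."""
    period = sum(duration for _, duration in cycle)
    t = step % period  # the cycle repeats after one full period
    for name, duration in cycle:
        if t < duration:
            return name
        t -= duration
    raise AssertionError("unreachable: t is always smaller than the period")
```

Because the reference state changes in a single step, any order parameter measured against the previous reference (such as a Q-score) drops as soon as the system relaxes on the new energy surface, which is consistent with the abrupt transitions noted by the reviewer.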

      The V-shaped state corresponds to a key ADP-bound intermediate within the ATP hydrolysis cycle. Its primary role in our model is preparatory; it establishes the necessary open geometry that allows for the subsequent "zipping" of the coiled-coil arms. Crucially, unidirectional pumping motion is generated during the transition from the V-shaped state to the disengaged state. That is, the zipping motion of the coiled-coil arm pushes the captured DNA segment forward, resulting in a net translocation along the DNA.

      (4) It appears the heads do not move between the disengaged to engaged states. Why not in their model?

      Thank you for pointing out the lack of clarity in our explanation of the SMC head movement in our simulations.

      In our model, the transition from the disengaged to the engaged state involves a dynamic rearrangement of the SMC heads. Specifically, one ATPase head slides (~10 Å) and rotates (~85°) relative to the other ATPase head to re-associate at a new dimer interface. This movement drives the global conformational change of the complex from a rod-like shape to an open ring, a mechanism proposed in a previous structural study (Diebold-Durand et al., Mol. Cell, 2017).

      As reviewer 2 noted, this crucial motion, which is reflected in the changing head-head distance and hinge angle in Fig. 2A, was not sufficiently highlighted in the text. We have therefore revised the manuscript to explicitly describe this head rearrangement to improve clarity, which is as follows:

      “Upon transition to the engaged state, the two ATPase heads were quickly rearranged to form the new inter-subunit contacts. Specifically, this rearrangement involves one ATPase head sliding by approximately 10 Å and rotating by 85° relative to the other, allowing it to associate through a different interface (Diebold-Durand et al., 2017b). The fractions of formed contacts, Q-scores, that exist at the disengaged (engaged) states quickly decreased (increased) (Fig. 2A, top two plots).”

      (5) What is pumping - it has been used in Marko NAR in the DNA capture model. How is that illustrated in the simulations?

      We thank the reviewer for raising this point. In the context of the DNA segment-capture model by Marko et al. (NAR, 2019), "pumping" refers to the conceptual process where a DNA loop, captured in an upper compartment of the SMC ring, is transferred to a lower compartment, resulting in net translocation.

      Our simulations provide a direct, molecular-resolution visualization of the physical mechanism underlying this concept. We illustrate that the "pumping" action is not a passive transfer but an active, mechanical process driven by a specific conformational change. This occurs during the transition from the V-shaped (ADP-bound) to the disengaged state. As shown in our trajectories, the two coiled-coil arms close in a zipper-like manner, beginning from the hinge and progressing toward the ATPase heads. This zipping motion physically pushes the captured DNA segment from the hinge region toward the kleisin ring.

      This process is visualized in our simulations as a clear, unidirectional translocation step (see Figs. 4B–D, 4I, and S6). The result is a net forward movement of the DNA by a distance that corresponds to the length of the initially captured loop, a key prediction of Marko’s model that we quantify in our step-size analysis (Figs. 4K–L and S8).

      To make this point clearer for the reader, we have revised the manuscript. We have explicitly defined this "zipping and pushing" action as the physical basis for the "pumping" mechanism in the subsection titled "Zipping motion of coiled-coil arms pushes the DNA from hinge domain toward kleisin ring", as follows:

      “This active, mechanical pushing of the DNA loop, driven by the sequential closing of the coiled-coil arm, constitutes the physical basis of the “pumping” mechanism that drives unidirectional translocation. Our simulations thus provide a concrete, molecular-level visualization for this key step in the DNA segment-capture model.”

      (6) The length of DNA simulated is small for understandable reasons. Both experiments and theory show that loop extrusion sizes can be very large, far exceeding the sizes of the SMC complex. Could the small size of DNA be affecting the results?

      We thank the reviewer for this important comment. The relationship between our simulated system size and the large-scale phenomena observed experimentally is a key point.

      Our study was specifically designed to elucidate the fundamental mechanism of the elementary, single-cycle translocation step at near-atomic resolution. For this purpose, the 800 bp DNA length was sufficient. The observed translocation step size per cycle was 216 ± 71 bp, which is substantially smaller than the total length of the simulated DNA. This confirms that the boundaries of our system did not artificially constrain the core translocation process we aimed to investigate. Therefore, we think that the DNA length used in this study did not systematically bias our main findings regarding the motor mechanism itself.

      On the other hand, as the reviewer pointed out, our current setup cannot reproduce the formation of kilobase-scale loops. We hypothesize that these large-scale events are intrinsically linked to the stochastic nature of the ATP hydrolysis cycle, which was simplified in our simulation model. We used fixed durations for each state for computational feasibility. In a more realistic scenario, a stochastically prolonged engaged state would give a captured DNA loop more time to grow via thermal diffusion. This could lead to occasional, much larger translocation steps upon ATP hydrolysis, contributing to the large loop sizes seen experimentally.

      (7) Minor point: The first CG model using three sites was introduced in PNAS vol 102, 6789 2005. The authors should consider citing it.

      Thank you for this suggestion. We have now cited the paper the reviewer recommended. Please see the subsection entitled “Coarse-grained simulations” in Materials and Methods.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public review):

      Summary:

      Wu and colleagues aimed to explain previous findings that adolescents, compared to adults, show reduced cooperation following cooperative behaviour from a partner in several social scenarios. The authors analysed behavioural data from adolescents and adults performing a zero-sum Prisoner's Dilemma task and compared a range of social and non-social reinforcement learning models to identify potential algorithmic differences. Their findings suggest that adolescents' lower cooperation is best explained by a reduced learning rate for cooperative outcomes, rather than differences in prior expectations about the cooperativeness of a partner. The authors situate their results within the broader literature, proposing that adolescents' behaviour reflects a stronger preference for self-interest rather than a deficit in mentalising.

      Strengths:

      The work as a whole suggests that, in line with past work, adolescents prioritise value accumulation, and this can be, in part, explained by algorithmic differences in weighted value learning. The authors situate their work very clearly in past literature, and make it obvious the gap they are testing and trying to explain. The work also includes social contexts that move the field beyond non-social value accumulation in adolescents. The authors compare a series of formal approaches that might explain the results and establish generative and model-comparison procedures to demonstrate the validity of their winning model and individual parameters. The writing was clear, and the presentation of the results was logical and well-structured.

      We thank the reviewer for recognizing the strengths of our work.

      Weaknesses:

      (1) I also have some concerns about the methods used to fit and approximate parameters of interest. Namely, the use of maximum likelihood versus hierarchical methods to fit models on an individual level, which may reduce some of the outliers noted in the supplement, and also may improve model identifiability.

      We thank the reviewer for this suggestion. Following the comment, we added a hierarchical Bayesian estimation. We built a hierarchical model with both group-level (adolescent group and adult group) and individual-level structures for the best-fitting model. Four Markov chains with 4,000 samples each were run, and the model converged well (see Figure supplement 7).

      We then analyzed the posterior parameters for adolescents and adults separately. The results were consistent with those from the MLE analysis. These additional results have been included in the Appendix Analysis section (also see Figure supplement 5 and 7). In addition, we have updated the code and provided the link for reference. We appreciate the reviewer’s suggestion, which improved our analysis.
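Convergence of several Markov chains is often summarized with the Gelman-Rubin statistic (R-hat). The response does not state which diagnostic was used, so the sketch below is an assumption about a typical check, written with the standard library only; it implements the classic (non-split) form for one parameter.

```python
import statistics

def gelman_rubin_rhat(chains):
    """Classic (non-split) R-hat for one parameter; `chains` is a list of equal-length sample lists."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: mean within-chain variance; B: between-chain variance scaled by chain length.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled_variance = (n - 1) / n * within + between / n
    return (pooled_variance / within) ** 0.5
```

Values close to 1 indicate that the chains agree with one another, whereas values well above 1 indicate that at least one chain is sampling a different region of the posterior.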

      (2) There was also little discussion given the structure of the Prisoner's Dilemma, and the strategy of the game (that defection is always dominant), meaning that the preferences of the adolescents cannot necessarily be distinguished from the incentives of the game, i.e. they may seem less cooperative simply because they want to play the dominant strategy, rather than a lower preference for cooperation if all else was the same.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma. 

      However, our computational modeling explicitly addressed this possibility. Model 4 (inequality aversion) captures decisions that are driven purely by self-interest or aversion to unequal outcomes, including a parameter reflecting disutility from advantageous inequality, which represents self-oriented motives. If participants’ behavior were solely guided by the payoff-dominant strategy, this model should have provided the best fit. However, our model comparison showed that Model 5 (social reward) performed better in both adolescents and adults, suggesting that cooperative behavior is better explained by valuing social outcomes beyond payoff structures.

      Moreover, if adolescents’ lower cooperation simply reflected a strategic response to the payoff structure, with defection adopted as the more rewarding option, then adolescents should show reduced cooperation across all rounds. Instead, adolescents and adults behaved similarly when partners defected, but adolescents cooperated less when partners cooperated and showed little increase in cooperation even after consecutive cooperative responses. This pattern suggests that adolescents’ lower cooperation cannot be explained solely by strategic responses to payoff structures but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded our Discussion to acknowledge this important point and to clarify how the behavioral and modeling results address the reviewer’s concern.

      “Overall, these findings indicate that adolescents’ lower cooperation is unlikely to be driven solely by strategic considerations, but may instead reflect differences in the valuation of others’ cooperation or reduced motivation to reciprocate. Although defection is the payoff-dominant strategy in the Prisoner’s Dilemma, the selective pattern of adolescents’ cooperation and the model comparison results indicate that their reduced cooperation cannot be fully explained by strategic incentives, but rather reflects weaker valuation of social reciprocity.”

      Appraisal & Discussion:

      (3) The authors have partially achieved their aims, but I believe the manuscript would benefit from additional methodological clarification, specifically regarding the use of hierarchical model fitting and the inclusion of Bayes Factors, to more robustly support their conclusions. It would also be important to investigate the source of the model confusion observed in two of their models.

      We thank the reviewer for this comment. In the revised manuscript, we have clarified the hierarchical Bayesian modeling procedure for the best-fitting model, including the group- and individual-level structure and convergence diagnostics. The hierarchical approach produced results that fully replicated those obtained from the original maximum-likelihood estimation, confirming the robustness of our findings. Please also see the response to (1).

      Regarding the model confusion between the inequality aversion (Model 4) and social reward (Model 5) models in the model recovery analysis, both models’ simulated behaviors were best captured by the baseline model. This pattern arises because neither model includes learning or updating processes. Given that our task involves dynamic, multi-round interactions, models lacking a learning mechanism cannot adequately capture participants’ trial-by-trial adjustments, resulting in similar behavioral patterns that are better explained by the baseline model during model recovery. We have added a clarification of this point to the Results:

      “The overlap between Models 4 and 5 likely arises because neither model incorporates a learning mechanism, making them less able to account for trial-by-trial adjustments in this dynamic task.”

      (4) I am unconvinced by the claim that failures in mentalising have been empirically ruled out, even though I am theoretically inclined to believe that adolescents can mentalise using the same procedures as adults. While reinforcement learning models are useful for identifying biases in learning weights, they do not directly capture formal representations of others' mental states. Greater clarity on this point is needed in the discussion, or a toning down of this language.

      We sincerely thank the reviewer for this professional comment. We agree that our prior wording regarding adolescents’ capacity to mentalise was somewhat overgeneralized. Accordingly, we have toned down the language in both the Abstract and the Discussion to better align our statements with what the present study directly tests. Specifically, our revisions focus on adolescents’ and adults’ ability to predict others’ cooperation in social learning. This is consistent with the evidence from our analyses examining adolescents’ and adults’ model-based expectations and self-reported scores on partner cooperativeness (see Figure 4). In the revised Discussion, we state:

      “Our results suggest that the lower levels of cooperation observed in adolescents stem from a stronger motive to prioritize self-interest rather than a deficiency in predicting others’ cooperation in social learning”.

      (5) Additionally, a more detailed discussion of the incentives embedded in the Prisoner's Dilemma task would be valuable. In particular, the authors' interpretation of reduced adolescent cooperativeness might be reconsidered in light of the zero-sum nature of the game, which differs from broader conceptualisations of cooperation in contexts where defection is not structurally incentivised.

      We thank the reviewer for this comment and agree that adolescents’ lower cooperation may partly reflect a rational response to the incentive structure of the Prisoner’s Dilemma. However, our behavioral and computational evidence suggests that this pattern cannot be explained solely by strategic responses to payoff structures, but rather reflects a reduced sensitivity to others’ cooperative behavior or weaker social reciprocity motives. We have expanded the Discussion to acknowledge this point and to clarify how both behavioral and modeling results address the reviewer’s concern (see also our response to 2).

      (6) Overall, I believe this work has the potential to make a meaningful contribution to the field. Its impact would be strengthened by more rigorous modelling checks and fitting procedures, as well as by framing the findings in terms of the specific game-theoretic context, rather than general cooperation.

      We thank the reviewer for the professional comments, which have helped us improve our work.

      Reviewer #2 (Public review):

      Summary:

      This manuscript investigates age-related differences in cooperative behavior by comparing adolescents and adults in a repeated Prisoner's Dilemma Game (rPDG). The authors find that adolescents exhibit lower levels of cooperation than adults. Specifically, adolescents reciprocate partners' cooperation to a lesser degree than adults do. Through computational modeling, they show that this relatively low cooperation rate is not due to impaired expectations or mentalizing deficits, but rather a diminished intrinsic reward for reciprocity. A social reinforcement learning model with asymmetric learning rate best captured these dynamics, revealing age-related differences in how positive and negative outcomes drive behavioral updates. These findings contribute to understanding the developmental trajectory of cooperation and highlight adolescence as a period marked by heightened sensitivity to immediate rewards at the expense of long-term prosocial gains.

      Strengths:

      (1) Rigid model comparison and parameter recovery procedure.

      (2) Conceptually comprehensive model space.

      (3) Well-powered samples.

      We thank the reviewer for highlighting the strengths of our work.

      Weaknesses:

      A key conceptual distinction between learning from non-human agents (e.g., bandit machines) and human partners is that the latter are typically assumed to possess stable behavioral dispositions or moral traits. When a non-human source abruptly shifts behavior (e.g., from 80% to 20% reward), learners may simply update their expectations. In contrast, a sudden behavioral shift by a previously cooperative human partner can prompt higher-order inferences about the partner's trustworthiness or the integrity of the experimental setup (e.g., whether the partner is truly interactive or human). The authors may consider whether their modeling framework captures such higher-order social inferences. Specifically, trait-based models, such as those explored in Hackel et al. (2015, Nature Neuroscience), suggest that learners form enduring beliefs about others' moral dispositions, which then modulate trial-by-trial learning. A learner who believes their partner is inherently cooperative may update less in response to a surprising defection, effectively showing a trait-based dampening of learning rate.

      We thank the reviewer for this thoughtful comment. We agree that social learning from human partners may involve higher-order inferences beyond simple reinforcement learning from non-human sources. To address this, we had previously included such mechanisms in our behavioral modeling. In Model 7 (Social Reward Model with Influence), we tested a higher-order belief-updating process in which participants’ expectations about their partner’s cooperation were shaped not only by the partner’s previous choices but also by the inferred influence of their own past actions on the partner’s subsequent behavior. In other words, participants could adjust their belief about the partner’s cooperation by considering how their partner’s belief about them might change. Model comparison showed that Model 7 did not outperform the best-fitting model, suggesting that incorporating higher-order influence updates added limited explanatory value in this context. As suggested by the reviewer, we have further clarified this point in the revised manuscript.

      Regarding trait-based frameworks, we appreciate the reviewer’s reference to Hackel et al. (2015). That study elegantly demonstrated that learners form relatively stable beliefs about others’ social dispositions, such as generosity, especially when the task structure provides explicit cues for trait inference (e.g., resource allocations and giving proportions). By contrast, our study was not designed to isolate trait learning, but rather to capture how participants update their expectations about a partner’s cooperation over repeated interactions. In this sense, cooperativeness in our framework can be viewed as a trait-like latent belief that evolves as evidence accumulates. Thus, while our model does not include a dedicated trait module that directly modulates learning rates, the belief-updating component of our best-fitting model effectively tracks a dynamic, partner-specific cooperativeness, potentially reflecting a prosocial tendency.

      This asymmetry in belief updating has been observed in prior work (e.g., Siegel et al., 2018, Nature Human Behaviour) and could be captured using a dynamic or belief-weighted learning rate. Models incorporating such mechanisms (e.g., dynamic learning rate models as in Jian Li et al., 2011, Nature Neuroscience) could better account for flexible adjustments in response to surprising behavior, particularly in the social domain.

      We thank the reviewer for the suggestion. Following the comment, we implemented an additional model incorporating a dynamic learning rate based on the magnitude of prediction errors. Specifically, we developed Model 9:  Social reward model with Pearce–Hall learning algorithm (dynamic learning rate), in which participants’ beliefs about their partner’s cooperation probability are updated using a Rescorla–Wagner rule with a learning rate dynamically modulated by the Pearce–Hall (PH) Error Learning mechanism. In this framework, the learning rate increases following surprising outcomes (larger prediction errors) and decreases as expectations become more stable (see Appendix Analysis section for details).

      The results showed that this dynamic learning rate model did not outperform our best-fitting model in either adolescents or adults (see Figure supplement 6). We greatly appreciate the reviewer’s suggestion, which has strengthened the scope of our analysis. We have now added these analyses to the Appendix Analysis section (see Figure supplement 6) and expanded the Discussion to acknowledge this modeling extension and further discuss its implications.
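For readers unfamiliar with this class of model, the following minimal sketch shows a commonly used hybrid of the Rescorla-Wagner and Pearce-Hall rules, in which the belief is updated by the prediction error and the learning rate (associability) tracks the recent magnitude of that error. The exact parameterization of Model 9 is given in the authors' appendix and is not reproduced here; the parameter names `kappa` and `eta` and their default values are assumptions.

```python
def pearce_hall_update(p, alpha, outcome, kappa=0.5, eta=0.3):
    """One belief update with a dynamic learning rate.

    p       : current belief that the partner cooperates (0..1)
    alpha   : current associability, i.e. the dynamic learning rate (0..1)
    outcome : 1.0 if the partner cooperated, 0.0 if the partner defected
    kappa   : fixed scaling of the learning rate (assumed name)
    eta     : weight given to the latest absolute prediction error (assumed name)
    """
    delta = outcome - p                                # prediction error
    p_new = p + kappa * alpha * delta                  # Rescorla-Wagner step
    alpha_new = eta * abs(delta) + (1 - eta) * alpha   # Pearce-Hall associability
    return p_new, alpha_new

# A surprising defection after a strong belief in cooperation raises the learning rate.
p_next, alpha_next = pearce_hall_update(p=0.9, alpha=0.2, outcome=0.0)
```

In this form a large prediction error raises the learning rate used on the next trial, while a run of well-predicted outcomes lets it decay, which matches the qualitative behavior described in the response.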

      Second, the developmental interpretation of the observed effects would be strengthened by considering possible non-linear relationships between age and model parameters. For instance, certain cognitive or affective traits relevant to social learning, such as sensitivity to reciprocity or reward updating, may follow non-monotonic trajectories, peaking in late adolescence or early adulthood. Fitting age as a continuous variable, possibly with quadratic or spline terms, may yield more nuanced developmental insights.

      We thank the reviewer for this professional comment. In addition to the linear analyses, we further conducted exploratory analyses to examine potential non-linear relationships between age and the model parameters. Specifically, we fit linear mixed models (LMMs) with each of the four parameters as the outcome (α+, α−, β, and ω). The fixed effects included age, a quadratic age term, and gender, and the random effects included subject-specific random intercepts and random slopes for age and gender. Model comparison using BIC did not indicate improvement for the quadratic models over the linear models for α+ (ΔBIC, quadratic minus linear = 5.09), α− (ΔBIC = 3.04), β (ΔBIC = 3.9), or ω (ΔBIC = 0). Moreover, the quadratic age term was not significant for α+, α−, or β (all ps > 0.10). For ω, we observed a significant linear age effect (b = 1.41, t = 2.65, p = 0.009) and a significant quadratic age effect (b = −0.03, t = −2.39, p = 0.018; see Author response image 1). This pattern is broadly consistent with the group effect reported in the main text. The shaded area in the figure represents the 95% confidence interval. As shown, the interval widens at older ages (≥ 26 years) due to fewer participants in that range, which limits the robustness of the inferred quadratic effect. In consideration of the limited precision at older ages and the lack of BIC improvement, we did not emphasize the quadratic effect in the revised manuscript and present these results here as exploratory.
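The sign convention of the BIC differences reported above can be illustrated with a short sketch. The log-likelihoods, parameter counts, and sample size below are hypothetical and serve only to show why adding a quadratic term must improve the fit by more than its complexity penalty before BIC favors it.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate the preferred model."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical fits: the quadratic model has one extra parameter and a slightly better fit.
n_obs = 100
bic_linear = bic(log_likelihood=-150.0, n_params=5, n_obs=n_obs)
bic_quadratic = bic(log_likelihood=-149.5, n_params=6, n_obs=n_obs)
delta_bic = bic_quadratic - bic_linear  # positive: the linear model is preferred
```

With these numbers the extra parameter costs ln(100), about 4.6, while the fit improves BIC by only 1.0, so the difference is positive and the linear model is retained, the same direction as the positive differences reported for α+, α−, and β.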

      Author response image 1.

      Linear and quadratic model fits showing the relationship between age and the ω parameter, with 95% confidence intervals.

      Finally, the two age groups compared - adolescents (high school students) and adults (university students) - differ not only in age but also in sociocultural and economic backgrounds. High school students are likely more homogenous in regional background (e.g., Beijing locals), while university students may be drawn from a broader geographic and socioeconomic pool. Additionally, differences in financial independence, family structure (e.g., single-child status), and social network complexity may systematically affect cooperative behavior and valuation of rewards. Although these factors are difficult to control fully, the authors should more explicitly address the extent to which their findings reflect biological development versus social and contextual influences.

      We appreciate this comment. Indeed, adolescents (high school students) and adults (university students) differ not only in age but also in sociocultural and socioeconomic backgrounds. In our study, all participants were recruited from Beijing and surrounding regions, which helps minimize large regional and cultural variability. Moreover, we accounted for individual-level random effects and included participants’ social value orientation (SVO) as an individual difference measure. 

      Nonetheless, we acknowledge that other contextual factors, such as differences in financial independence, socioeconomic status, and social experience, may also contribute to group differences in cooperative behavior and reward valuation. Although our results are broadly consistent with developmental theories of reward sensitivity and social decision-making, sociocultural influences cannot be entirely ruled out. Future work with more demographically matched samples or with socioeconomic and regional variables explicitly controlled will help clarify the relative contributions of biological and contextual factors. Accordingly, we have revised the Discussion to include the following statement: “Third, although both age groups were recruited from Beijing and nearby regions, minimizing major regional and cultural variation, adolescents and adults may still differ in socioeconomic status, financial independence, and social experience. Such contextual differences could interact with developmental processes in shaping cooperative behavior and reward valuation. Future research with demographically matched samples or explicit measures of socioeconomic background will help disentangle biological from sociocultural influences.”

      Reviewer #3 (Public review):

      Summary:

      Wu and colleagues find that in a repeated Prisoner's Dilemma, adolescents, compared to adults, are less likely to increase their cooperation behavior in response to repeated cooperation from a simulated partner. In contrast, after repeated defection by the partner, both age groups show comparable behavior.

      To uncover the mechanisms underlying these patterns, the authors compare eight different models. They report that a social reward learning model, which includes separate learning rates for positive and negative prediction errors, best fits the behavior of both groups. Key parameters in this winning model vary with age: notably, the intrinsic value of cooperating is lower in adolescents. Adults and adolescents also differ in learning rates for positive and negative prediction errors, as well as in the inverse temperature parameter.
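The idea of separate learning rates for positive and negative prediction errors can be sketched as follows. This is a generic asymmetric Rescorla-Wagner update under assumed names and values, not the authors' fitted model.

```python
def update_belief(p, partner_cooperated, alpha_pos, alpha_neg):
    """Update the belief that the partner cooperates using asymmetric learning rates."""
    outcome = 1.0 if partner_cooperated else 0.0
    delta = outcome - p                                # prediction error
    alpha = alpha_pos if delta >= 0 else alpha_neg     # choose the rate by the sign of the error
    return p + alpha * delta

# Hypothetical learner that updates more after cooperation than after defection.
after_cooperation = update_belief(0.5, True, alpha_pos=0.4, alpha_neg=0.1)
after_defection = update_belief(0.5, False, alpha_pos=0.4, alpha_neg=0.1)
```

With these illustrative rates the belief moves further after an observed cooperation than after an observed defection of the same surprise, which is the kind of asymmetry the two learning-rate parameters are meant to capture.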

      Strengths: 

      The modeling results are compelling in their ability to distinguish between learned expectations and the intrinsic value of cooperation. The authors skillfully compare relevant models to demonstrate which mechanisms drive cooperation behavior in the two age groups.

      We thank the reviewer for recognizing our work’s strengths.

      Weaknesses:

      Some of the claims made are not fully supported by the data:

      The central parameter reflecting preference for cooperation is positive in both groups. Thus, framing the results as self-interest versus other-interest may be misleading.

      We thank the reviewer for this insightful comment. In the social reward model, the cooperation preference parameter is positive by definition, as defection in the rPDG always yields a +2 monetary advantage regardless of the partner’s action. This positive value represents the additional subjective reward assigned to mutual cooperation (e.g., reciprocity value) that counterbalances the monetary gain from defection. Although the estimated social reward parameter ω was positive, the effective advantage of cooperation is Δ=p×ω−2. Given participants’ inferred beliefs p, Δ was negative for most trials (p×ω<2), indicating that the social reward was insufficient to offset the +2 advantage of defection. Thus, both adolescents and adults valued cooperation positively, but adolescents’ smaller ω and weaker responsiveness to sustained partner cooperation suggest a stronger weighting on immediate monetary payoffs.
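
      The trade-off described above can be illustrated with a short sketch. It assumes only the relation stated in the response, Δ = p×ω − 2; the function names and the example values of p and ω are hypothetical and are not estimates from the study.

```python
def cooperation_advantage(p: float, omega: float, defection_bonus: float = 2.0) -> float:
    """Effective advantage of cooperating: delta = p * omega - defection_bonus."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return p * omega - defection_bonus


def break_even_belief(omega: float, defection_bonus: float = 2.0) -> float:
    """Belief p at which delta = 0, i.e. cooperation and defection are equally attractive."""
    if omega <= 0:
        raise ValueError("omega must be positive")
    return defection_bonus / omega


# Hypothetical values: a positive omega can still leave delta negative.
for omega in (1.5, 2.5, 4.0):
    delta = cooperation_advantage(p=0.7, omega=omega)
    print(f"omega={omega:.1f}  delta={delta:+.2f}  break-even p={break_even_belief(omega):.2f}")
```

      A break-even belief above 1 means that, under this relation, no expectation of partner cooperation is high enough for the social reward to offset the monetary advantage of defection.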

      In this light, our framing of adolescents as more self-interested derives from their behavioral pattern: even when they recognized sustained partner cooperation and held high expectations of partner cooperation, adolescents showed lower cooperative behavior and reciprocity rewards compared with adults. Whereas adults increased cooperation after two or three consecutive partner cooperations, this pattern was absent among adolescents. We therefore interpret their behavior as relatively more self-interested, reflecting reduced sensitivity to the social reward from mutual cooperation rather than a categorical shift from self-interest to other-interest, as elaborated in the Discussion.

      It is unclear why the authors assume adolescents and adults have the same expectations about the partner's cooperation, yet simultaneously demonstrate age-related differences in learning about the partner. To support their claim mechanistically, simulations showing that differences in cooperation preference (i.e., the w parameter), rather than differences in learning, drive behavioral differences would be helpful.

      We thank the reviewer for raising this important point. In our model, both adolescents and adults updated their beliefs about partner cooperation using an asymmetric reinforcement learning (RL) rule. Although adolescents exhibited a higher positive and a lower negative learning rate than adults, the two groups did not differ significantly in their overall updating of partner cooperation probability (Fig. 4a-b). We then examined the social reward parameter ω, which was significantly smaller in adolescents and determined the intrinsic value of mutual cooperation (i.e., p×ω). This variable differed significantly between groups and closely matched the behavioral pattern.

      Following the reviewer’s suggestion, we conducted additional simulations varying one model parameter at a time while holding the others constant. The difference in mean cooperation probability between adults and adolescents served as the index (positive = higher cooperation in adults). As shown in Author response image 2, decreases in ω most effectively reproduced the observed group difference (shaded area), indicating that age-related differences in cooperation are primarily driven by variation in the social reward parameter ω rather than by the other parameters.

      Author response image 2.

      Simulation results showing how variations in each model parameter affect the group difference in mean cooperation probability (Adults – Adolescents). Based on the best-fitting Model 8 and parameters estimated from all participants, each line represents one parameter (i.e., α+, α-, ω, β) systematically varied within the tested range (α±:0.1–0.9; ω, β:1–9) while other parameters were held constant. Positive values indicate higher cooperation in adults. Smaller ω values most strongly reproduced the observed group difference, suggesting that reduced social reward weighting primarily drives adolescents’ lower cooperation.
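
      The one-parameter-at-a-time logic of this simulation can be sketched as follows. This is a minimal illustration and not the authors’ Model 8 code: the delta-rule belief update, the logistic choice rule, the partner schedule, and the baseline parameter values are all assumptions made for the example.

```python
import math


def mean_cooperation(alpha_pos, alpha_neg, omega, beta, partner_actions, p0=0.5):
    """Mean model-predicted probability of cooperating across trials.

    Assumed model (illustrative only):
      - belief p about partner cooperation is updated with separate learning
        rates for positive and negative prediction errors;
      - the advantage of cooperating is p * omega - 2;
      - choice follows a logistic (two-option softmax) rule with inverse
        temperature beta.
    """
    p = p0
    total = 0.0
    for partner_cooperated in partner_actions:
        delta_u = p * omega - 2.0
        total += 1.0 / (1.0 + math.exp(-beta * delta_u))
        error = (1.0 if partner_cooperated else 0.0) - p
        p += (alpha_pos if error > 0 else alpha_neg) * error
    return total / len(partner_actions)


def sweep(name, values, baseline, partner_actions):
    """Vary one parameter over `values` while holding the others at baseline."""
    results = []
    for value in values:
        params = dict(baseline, **{name: value})
        results.append((value, mean_cooperation(partner_actions=partner_actions, **params)))
    return results


# Hypothetical baseline and partner schedule (not taken from the study).
baseline = {"alpha_pos": 0.4, "alpha_neg": 0.3, "omega": 3.0, "beta": 2.0}
partner = [True] * 20 + [False] * 5 + [True] * 15

for value, coop in sweep("omega", [1, 3, 5, 7, 9], baseline, partner):
    print(f"omega={value}: mean P(cooperate)={coop:.3f}")
```

      Repeating the sweep for each parameter and subtracting the curve obtained at one group’s values from the other’s would give a difference index of the kind plotted in the figure.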

      Two different schedules of 120 trials were used: one with stable partner behavior and one with behavior changing after 20 trials. While results for order effects are reported, the results for the stable vs. changing phases within each schedule are not. Since learning is influenced by reward structure, it is important to test whether key findings hold across both phases.

      We thank the reviewer for this thoughtful and professional comment. In our GLMM and LMM analyses, we focused on trial order rather than explicitly including the stable vs. changing phase factor, due to concerns about multicollinearity. In our design, phases occur in specific temporal segments, which introduces strong collinearity with trial order. In multi-round interactions, order effects also capture variance related to phase transitions. 

      Nonetheless, to directly address this concern, we conducted additional robustness analyses by adding a phase variable (stable vs. changing) to GLMM1, LMM1, and LMM3 alongside the original covariates. Across these specifications, the key findings were replicated (see GLMM<sub>sup</sub>2 and LMM<sub>sup</sub>4–5; Tables 9-11), and the direction and significance of main effects remained unchanged, indicating that our conclusions are robust to phase differences.

      The division of participants at the legal threshold of 18 years should be more explicitly justified. The age distribution appears continuous rather than clearly split. Providing rationale and including continuous analyses would clarify how groupings were determined.

      We thank the reviewer for this thoughtful comment. We divided participants at the legal threshold of 18 years for both conceptual and practical reasons grounded in prior literature and policy. In many countries and regions, 18 marks the age of legal majority and is widely used as the boundary between adolescence and adulthood in behavioral and clinical research. Empirically, prior studies indicate that psychosocial maturity and executive functions approach adult levels around this age, with key cognitive capacities stabilizing in late adolescence (Icenogle et al., 2019; Tervo-Clemmens et al., 2023). We have clarified this rationale in the Introduction section of the revised manuscript.

      “Based on legal criteria for majority and prior empirical work, we adopt 18 years as the boundary between adolescence and adulthood (Icenogle et al., 2019; Tervo-Clemmens et al., 2023).”

      We fully agree that the underlying age distribution is continuous rather than sharply divided. To address this, we conducted additional analyses treating age as a continuous predictor (see GLMM<sub>sup</sub>1 and LMM<sub>sup</sub>1–3; Tables S1-S4), which generally replicated the patterns observed with the categorical grouping. Nevertheless, given the limited age range of our sample, the generalizability of these findings to fine-grained developmental differences remains constrained. Therefore, our primary analyses continue to focus on the contrast between adolescents and adults, rather than attempting to model a full developmental trajectory.

      Claims of null effects (e.g., in the abstract: "adults increased their intrinsic reward for reciprocating... a pattern absent in adolescents") should be supported with appropriate statistics, such as Bayesian regression.

      We thank the reviewer for highlighting the importance of rigor when interpreting potential null effects. To address this concern, we conducted Bayes factor analyses of the intrinsic reward for reciprocity and reported the corresponding BF10 for all relevant post hoc comparisons. This approach quantifies the relative evidence for the alternative versus the null hypothesis, thereby providing a more direct assessment of null effects. The analysis procedure is now described in the Methods and Materials section: 

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      Once claims are more closely aligned with the data, the study will offer a valuable contribution to the field, given its use of relevant models and a well-established paradigm.

      We are grateful for the reviewer’s generous appraisal and insightful comments.

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):

      I commend the authors on a well-structured, clear, and interesting piece of work. I have several questions and recommendations that, if addressed, I believe will strengthen the manuscript.

      We thank the reviewer for commending the organization of our paper.

      Introduction: - Why use a zero-sum (Prisoner's Dilemma; PD) versus a mixed-motive game (e.g. Trust Task) to study cooperation? In a finite set of rounds, the dominant strategy can be to defect in a PD.

      We thank the reviewer for this helpful comment. We agree that both the rationale for using the repeated Prisoner’s Dilemma (rPDG) and the limitations of this framework should be clarified. We chose the rPDG to isolate the core motivational conflict between self-interest and joint welfare, as its symmetric and simultaneous structure avoids the sequential trust dependencies and reputation accumulation inherent to asymmetric tasks such as the Trust Game (King-Casas et al., 2005; Rilling et al., 2002).

      Although a finitely repeated rPDG theoretically favors defection, extensive prior research shows that cooperation can still emerge in long repeated interactions when players rely on learning and reciprocity rather than backward induction (Rilling et al., 2002; Fareri et al., 2015). Our design employed 120 consecutive rounds, allowing participants to update expectations about partner behavior and to establish stable reciprocity patterns over time. We have added the following clarification to the Introduction:

      “The rPDG provides a symmetric and simultaneous framework that isolates the motivational conflict between self-interest and joint welfare, avoiding the sequential trust and reputation dynamics characteristic of asymmetric tasks such as the Trust Game (Rilling et al., 2002; King-Casas et al., 2005)”

      Methods:

      Did the participants know how long the PD would go on for?

      Were the participants informed that the partner was real/simulated?

      Were the participants informed that the partner was going to be the same for all rounds?

      We thank the reviewer for the meticulous review, which helped us present the experimental design and reporting details more clearly. We provide the following clarifications: (i) Participants were not informed of the total number of rounds in the rPDG. This prevented endgame expectations and avoided distraction from counting rounds, which could introduce additional effects. (ii) Participants were told that their partner was another human participant in the laboratory. However, the partner’s behavior was predetermined by a computer program. This design enabled tighter experimental control and ensured consistent conditions across age groups, supporting valid comparisons. (iii) Participants were informed that they would interact with the same partner across all rounds, aligning with the essence of a multi-round interaction paradigm and stabilizing partner-related expectations. For transparency, we have clarified these points in the Methods and Materials section:

      “Participants were told that their partner was another human participant in the laboratory and that they would interact with the same partner across all rounds. However, in reality, the actions of the partner were predetermined by a computer program. This setup allowed for a clear comparison of the behavioral responses between adolescents and adults. Participants were not informed of the total number of rounds in the rPDG.”

      The authors mention that an SVO was also recorded to indicate participant prosociality. Where are the results of this? Did this track game play at all? Could cooperativeness be explained broadly as an SVO preference that penetrated into game-play behaviour?

      We thank the reviewer for pointing this out. We agree that individual differences in prosociality may shape cooperative behavior, so we conducted additional analyses incorporating SVO. Specifically, we extended GLMM1 and LMM3 by adding the measured SVO as a fixed effect with random slopes, yielding GLMM<sub>sup</sub>3 and LMM<sub>sup</sub>6 (Tables 12–13). The results showed that higher SVO was associated with greater cooperation, whereas its effect on the reward for reciprocity was not significant. Importantly, the primary findings remained unchanged after controlling for SVO. These results indicate that cooperativeness in our task cannot be explained solely by a broad SVO preference, although a more prosocial orientation was associated with greater cooperation. We have reported these analyses and results in the Appendix Analysis section.

      Why was AIC chosen rather than BIC to compare model dominance?

      We apologize for the lack of clarity. Both the Akaike Information Criterion (AIC; Akaike, 1974) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are information-theoretic criteria for model comparison, and neither depends on whether the models being compared are nested (Burnham et al., 2002). We have added the following clarification to the Methods.

      “We chose to use the AICc as the metric of goodness-of-fit for model comparison for the following statistical reasons. First, BIC is derived based on the assumption that the “true model” must be one of the models in the limited model set one compares (Burnham et al., 2002; Gelman & Shalizi, 2013), which is unrealistic in our case. In contrast, AIC does not rely on this unrealistic “true model” assumption and instead selects out the model that has the highest predictive power in the model set (Gelman et al., 2014). Second, AIC is also more robust than BIC for finite sample size (Vrieze, 2012).”
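
      For reference, the criteria discussed above have standard closed forms, sketched below. The log-likelihood, number of free parameters k, and number of observations n in the example are hypothetical values chosen for illustration, not fits from the study.

```python
import math


def aic(log_lik: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def aicc(log_lik: float, k: int, n: int) -> float:
    """AIC with the small-sample correction 2k(k+1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical fit: 4 free parameters, 120 trials, log-likelihood of -70.
log_lik, k, n = -70.0, 4, 120
print(f"AIC={aic(log_lik, k):.2f}  AICc={aicc(log_lik, k, n):.2f}  BIC={bic(log_lik, k, n):.2f}")
```

      Lower values indicate a better trade-off between fit and complexity. BIC penalizes each parameter by ln n rather than 2, so it favors simpler models more strongly than AIC once n exceeds about 7.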

      I believe the model fitting procedure might benefit from hierarchical estimation, rather than maximum likelihood methods. Adolescents in particular seem to show multiple outliers in a^+ and w^+ at the lower end of the distributions in Figure S2. There are several packages to allow hierarchical estimation and model comparison in MATLAB (which I believe is the language used for this analysis; see https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007043).

      We thank the reviewer for this helpful comment and for referring us to relevant methodological work (Piray et al., 2019). We have addressed this point by incorporating hierarchical Bayesian estimation, which effectively mitigates outlier effects and improves model identifiability. The results replicated those obtained with MLE fitting and further revealed group-level differences in key parameters. Please see our detailed response to Reviewer #1 Q1 for the full description of this analysis and results.

      Results: Model confusion seems to show that the inequality aversion and social reward models were consistently confused with the baseline model. Is this explained or investigated? I could not find an explanation for this.

      The apparent overlap between the inequality aversion (Model 4) and social reward (Model 5) models in the recovery analysis likely arises because neither model includes a learning mechanism, making them unable to capture trial-by-trial adjustments in this dynamic task. Consequently, both were best fit by the baseline model. Please see Response to Reviewer #1 Q3 for related discussion.

      Figures 3e and 3f show the correlation between asymmetric learning rates and age. It seems that both a^+ and a^- are around 0.35-0.40 for young adolescents, and this becomes more polarised with age. Could it be that with age comes an increasing discernment of positive and negative outcomes on beliefs, and younger ages compress both positive and negative values together? Given the higher stochasticity in younger ages (\beta), it may also be that these values simply represent higher uncertainty over how to act in any given situation within a social context (assuming the differences in groups are true).

      We appreciate this insightful interpretation. Indeed, both α+ and α- cluster around 0.35–0.40 in younger adolescents and become increasingly polarized with age, suggesting that sensitivity to positive versus negative feedback is less differentiated early in development and becomes more distinct over time. This interpretation remains tentative and warrants further validation. Based on this comment, we have revised the Discussion to include this developmental interpretation.

      We also clarify that in our model β denotes the inverse temperature parameter; higher β reflects greater choice precision and value sensitivity, not higher stochasticity. Accordingly, adolescents showed higher β values, indicating more value-based and less exploratory choices, whereas adults displayed relatively greater exploratory cooperation. These group differences were also replicated using hierarchical Bayesian estimation (see Response to Reviewer #1 Q1). In response to this comment, we have added a statement in the Discussion highlighting this developmental interpretation.

      “Together, these findings suggest that the differentiation between positive and negative learning rates changes with age, reflecting more selective feedback sensitivity in development, while higher β values in adolescents indicate greater value sensitivity. This interpretation remains tentative and requires further validation in future research.”
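
      The role of the inverse temperature can be made concrete with a small sketch. It assumes a logistic (two-option softmax) choice rule over the utility difference between cooperating and defecting; the function name and the numerical values are hypothetical.

```python
import math


def p_cooperate(delta_u: float, beta: float) -> float:
    """Probability of cooperating given utility difference delta_u
    (cooperate minus defect) and inverse temperature beta."""
    return 1.0 / (1.0 + math.exp(-beta * delta_u))


# With the same small disadvantage for cooperation, a higher beta makes
# choices follow the value difference more closely (less stochastic).
for beta in (0.0, 1.0, 5.0):
    print(f"beta={beta:.1f}  P(cooperate | delta_u=-0.5) = {p_cooperate(-0.5, beta):.3f}")
```

      At β = 0 choices are random (probability 0.5) regardless of value. As β grows, the option with the higher utility is chosen almost deterministically, which is why a higher β corresponds to lower, not higher, stochasticity.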

      A parameter partial correlation matrix (off-diagonal) would be helpful to understand the relationship between parameters in both adolescents and adults separately. This may provide a good overview of how the model properties may change with age (e.g. a^+'s relation to \beta).

      We thank the reviewer for this helpful comment. We fully agree that a parameter partial correlation matrix can further elucidate the relationships among parameters. Accordingly, we conducted a partial correlation analysis and added the visually presented results to the revised manuscript as Figure 2-figure supplement 4.
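
      As an illustration of what a partial correlation measures, the sketch below computes the first-order case: the correlation between two parameters after removing the linear influence of a third. It is not the analysis code behind the supplementary figure, and the parameter values are toy numbers, not estimates from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y controlling for z (first-order formula)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy per-participant values for three parameters (hypothetical).
alpha_pos = [0.20, 0.35, 0.40, 0.55, 0.60, 0.30]
beta = [5.1, 3.9, 4.2, 2.8, 2.5, 4.6]
omega = [2.0, 3.1, 2.4, 3.6, 2.9, 3.3]

print(f"r(alpha+, beta)         = {pearson(alpha_pos, beta):+.3f}")
print(f"r(alpha+, beta | omega) = {partial_corr(alpha_pos, beta, omega):+.3f}")
```

      With more than three parameters, the same quantity is usually obtained from the inverse of the full correlation matrix, controlling for all remaining parameters at once.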

      It would be helpful to have Bayes Factors reported with each statistical test given that several p-values fall between 0.01 and 0.10.

      We thank the reviewer for this important recommendation. We have conducted Bayes factor analyses and reported BF10 for all relevant post hoc comparisons. We also clarified our analysis in the Methods and Materials section: 

      “Post hoc comparisons were conducted using Bayes factor analyses with MATLAB’s bayesFactor Toolbox (version v3.0, Krekelberg, 2024), with a Cauchy prior scale σ = 0.707.”

      Discussion: I believe the language around ruling out failures in mentalising needs to be toned down. RL models do not enable formal representational differences required to assess mentalising, but they can distinguish biases in value learning, which in itself is interesting. If the authors were to show that more complex 'ToM-like' Bayesian models were beaten by RL models across the board, and this did not differ across adults and adolescents, there would be a stronger case to make this claim. I think the authors either need to include Bayesian models in their comparison, or tone down their language on this point, and/or suggest ways in which this point might be more thoroughly investigated (e.g., using structured models on the same task and running comparisons: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0087619).

      We thank the reviewer for the comments. Please see our response to Reviewer 1 (Appraisal & Discussion section) for details.

      Reviewer #2 (Recommendations for the authors):

      The authors may want to show the winning model earlier (perhaps near the beginning of the Results section, when model parameters are first mentioned).

      We thank the reviewer for this suggestion. We agree that highlighting the winning model early improves clarity. Currently, we have mentioned the winning model before the beginning of the Results section. Specifically, in the penultimate paragraph of the Introduction we state:

      “We identified the asymmetric RL learning model as the winning model that best explained the cooperative decisions of both adolescents and adults.”

      Reviewer #3 (Recommendations for the authors):

      In addition to the points mentioned above, I suggest the following:

      (1) Clarify plots by clearly explaining each variable. In particular, the indices 1 vs. 1,2 vs. 1,2,3 were not immediately understandable.

      We thank the reviewer for this suggestion. We agree that the indices were not immediately clear. We have revised the figure captions (Figures 1 and 4) to define these terms explicitly:

      “The x-axis represents the consistency of the partner’s actions in previous trials (t<sub>−1</sub>: last trial; t<sub>−1,2</sub>: last two trials; t<sub>−1,2,3</sub>: last three trials).”

      It's unclear why the index stops at 3. If this isn't the maximum possible number of consecutive cooperation trials, please consider including all relevant data, as adolescents might show a trend similar to adults over more trials.

      We thank the reviewer for raising this point. In our exploratory analyses, we also examined longer streaks of consecutive partner cooperation or defection (up to four or five trials). Two empirical considerations led us to set the cutoff at three in the final analyses. First, the influence of partner behavior diminished sharply with temporal distance. In both GLMMs and LMMs, coefficients for earlier partner choices were small and unstable, and their inclusion substantially increased model complexity and multicollinearity. This recency pattern is consistent with learning and decision models emphasizing stronger weighting of recent evidence (Fudenberg & Levine, 2014; Fudenberg & Peysakhovich, 2016). Second, streaks longer than three were rare, especially among some participants, leading to data sparsity and inflated uncertainty. Including these sparse conditions risked biasing group estimates rather than clarifying them. Balancing informativeness and stability, we therefore restricted the index to three consecutive partner choices in the main analyses, which we believe sufficiently capture individuals’ general tendencies in reciprocal cooperation.
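
      One way to derive such a streak index from a partner’s choice sequence is sketched below. The coding scheme (run length of identical partner choices immediately before each trial, capped at three) is an assumption made for illustration and may differ from the authors’ exact coding; the function name and the example sequence are hypothetical.

```python
def streak_index(partner_actions, max_streak=3):
    """For each trial t, return (last_action, run_length), where run_length is
    the number of consecutive identical partner choices ending at trial t-1,
    capped at max_streak. The first trial has no history and is labeled None."""
    labels = []
    for t in range(len(partner_actions)):
        if t == 0:
            labels.append(None)
            continue
        last = partner_actions[t - 1]
        run = 1
        # Walk backwards while earlier choices match the most recent one.
        while run < max_streak and t - 1 - run >= 0 and partner_actions[t - 1 - run] == last:
            run += 1
        labels.append((last, run))
    return labels


# "C" = partner cooperated, "D" = partner defected (hypothetical sequence).
print(streak_index(["C", "C", "C", "D", "C"]))
```

      Capping the run length pools all streaks of three or more into one category, which avoids the sparse cells that longer streaks would create.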

      The term "reciprocity" may not be necessary. Since it appears to reflect a general preference for cooperation, it may be clearer to refer to the specific behavior or parameter being measured. This would also avoid confusion, especially since adolescents do show negative reciprocity in response to repeated defection.

      We thank the reviewer for this comment. In our work, we compute the intrinsic reward for reciprocity as p × ω, where p is the partner cooperation expectation and ω is the cooperation preference. In the rPDG, this value framework manifests as a reciprocity-derived reward: sustained mutual cooperation maximizes joint benefits, and the resulting choice pattern reflects a value for reciprocity, contingent on the expected cooperation of the partner. This quantity enters the trade-off between U<sub>cooperation</sub> and U<sub>defection</sub> and captures the participant’s intrinsic reward for reciprocity versus the additional monetary payoff of defection. Therefore, we consider “reciprocity” an appropriate term for this construct.

      Interpretation of parameters should closely reflect what they specifically measure.

      We thank the reviewer for pointing this out. We have refined the relevant interpretations of parameters in the current Results and Discussion sections.

      Prior research has shown links between Theory of Mind (ToM) and cooperation (e.g., Martínez-Velázquez et al., 2024). It would be valuable to test whether this also holds in your dataset.

      We thank the reviewer for this thoughtful comment. Although we did not directly measure participants’ ToM, our design allowed us to estimate participants’ trial-by-trial inferences (i.e., expectations) about their partner’s cooperation probability. We therefore treat these cooperation expectations as an indirect proxy for belief inference, which is related to ToM processes. To test whether this belief-inference component relates to cooperation in our dataset, we further conducted an exploratory analysis (GLMM<sub>sup</sub>4) in which participants’ choices were regressed on their cooperation expectations, group, and the group × cooperation-expectation interaction, controlling for trial number and gender, with random effects. Consistent with the ToM–cooperation link in prior research (Martínez-Velázquez et al., 2024), participants’ expectations about their partner’s cooperation significantly predicted their cooperative behavior (Table 14), suggesting that decisions were shaped by social learning about others’ inferred actions. Moreover, the interaction between group and cooperation expectation was not significant, indicating that this inference-driven social learning process likely operates similarly in adolescents and adults. This aligns with our primary modeling results showing that both age groups update beliefs via an asymmetric learning process. We have reported these analyses in the Appendix Analysis section.

      More informative table captions would help the reader. Please clarify how variables are coded (e.g., is female = 0 or 1? Is adolescent = 0 or 1?), to avoid the need to search across the manuscript for this information.

      We thank the reviewer for raising this point. We have added clear and standardized variable coding in the table notes of all tables to make them more informative and avoid the need to search the paper. We have ensured consistent wording and formatting across all tables.

      I hope these comments are helpful and support the authors in further strengthening their manuscript.

      We thank the three reviewers for their comments, which have been helpful in strengthening this work.

      References

      (1) Fudenberg, D., & Levine, D. K. (2014). Recency, consistent learning, and Nash equilibrium. Proceedings of the National Academy of Sciences of the United States of America, 111(Suppl. 3), 10826–10829. https://doi.org/10.1073/pnas.1400987111

      (2) Fudenberg, D., & Peysakhovich, A. (2016). Recency, records, and recaps: Learning and nonequilibrium behavior in a simple decision problem. ACM Transactions on Economics and Computation, 4(4), Article 23, 1–18. https://doi.org/10.1145/2956581

      (3) Hackel, L., Doll, B., & Amodio, D. (2015). Instrumental learning of traits versus rewards: Dissociable neural correlates and effects on choice. Nature Neuroscience, 18, 1233–1235. https://doi.org/10.1038/nn.4080

      (4) Icenogle, G., Steinberg, L., Duell, N., Chein, J., Chang, L., Chaudhary, N., Di Giunta, L., Dodge, K. A., Fanti, K. A., Lansford, J. E., Oburu, P., Pastorelli, C., Skinner, A. T., Sorbring, E., Tapanya, S., Uribe Tirado, L. M., Alampay, L. P., Al-Hassan, S. M., Takash, H. M. S., & Bacchini, D. (2019). Adolescents’ cognitive capacity reaches adult levels prior to their psychosocial maturity: Evidence for a “maturity gap” in a multinational, cross-sectional sample. Law and Human Behavior, 43(1), 69–85. https://doi.org/10.1037/lhb0000315

      (5) Krekelberg, B. (2024). Matlab Toolbox for Bayes Factor Analysis (v3.0) [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.13744717

      (6) Martínez-Velázquez, E. S., Ponce-Juárez, S. P., Díaz Furlong, A., & Sequeira, H. (2024). Cooperative behavior in adolescents: A contribution of empathy and emotional regulation? Frontiers in Psychology, 15, 1342458. https://doi.org/10.3389/fpsyg.2024.1342458

      (7) Tervo-Clemmens, B., Calabro, F. J., Parr, A. C., et al. (2023). A canonical trajectory of executive function maturation from adolescence to adulthood. Nature Communications, 14, 6922. https://doi.org/10.1038/s41467-023-42540-8

      (8) King-Casas, B., Tomlin, D., Anen, C., Camerer, C. F., Quartz, S. R., & Montague, P. R. (2005). Getting to know you: reputation and trust in a two-person economic exchange. Science, 308(5718), 78-83. https://doi.org/10.1126/science.1108062

      (9) Rilling, J. K., Gutman, D. A., Zeh, T. R., Pagnoni, G., Berns, G. S., & Kilts, C. D. (2002). A neural basis for social cooperation. Neuron, 35(2), 395-405. https://doi.org/10.1016/s0896-6273(02)00755-9

      (10) Fareri, D. S., Chang, L. J., & Delgado, M. R. (2015). Computational substrates of social value in interpersonal collaboration. Journal of Neuroscience, 35(21), 8170-8180. https://doi.org/10.1523/JNEUROSCI.4775-14.2015

      (11) Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705

      (12) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136

      (13) Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer. https://doi.org/10.1007/b97636

      (14) Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66(1), 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x

      (15) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b16018

      (16) Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Psychological Methods, 17(2), 228–243. https://doi.org/10.1037/a0027127

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary: 

      Zhang and colleagues examine neural representations underlying abstract navigation in the entorhinal cortex (EC) and hippocampus (HC) using fMRI. This paper replicates a previously identified hexagonal modulation of abstract navigation vectors in abstract space in EC in a novel task involving navigating in a conceptual Greeble space. In HC, the authors claim to identify a three-fold signal of the navigation angle. They also use a novel analysis technique (spectral analysis) to look at spatial patterns in these two areas and identify phase coupling between HC and EC. Finally, the authors propose an EC-HPC PhaseSync Model to understand how the EC and HC construct cognitive maps. While the wide array of techniques used is impressive and their creativity in analysis is admirable, overall, I found the paper a bit confusing and unconvincing. I recommend a significant rewrite of their paper to motivate their methods and clarify what they actually did and why. The claim of three-fold modulation in HC, while potentially highly interesting to the community, needs more background to motivate why they did the analysis in the first place, more interpretation as to why this would emerge in biology, and more care taken to consider alternative hypotheses steeped in existing models of HC function. I think this paper does have potential to be interesting and impactful, but I would like to see these issues improved first.

      General comments:

      (1) Some of the terminology used does not match the terminology used in previous relevant literature (e.g., sinusoidal analysis, 1D directional domain).

      We thank the reviewer for this valuable suggestion, which helps to improve the consistency of our terminology with previous literature and to reduce potential ambiguity. Accordingly, we have replaced “sinusoidal analysis” with “sinusoidal modulation” (Doeller et al., 2010; Bao et al., 2019; Raithel et al., 2023) and “1D directional domain” with “angular domain of path directions” throughout the manuscript.

      (2) Throughout the paper, novel methods and ideas are introduced without adequate explanation (e.g., the spectral analysis and three-fold periodicity of HC).

      We thank the reviewer for raising this important point. In the revised manuscript, we have substantially extended the Introduction (paragraphs 2–4) to clarify our hypothesis, explicitly explaining why the three primary axes of the hexagonal grid cell code may manifest as vector fields. We have also revised the first paragraph of the “3-fold periodicity in the HPC” section in the Results to clarify the rationale for using spectral analysis. Please refer to our responses to comments 2 and 3 below for details.

      Reviewer #2 (Public review):

      The authors report results from behavioral data, fMRI recordings, and computer simulations during a conceptual navigation task. They report 3-fold symmetry in behavioral and simulated model performance, 3-fold symmetry in hippocampal activity, and 6-fold symmetry in entorhinal activity (all as a function of movement directions in conceptual space). The analyses are thoroughly done, and the results and simulations are very interesting.

      We sincerely thank the reviewer for the positive and encouraging comments on our study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) This paper has quite a few spelling and grammatical mistakes, making it difficult to understand at times.

      We apologize for the wording and grammatical errors. We have carefully re-read and edited the entire manuscript to correct typographical and grammatical errors and to improve clarity and readability.

      (2) Introduction - It's not clear why the three primary axes of hexagonal grid cell code would manifest as vector fields.

      We thank the reviewer for raising this important point. In the revised Introduction (paragraphs 2, 3, and 4), we now explicitly explain the rationale behind our hypothesis that the three primary axes of the hexagonal grid cell code manifest as vector fields.

      In paragraph 2, we present empirical evidence from rodent, bat, and human studies demonstrating that mental simulation of prospective paths relies on vectorial representations in the hippocampus (Sarel et al., 2017; Ormond and O’Keefe, 2022; Muhle-Karbe et al., 2023).

      In paragraphs 3 and 4, we introduce our central hypothesis: vectorial representations may originate from population-level projections of entorhinal grid cell activity, based on three key considerations:

      (1) The EC serves as the major source of hippocampal input (Witter and Amaral, 1991; van Groen et al., 2003; Garcia and Buffalo, 2020).

      (2) Grid codes exhibit nearly invariant spatial orientations (Hafting et al., 2005; Gardner et al., 2022), which makes it plausible that their spatially periodic activity can be detected using fMRI.

      (3) A model-based inference: for example, in the simplest case, when one mentally simulates a straight pathway aligned with the grid orientation, a subpopulation of grid cells would be activated. The resulting population activity would form a near-perfect vectorial representation, with constant activation strength along the path. In contrast, if the simulated path is misaligned with the grid orientation, the population response becomes a distorted vectorial code. Consequently, simulating all possible straight paths spanning 0°–360° results in 3-fold periodicity in the activity patterns—due to the 180° rotational symmetry of the hexagonal grid, orientations separated by 180° are indistinguishable.

      We therefore speculate that vectorial representations embedded in grid cell activity exhibit 3-fold periodicity across spatial orientations and serve as a periodic structure to represent spatial direction. Supporting this view, reorientation paradigms in both rodents and young children have shown that subjects search equally in two opposite directions, reflecting successful orientation encoding but a failure to integrate absolute spatial direction (Hermer and Spelke, 1994; Julian et al., 2015; Gallistel, 2017; Julian et al., 2018).

      (3) It took me a few reads to understand what the spectral analysis was. After understanding, I do think this is quite clever. However, this paper needs more motivation to understand why you are performing this analysis. E.g., why not just take the average regressor at the 10º, 70º, etc. bins and compare it to the average regressor at 40º, 100º bins? What does the Fourier transform buy you?

      We are sorry for the confusion. Below, we outline the rationale for employing Fast Fourier Transform (FFT) analysis to identify neural periodicity. In the revised manuscript, we have added these clarifications to the first paragraph of the “3-fold periodicity in the HPC” subsection in the Results.

      First, FFT serves as an independent approach to cross-validate the sinusoidal modulation results, providing complementary evidence for the 6-fold periodicity in EC and the 3-fold periodicity in HPC.

      Second, FFT enables unbiased detection of multiple candidate periodicities (e.g., 3–7-fold) simultaneously without requiring prior assumptions about spatial phase (orientation). By contrast, directly comparing “aligned” versus “misaligned” angular bins (e.g., 10°/70° vs. 40°/100°) would implicitly assume knowledge of the phase offset, which was not known a priori.

      Finally, FFT uniquely allows periodicity analysis of behavioral performance, which is not feasible with standard sinusoidal GLM approaches. This methodological consistency makes it possible to directly compare periodicities across neural and behavioral domains.
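
      The phase argument in the second point can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes an idealized, noise-free 6-fold tuning curve and hypothetical helper names (`tuning`, `bin_contrast`, `fourier_magnitude`). It compares an aligned-versus-misaligned bin contrast, which needs an assumed phase, with the magnitude of the 6-fold Fourier component, which does not.

```python
import cmath
import math

FOLD = 6  # periodicity under test (6-fold, as reported for the EC)

def tuning(theta_deg, phase_deg):
    """Idealized 6-fold tuning curve peaking at phase_deg (hypothetical data)."""
    return math.cos(math.radians(FOLD * (theta_deg - phase_deg)))

def bin_contrast(true_phase_deg, assumed_phase_deg=10.0):
    """Mean of 'aligned' bins minus mean of 'misaligned' bins for an assumed phase."""
    period = 360.0 / FOLD
    aligned = [assumed_phase_deg + period * k for k in range(FOLD)]   # 10, 70, ...
    misaligned = [a + period / 2 for a in aligned]                    # 40, 100, ...
    mean_aligned = sum(tuning(a, true_phase_deg) for a in aligned) / FOLD
    mean_misaligned = sum(tuning(m, true_phase_deg) for m in misaligned) / FOLD
    return mean_aligned - mean_misaligned

def fourier_magnitude(true_phase_deg, n_bins=36):
    """Magnitude of the 6-fold Fourier component over evenly spaced direction bins."""
    step = 360.0 / n_bins
    values = [tuning(step * i, true_phase_deg) for i in range(n_bins)]
    coef = sum(v * cmath.exp(-2j * math.pi * FOLD * i / n_bins)
               for i, v in enumerate(values))
    return abs(coef) / n_bins

# The contrast collapses when the true phase (25 deg) differs from the assumed one (10 deg),
# whereas the Fourier magnitude is unchanged.
contrast_matched = bin_contrast(true_phase_deg=10.0)
contrast_shifted = bin_contrast(true_phase_deg=25.0)
magnitude_matched = fourier_magnitude(10.0)
magnitude_shifted = fourier_magnitude(25.0)
```

      In this toy case the bin contrast vanishes for a 15° phase mismatch even though the 6-fold signal is fully present, which is the situation the phase-free spectral measure avoids.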

      (4) A more minor point: at one point, you say it’s a spectral analysis of the BOLD signals, but the methods description makes it sound like you estimated regressors at each of the bins before performing FFT. Please clarify. 

      We apologize for the confusion. In our manuscript, we use the term spectral analysis to distinguish this approach from sinusoidal modulation analysis. Conceptually, our spectral analysis involves a three-level procedure:

      (1) First level: We estimated direction-dependent activity maps using a general linear model (GLM), which included 36 regressors corresponding to path directions, down-sampled in 10° increments.

      (2) Second level: We applied a Fast Fourier Transform (FFT) to the direction-dependent activity maps derived from the GLM to examine the spectral magnitude of potential spatial periodicities.

      (3) Third level: We conducted group-level statistical analyses across participants to assess the consistency of the observed periodicities.

      We have revised the “Spectral analysis of MRI BOLD signals” subsection in the Methods to clarify this multi-level procedure.
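
      The three levels above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the manuscript: the first-level GLM estimates are replaced by simulated values (`simulated_betas` is hypothetical), the transform is written as an explicit discrete Fourier sum rather than a library FFT, and the group level is reduced to a simple average across participants.

```python
import cmath
import math

def fold_magnitudes(betas, folds=range(3, 8)):
    """Second level: Fourier magnitude at each candidate k-fold periodicity."""
    n = len(betas)
    mean = sum(betas) / n
    magnitudes = {}
    for k in folds:
        coef = sum((b - mean) * cmath.exp(-2j * math.pi * k * i / n)
                   for i, b in enumerate(betas))
        magnitudes[k] = abs(coef) / n
    return magnitudes

def simulated_betas(phase_deg, fold=3, n_bins=36):
    """Stand-in for first-level estimates: one value per 10-degree direction bin."""
    step = 360.0 / n_bins
    return [1.0 + 0.5 * math.cos(math.radians(fold * (step * i - phase_deg)))
            for i in range(n_bins)]

# Hypothetical participants whose 3-fold modulation differs only in phase.
participant_phases = [0.0, 20.0, 55.0, 90.0]
spectra = [fold_magnitudes(simulated_betas(p)) for p in participant_phases]

# Third level (simplified): average each fold's magnitude across participants.
group_mean = {k: sum(s[k] for s in spectra) / len(spectra) for k in spectra[0]}
peak_fold = max(group_mean, key=group_mean.get)
```

      Because the magnitude discards phase, participants with different preferred orientations contribute to the same spectral peak.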

      (5) Figure 4a:

      Why do the phases go all the way to 2*pi if periodicity is either three-fold or six-fold? 

      When performing correlation between phases, you should perform a circular-circular correlation instead of a Pearson's correlation.

      We thank the reviewer for raising this important point. In the original Figure 4a, both EC and HPC phases spanned 0–2π because their sinusoidal phase estimates were projected into a common angular space by scaling them according to their symmetry factors (i.e., multiplying the 3-fold phase by 3 and the 6-fold phase by 6), followed by taking the modulo 2π. However, this projection forced signals with distinct intrinsic periodicities (120° vs. 60° cycles) into a shared 360° space, thereby distorting their relative angular distances and disrupting the one-to-one correspondence between physical directions and phase values. Consequently, this transformation could bias the estimation of their phase relationship.

      In the revised analysis and Figure 4a, we retained the original phase estimates derived from the sinusoidal modulation within their native periodic ranges (0–120° for 3-fold and 0–60° for 6-fold) by applying modulo operations directly. Following your suggestion, the relationship between EC and HPC phases was then quantified using circular–circular correlation (Jammalamadaka & Sengupta, 2001), as implemented in the CircStat MATLAB toolbox. This updated analysis avoids the rescaling artifact and provides a statistically stronger and conceptually clearer characterization of the phase correspondence between EC and HPC.
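
      For readers unfamiliar with the statistic, the circular–circular correlation coefficient of Jammalamadaka & Sengupta (2001) can be written in a few lines. The sketch below is a plain-Python illustration, not the CircStat implementation; `wrap_phase` and `circ_corrcc` are hypothetical names, the inputs to the correlation are assumed to be angles in radians, and how fold-specific phases are mapped onto the circle before correlation is not specified here.

```python
import math

def wrap_phase(phase_deg, fold):
    """Wrap a phase into its native periodic range, e.g. 0-120 deg for 3-fold."""
    return phase_deg % (360.0 / fold)

def circ_mean(angles):
    """Circular mean direction of angles given in radians."""
    return math.atan2(sum(math.sin(a) for a in angles),
                      sum(math.cos(a) for a in angles))

def circ_corrcc(alpha, beta):
    """Circular-circular correlation between two equally long lists of angles (radians)."""
    a_bar, b_bar = circ_mean(alpha), circ_mean(beta)
    sin_a = [math.sin(a - a_bar) for a in alpha]
    sin_b = [math.sin(b - b_bar) for b in beta]
    numerator = sum(x * y for x, y in zip(sin_a, sin_b))
    denominator = math.sqrt(sum(x * x for x in sin_a) * sum(y * y for y in sin_b))
    return numerator / denominator

# Toy check: a constant phase offset gives a perfect positive correlation,
# and mirrored phases give a perfect negative correlation.
alpha = [0.1, 0.5, 1.0, 1.4, 2.0]
rho_shifted = circ_corrcc(alpha, [a + 0.3 for a in alpha])
rho_mirrored = circ_corrcc(alpha, [-a for a in alpha])
```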

      (6) Figure 4d needs additional clarification:

      Phase-locking is typically used to describe data with a high temporal precision. I understand you adopted an EEG analysis technique to this reconstructed fMRI time-series data, but it should be described differently to avoid confusion. This needs additional control analyses (especially given that 3 is a multiple of 6) to confirm that this result is specific to the periodicities found in the paper.

      We thank the reviewer for this insightful comment. We have extensively revised the description of Figure 4 to avoid confusion with EEG-based phase-locking techniques. The revised text now explicitly clarifies that our approach quantifies spatial-domain periodic coupling across path directions, rather than temporal synchronization of neural signals.

      To further address the reviewer’s concern about potential effects of the integer multiple relationship between the 3-fold HPC and 6-fold EC periodicities, we additionally performed two control analyses using the 9-fold and 12-fold EC components, both of which are also integer multiples of the 3-fold HPC periodicity. Neither control analysis showed significant coupling (p > 0.05), confirming that the observed 3-fold–6-fold coupling was specific and not driven by their harmonic relationship.

      The description of the revised Figure 4 has been updated in the “Phase Synchronization Between HPC and EC Activity” subsection of the Results.

      (7) Figure 5a is misleading. In the text, you say you test for propagation to egocentric cortical areas, but I don’t see any analyses done that test this. This feels more like a possible extension/future direction of your work that may be better placed in the discussion.

      We are sorry for the confusion. Figure 5a was intended as a hypothesis-driven illustration to motivate our analysis of behavioral periodicity based on participants’ task performance. However, we agree with the reviewer that, on its own, Figure 5a could be misleading, as it does not directly present supporting analyses.

      To provide empirical support for the interpretation depicted in Figure 5a, we conducted a whole-brain analysis (Figure S8), which revealed significant 3-fold periodic signals in egocentric cortical regions, including the parietal cortex (PC), precuneus (PCU), and motor regions.

      To avoid potential misinterpretation, we have revised the main text to include these results and explicitly referenced Figure S8 in connection with Figure 5a.

      The updated description in the “3-fold periodicity in human behavior” subsection in the Results is as follows:

      “Considering the reciprocal connectivity between the medial temporal lobe (MTL), where the EC and HPC reside, and the parietal cortex implicated in visuospatial perception and action, together with the observed 3-fold periodicity within the DMN (including the PC and PCu; Fig. S8), we hypothesized that the 3-fold periodic representations of path directions extend beyond the MTL to the egocentric cortical areas, such as the PC, thereby influencing participants' visuospatial task performance (Fig. 5a)”.

      Additionally, Figure 5a has been modified to more clearly highlight the hypothesized link between activity periodicity and behavioral periodicity, rather than suggesting a direct anatomical pathway.

      (8) PhaseSync model: I am not an expert in this type of modeling, so please put a lower weight on this comment (especially compared to some of the other reviewers). While the PhaseSync model seems interesting, it’s not clear from the discussion how this compares to current models. E.g., Does it support them by adding the three-fold HC periodicity? Does it demonstrate that some of them can't be correct because they don't include this three-fold periodicity?

      We thank the reviewer for the insightful comment regarding the PhaseSync model. We agree that further clarifying its relationship to existing computational frameworks is important.

      The EC–HPC PhaseSync model is not intended to replace or contradict existing grid–place cell models of navigation (e.g., Bicanski and Burgess, 2019; Whittington et al., 2020; Edvardsen et al., 2020). Instead, it offers a hierarchical extension by proposing that vectorial representations in the hippocampus emerge from the projections of periodic grid codes in the entorhinal cortex. Specifically, the model suggests that grid cell populations encode integrated path information, forming a vectorial gradient toward goal locations.

      To simplify the theoretical account, our model was implemented in an idealized square layout. In more complex real-world environments, hippocampal 3-fold periodicity may interact with additional spatial variables, such as distance, movement speed, and environmental boundaries.

      We have revised the final two paragraphs of the Discussion to clarify this conceptual framework and emphasize the importance of future studies in exploring how periodic activity in the EC–HPC circuit interacts with environmental features to support navigation.

      Reviewer #2 (Recommendations for the authors):

      (1) Please show a histogram of movement direction sampling for each participant.

      We thank the reviewer for this helpful suggestion. We have added a new supplementary figure (Figure S2) showing histograms of path direction sampling for each participant (36 bins of 10°). The figure is also included. Rayleigh tests for circular uniformity revealed no significant deviations from uniformity (all ps > 0.05, Bonferroni-corrected across participants), confirming that path directions were sampled evenly across 0°–360°.
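
      A sampling check of this kind can be sketched as below. This is an illustration rather than the analysis code behind Figure S2: the bin edges (starting at 0°) and the function names are assumptions, and the p-value uses the standard closed-form approximation for the Rayleigh test.

```python
import math

def direction_histogram(angles_deg, n_bins=36):
    """Count path directions in n_bins equal-width bins covering 0-360 degrees."""
    width = 360.0 / n_bins
    counts = [0] * n_bins
    for a in angles_deg:
        index = min(int((a % 360.0) // width), n_bins - 1)
        counts[index] += 1
    return counts

def rayleigh_test(angles_deg):
    """Rayleigh test for circular uniformity; returns (z statistic, approximate p-value)."""
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    resultant = math.hypot(c, s)          # length of the summed unit vectors
    z = resultant ** 2 / n
    p = math.exp(math.sqrt(1 + 4 * n + 4 * (n ** 2 - resultant ** 2)) - (1 + 2 * n))
    return z, p

# Hypothetical data: evenly sampled directions versus directions clustered near 40-75 deg.
uniform_angles = [5.0 + 10.0 * i for i in range(36)]
clustered_angles = [40.0 + i for i in range(36)]
counts = direction_histogram(uniform_angles)
_, p_uniform = rayleigh_test(uniform_angles)
_, p_clustered = rayleigh_test(clustered_angles)
```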

      (2) Why didn’t you use participants’ original trajectories (instead of the trajectories inferred from the movement start and end points) for the hexadirectional analyses? 

      In our paradigm, participants used two MRI-compatible 2-button response boxes (one for each hand) to adjust the two features of the greebles. As a result, the raw adjustment path contained only four cardinal directions (up, down, left, right). If we were to use the raw stepwise trajectories, the analysis would be restricted to these four directions, which would severely limit the angular resolution. By instead defining direction as the vector from the start to the end position in feature space, we can expand the effective range of directions to the full 0–360°. This approach follows previous literature on abstract grid-like coding in humans (e.g., Constantinescu et al., 2016), where direction was similarly defined by the relative change between two feature dimensions rather than the literal stepwise path. We have added this clarification in the “Sinusoidal modulation” subsection of the revised Methods.
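
      The start-to-end definition of direction amounts to a two-argument arctangent in feature space. A minimal sketch, assuming two-dimensional feature coordinates and 10° bins that start at 0° (both the coordinate convention and the function names are illustrative):

```python
import math

def path_direction(start, end):
    """Direction (degrees, 0-360) of the vector from start to end in feature space."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

def direction_bin(angle_deg, width=10.0):
    """Index of the angular bin containing angle_deg."""
    n_bins = int(360.0 / width)
    return int(angle_deg // width) % n_bins

# A path made of cardinal key presses still has an oblique overall direction.
angle_up_right = path_direction((0.0, 0.0), (1.0, 1.0))      # net move up and to the right
angle_down_left = path_direction((0.0, 0.0), (-1.0, -1.0))   # net move down and to the left
```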

      (3) Legend of Figure 2: the statement "localizing grid cell activity" seems too strong because it is still not clear whether hexadirectional signals indeed result from grid-cell activity (e.g., Bin Khalid et al., eLife, 2024). I would suggest rephrasing this statement (here and elsewhere). 

      Thank you for this helpful suggestion. We have removed the statement “localizing grid cell activity” to avoid ambiguity and revised the legend of Figure 2a to more explicitly highlight its main purpose—defining how path directions and the aligned/misaligned conditions were constructed in the 6-fold modulation. We have also modified similar expressions throughout the manuscript to ensure consistency and clarity.

      (4) Legend of Figure 2: “cluster-based SVC correction for multiple comparisons” - what is the small volume you are using for the correction? Bilateral EC?

      For both Figure 2 and Figure 3, the anatomical mask of the bilateral medial temporal lobe (MTL), as defined by the AAL atlas, was used as the small volume for correction. This has been clarified in the revised Statistical Analysis section of the Methods as “… with small-volume correction (SVC) applied within the bilateral MTL”.

      (5) Legend of Figure 2: "ROI-based analysis" - what kind of ROI are you using? "corrected for multiple comparisons" - which comparisons are you referring to? Different symmetries and also the right/left hemisphere?

      In Figure 2b, the ROI was defined as a functional mask derived from the significant activation cluster in the right entorhinal cortex (EC). Since no robust clusters were observed in the left EC, the functional ROI was restricted to the right hemisphere. We indeed included Figure 2c to illustrate this point; however, we recognize that our description in the text was not sufficiently clear.

      Regarding the correction for multiple comparisons, this refers specifically to the comparisons across different rotational symmetries (3-, 4-, 5-, 6-, and 7-fold). Only the 6-fold symmetry survived correction, whereas no significant effects were detected for the other symmetries.

      We have clarified these points in the “6-fold periodicity in the EC” subsection of the Results as “… The ROI was defined as a functional mask of the right EC identified in the voxel-based analysis and further restricted within the anatomical EC. These analyses revealed significant periodic modulation only at 6-fold (Figure 2c; t(32) = 3.56, p = 0.006, two-tailed, corrected for multiple comparisons across rotational symmetries; Cohen’s d = 0.62) …”.

      We have also revised the “3-fold periodicity in the HPC” subsection of the Results as “… ROI analysis, using a functional mask of the HPC identified in the spectral analysis and further restricted within the anatomical HPC, indicated that HPC activity selectively fluctuated at 3-fold periodicity (Figure 3e; t(32) = 3.94, p = 0.002, corrected for multiple comparisons across rotational symmetries; Cohen’s d = 0.70) …”.

      (6) Figure 2d: Did you rotationally align 0° across participants? Please state explicitly whether (or not) 0° aligns with the x-axis in Greeble space.

      We thank the reviewer for this helpful question. Yes, before reconstructing the directional tuning curve in Figure 2d, path directions were rotationally aligned for each participant by subtracting the participant-specific grid orientation (ϕ) estimated from the independent dataset (odd sessions). We have now made this description explicit in the revised manuscript in the “6-fold periodicity in the EC” subsection of the Results, stating “… To account for individual difference in spatial phase, path directions were calibrated by subtracting the participant-specific grid orientation estimated from the odd sessions ...”.

      (7) Clustering of grid orientations in 30 participants: What does “Bonferroni corrected” refer to? Also, the Rayleigh test is sensitive to the number of voxels - do you obtain the same results when using pair-wise phase consistency? 

      “Bonferroni corrected” here refers to correction across participants. We have clarified this in the first paragraph of the “6-fold periodicity in the EC” subsection of the Results and in the legend of Supplementary Figure S5 as “Bonferroni-corrected across participants.”

      To examine whether our findings were sensitive to the number of voxels, we followed the reviewer’s guidance and computed pairwise phase consistency (PPC; Vinck et al., 2010) for each participant. The PPC results replicated those obtained with the Rayleigh test. We have added the new results to Supplementary Figure S5. We also updated the “Statistical Analysis” subsection of the Methods to describe PPC as “For the PPC (Vinck et al., 2010), significance was tested using 5,000 permutations of uniformly distributed random phases (0–2π) to generate a null distribution for comparison with the observed PPC”.
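
      The PPC and its permutation test can be sketched as follows. This is an illustration, not the code used for Figure S5: it assumes the voxel-wise phases have already been expressed in radians on the full circle, uses the closed-form equivalent of the average pairwise cosine, and the function names and random seed are arbitrary.

```python
import math
import random

def ppc(phases):
    """Pairwise phase consistency: mean cosine of all pairwise phase differences."""
    n = len(phases)
    c = sum(math.cos(p) for p in phases)
    s = sum(math.sin(p) for p in phases)
    # |sum of unit vectors|^2 = n + sum over ordered pairs (j != k) of cos(theta_j - theta_k)
    return (c * c + s * s - n) / (n * (n - 1))

def ppc_permutation_p(phases, n_perm=5000, seed=0):
    """Compare the observed PPC with a null built from uniformly random phases."""
    rng = random.Random(seed)
    observed = ppc(phases)
    n = len(phases)
    exceed = 0
    for _ in range(n_perm):
        null_phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
        if ppc(null_phases) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical voxel phases clustered within about one radian.
clustered = [1.0 + 0.05 * i for i in range(20)]
observed_ppc, p_value = ppc_permutation_p(clustered, n_perm=1000)
```

      Unlike the Rayleigh statistic, the expected value of the PPC does not grow with the number of phases entered, which is why it serves as a check on voxel-count sensitivity.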

      (8) 6-fold periodicity in the EC: Do you compute an average grid orientation across all EC voxels, or do you compute voxel-specific grid orientations?

      Following the protocol originally described by Doeller et al. (2010), we estimated voxel-wise grid orientations within the EC and then obtained a participant-specific orientation by averaging across voxels within a hand-drawn bilateral EC mask. The procedure is described in detail in the “Sinusoidal modulation” subsection of the Methods.

      (9) Hand-drawn bilateral EC mask: What was your procedure for drawing this mask? What results do you get with a standard mask, for example, from Freesurfer or SPM? Why do you perform this analysis bilaterally, given that the earlier analysis identified 6-fold symmetry only in the right EC? What do you mean by "permutation corrected for multiple comparisons"?

      We thank the reviewer for raising these important methodological points. To our knowledge, no standard volumetric atlas provides an anatomically defined entorhinal cortex (EC) mask. For example, the built-in Harvard–Oxford cortical structural atlas in FSL contains only a parahippocampal region that encompasses, but does not isolate, the EC. The AAL atlas likewise does not contain an EC region. In FreeSurfer, an EC label is available, but only in the fsaverage surface space, which is not directly compatible with MNI-based volumetric group-level analyses.

      Therefore, we constructed a bilateral EC mask by manually delineating the EC according to the detailed anatomical landmarks described by Insausti et al. (1998). Masks were created using ITK-SNAP (Version 3.8, www.itksnap.org). For transparency and reproducibility, the mask has been made publicly available at the Science Data Bank (link: https://www.scidb.cn/s/NBriAn), as indicated in the revised Data and Code availability section.

      We used a bilateral EC mask, despite voxel-wise effects being strongest in the right EC, for two reasons. First, we did not have any a priori hypothesis regarding the laterality of EC involvement before performing the analyses. Second, previous studies estimated grid orientation using a bilateral EC mask in their sinusoidal analyses (Doeller et al., 2010; Constantinescu et al., 2016; Bao et al., 2019; Wagner et al., 2023; Raithel et al., 2023). We therefore followed this established approach to estimate grid orientation.

      By “permutation corrected for multiple comparisons” we refer to the family-wise error correction applied to the reconstructed directional tuning curves (Figure 2d for the EC, Figure 3f for the HPC). Specifically, directional labels were randomly shuffled 5,000 times, and an FFT was applied to each shuffled dataset to compute spectral power at each fold. This procedure generated null distributions of spectral power for each symmetry. For each fold, the 95th percentile of the maximal power across permutations was used as the uncorrected threshold. To correct across folds, the 95th percentile of the maximal suprathreshold power across all symmetries was taken as the family-wise error–corrected threshold. We have clarified this procedure in the revised “Statistical Analysis” subsection of the Methods.
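
      One common reading of this procedure is a max-statistic permutation test. The sketch below follows that reading and should be taken as an assumption rather than the exact implementation in the manuscript: the tuning curve is simulated, the percentile is a simple order statistic, and the function names are hypothetical.

```python
import cmath
import math
import random

FOLDS = (3, 4, 5, 6, 7)

def fold_power(values, fold):
    """Spectral power of the k-fold Fourier component of a directional tuning curve."""
    n = len(values)
    mean = sum(values) / n
    coef = sum((v - mean) * cmath.exp(-2j * math.pi * fold * i / n)
               for i, v in enumerate(values))
    return (abs(coef) / n) ** 2

def percentile_95(samples):
    """Simple 95th-percentile order statistic."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

def permutation_thresholds(values, n_perm=5000, seed=0):
    """Shuffle direction labels; return per-fold and family-wise (max across folds) thresholds."""
    rng = random.Random(seed)
    per_fold = {k: [] for k in FOLDS}
    max_across_folds = []
    shuffled = list(values)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        powers = {k: fold_power(shuffled, k) for k in FOLDS}
        for k in FOLDS:
            per_fold[k].append(powers[k])
        max_across_folds.append(max(powers.values()))
    uncorrected = {k: percentile_95(per_fold[k]) for k in FOLDS}
    return uncorrected, percentile_95(max_across_folds)

# Hypothetical tuning curve: 3-fold modulation plus Gaussian noise over 36 direction bins.
noise = random.Random(1)
tuning_curve = [1.0 + 0.5 * math.cos(math.radians(3 * 10 * i)) + noise.gauss(0.0, 0.1)
                for i in range(36)]
observed_power = fold_power(tuning_curve, 3)
uncorrected, fwe_threshold = permutation_thresholds(tuning_curve, n_perm=1000)
```

      Taking the maximum across folds within each permutation is what makes the resulting threshold control the error rate over all tested symmetries at once.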

      (10) Figures 3b and 3d: Why do different hippocampal voxels show significance for the sinusoidal versus spectral analysis? Shouldn’t the analyses be redundant and, thus, identify the same significant voxels? 

      We thank the reviewer for this insightful question. Although both sinusoidal modulation and spectral analysis aim to detect periodic neural activity, the two approaches are methodologically distinct and are therefore not expected to identify exactly the same significant voxels.

      Sinusoidal modulation relies on a GLM with sine and cosine regressors to test for phase-aligned periodicity (e.g., 3-fold or 6-fold), calibrated according to the estimated grid orientation. This approach is highly specific but critically depends on accurate orientation estimation. In contrast, spectral analysis applies Fourier decomposition to the directional tuning profile, enabling the detection of periodic components without requiring orientation calibration.

      Accordingly, the two analyses are not redundant but complementary. The FFT approach allows for an unbiased exploration of multiple candidate periodicities (e.g., 3–7-fold) without predefined assumptions, thereby providing a critical cross-validation of the sinusoidal GLM results. This strengthens the evidence for 6-fold periodicity in EC and 3-fold periodicity in HPC. Furthermore, FFT uniquely facilitates the analysis of periodicities in behavioral performance data, which is not feasible with standard sinusoidal GLM approaches. This methodological consistency enables direct comparison of periodicities across neural and behavioral domains.

      Additionally, the anatomical distributions of the HPC clusters appear more similar between Figure 3b and Figure 3d after re-plotting Figure 3d using the peak voxel coordinates (x = –24, y = –18), which are closer to those used for Figure 3b (x = –24, y = –20), as shown in the revised Figure 3.

      Taken together, the two analyses serve distinct but complementary purposes.
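
      The orientation-dependence of the sinusoidal approach can be made concrete with a short sketch. It is a simplified stand-in for the GLM described above, not the authors' code: it assumes evenly spaced direction bins, so that the sine and cosine weights can be obtained by projection instead of a full regression, and the data and function names are hypothetical.

```python
import math

def estimate_orientation(betas, directions_deg, fold):
    """Estimate the preferred orientation and amplitude of a k-fold modulation.

    Assumes evenly spaced directions, where projecting onto cos/sin regressors
    is equivalent to a least-squares fit.
    """
    n = len(betas)
    mean = sum(betas) / n
    b_cos = 2.0 / n * sum((b - mean) * math.cos(math.radians(fold * d))
                          for b, d in zip(betas, directions_deg))
    b_sin = 2.0 / n * sum((b - mean) * math.sin(math.radians(fold * d))
                          for b, d in zip(betas, directions_deg))
    orientation = (math.degrees(math.atan2(b_sin, b_cos)) / fold) % (360.0 / fold)
    return orientation, math.hypot(b_cos, b_sin)

def aligned_regressor(directions_deg, orientation_deg, fold):
    """Regressor for held-out data: cos(k * (direction - estimated orientation))."""
    return [math.cos(math.radians(fold * (d - orientation_deg))) for d in directions_deg]

# Hypothetical 6-fold signal with a true orientation of 17 degrees.
directions = [10.0 * i for i in range(36)]
betas = [2.0 + 0.8 * math.cos(math.radians(6 * (d - 17.0))) for d in directions]
orientation, amplitude = estimate_orientation(betas, directions, fold=6)
regressor = aligned_regressor(directions, orientation, fold=6)
```

      The aligned regressor is only as good as the orientation estimate it is built from, which is the dependence that the spectral analysis sidesteps.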

      (11) 3-fold sinusoidal analysis in hippocampus: What kind of small volume are you using to correct for multiple comparisons?

      We thank the reviewer for this comment. The same small-volume correction procedure was applied as described in R4. Specifically, the anatomical mask of the bilateral medial temporal lobe (MTL), as defined by the AAL atlas, was used as the small volume for correction. This procedure has been clarified in the revised Statistical Analysis section of the Methods as follows: “… with small-volume correction (SVC) applied within the bilateral MTL.”

      (12) Figure S5: “right HPC” – isn’t the cluster in the left hippocampus? 

      We are sorry for the confusion. The brain image was presented in radiological orientation (i.e., left and right are flipped). We also checked the figure and confirmed that the cluster shown in the original Figure S5 (i.e., Figure S6 in the revised manuscript) is correctly labeled as the right hippocampus, as indicated by the MNI coordinate (x = 22), where positive x values denote the right hemisphere. To avoid potential confusion, we have explicitly added the statement “Volumetric results are displayed in radiological orientation” to the figure legends of all volume-based results.

      (13) Figure S5: Why are the significant voxels different from the 3-fold symmetry analysis using 10° bins?

      As shown in R10, the apparent differences largely reflect variation in MNI coordinates. After adjusting for display coordinates, the anatomical locations of the significant clusters are in fact highly similar between the 10°-binned (Figure 3d, shown above) and the 20°-binned results (Figure S6).

      Although both analyses rely on sinusoidal modulation, they differ in the resolution of the input angular bins (10° vs. 20°). Combined with the inherent noise in fMRI data, this makes it unlikely that the two approaches would yield exactly the same set of significant voxels. Importantly, both analyses consistently reveal robust 3-fold periodicity in the hippocampus, indicating that the observed effect is not dependent on angular bin size.

      (14) Figure 4a and corresponding text: What is the unit? Phase at which frequency? Are you using a circular-circular correlation to test for the relationship?

      We thank the reviewer for raising this important point. In the revised manuscript, we have clarified that the unit of the phase values is radians, corresponding to the 6-fold periodic component in the EC and the 3-fold periodic component in the HPC. In the original Figure 4a, both EC and HPC phases, estimated from sinusoidal modulation, were analyzed using Pearson correlation. We have since recognized the problems with this approach, as also noted in R5 to Reviewer #1.

      In the revised analysis and Figure 4a (as shown above), we re-evaluated the relationship between EC and HPC phases using a circular–circular correlation (Jammalamadaka & Sengupta, 2001), implemented in the CircStat MATLAB toolbox. The “Phase synchronization between the HPC and EC activity” subsection of the Results has been updated accordingly as follows:

      “To examine whether the spatial phase structure in one region could predict that in another, we tested whether the orientations of the 6-fold EC and 3-fold HPC periodic activities, estimated from odd-numbered sessions using sinusoidal modulation with rotationally symmetric parameters (in radians), were correlated across participants. A cross-participant circular–circular correlation was conducted between the spatial phases of the two areas to quantify the spatial correspondence of their activity patterns (EC: purple dots; HPC: green dots) (Jammalamadaka & Sengupta, 2001). The analysis revealed a significant circular correlation (Figure 4a; r = 0.42, p < 0.001) …”.

      In the “Statistical analysis” subsection of the Methods:

      “… The relationship between EC and HPC phases was evaluated using the circular–circular correlation (Jammalamadaka & Sengupta, 2001) implemented in the CircStat MATLAB toolbox …”.

      (15) Paragraph following “We further examined amplitude-phase coupling...” - please clarify what data goes into this analysis.

      We thank the reviewer for this helpful comment. In this analysis, the input data consisted of hippocampal (HPC) phase and entorhinal (EC) amplitude, both extracted using the Hilbert transform from the reconstructed BOLD signals of the EC and HPC derived through sinusoidal modulation. We have substantially revised the description of the amplitude–phase coupling analysis in the third paragraph of the “Phase Synchronization Between HPC and EC Activity” subsection of the Results to clarify this procedure.

      (16) Alignment between EC 6-fold phases and HC 3-fold phases: Why don't you simply test whether the preferred 6-fold orientations in EC are similar to the preferred 3-fold phases in HC? The phase-amplitude coupling analyses seem sophisticated but are complex, so it is somewhat difficult to judge to what extent they are correct. 

      We thank the reviewer for this thoughtful comment. We employed two complementary analyses to examine the relationship between EC and HPC activity. In the revised Figure 4 (as shown in Figure 4 for Reviewer #1), Figure 4a provides a direct and intuitive measure of the phase relationship between the two regions using circular–circular correlation. Figure 4b–c examines whether the activity peaks of the two regions are aligned across path directions using cross-frequency amplitude–phase coupling, given our hypothesis that the spatial phase of the HPC depends on EC projections. These two analyses are complementary: a phase correlation does not necessarily imply peak-to-peak alignment, and conversely, peak alignment does not always yield a statistically significant phase correlation. We therefore combined multiple analytical approaches as a cross-validation across methods, providing convergent evidence for robust EC–HPC coupling.

      (17) Figure 5: Do these results hold when you estimate performance just based on “deviation from the goal to ending locations” (without taking path length into account)? 

      We thank the reviewer for this thoughtful suggestion. Following the reviewer’s advice, we re-estimated behavioral performance using the deviation between the goal and ending locations (i.e., error size) and path length independently. As shown in the new Figure S9, no significant periodicity was observed in error size (p > 0.05), whereas a robust 3-fold periodicity was found for path length (p < 0.05, corrected for multiple comparisons).

      We employed two behavioral metrics, (1) path length and (2) error size, for complementary reasons. In our task, participants navigated using four discrete keys corresponding to the cardinal directions (north, south, east, and west). This design inherently induces a 4-fold bias in path directions, as described in the “Behavioral performance” subsection of the Methods. To minimize this artifact, we computed the objectively optimal path length and used it to calibrate participants’ path lengths. However, error size could not be corrected in the same manner and retained a residual 4-fold tendency (see Figure S9d).

      Given that both path length and error size are behaviorally relevant and capture distinct aspects of task performance, we decided to retain both measures when quantifying behavioral periodicity. This clarification has been incorporated into the “Behavioral performance” subsection of the Methods, and the second paragraph of the “3-fold periodicity in human behavior” subsection of the Results.

      (18) Phase locking between behavioral performance and hippocampal activity: What is your way of creating surrogates here?

      We thank the reviewer for this helpful question. Surrogate datasets were generated by circularly shifting the signal series along the direction axis across all possible offsets (following Canolty et al., 2006). This procedure preserves the internal phase structure within each domain while disrupting consistent phase alignment, thereby removing any systematic coupling between the two signals. Each surrogate dataset underwent identical filtering and coherence computation to generate a null distribution, and the observed coherence strength was compared with this distribution using paired t-tests across participants. The statistical analysis section has been systematically revised to incorporate these methodological details.
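
      The circular-shift surrogate procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the toy signals are invented, and absolute Pearson correlation is used as a stand-in for the filtered coherence measure named in the response.

```python
import numpy as np

def coupling_strength(x, y):
    """Stand-in coupling statistic (absolute Pearson correlation).
    Assumption: the response describes a coherence measure instead."""
    return abs(np.corrcoef(x, y)[0, 1])

def circular_shift_null(x, y, statistic=coupling_strength):
    """Null distribution built from every non-zero circular shift of y
    along the direction axis; x is left untouched."""
    n = len(y)
    return np.array([statistic(x, np.roll(y, k)) for k in range(1, n)])

# Toy data (hypothetical): two series sampled at 36 path directions.
directions = np.deg2rad(np.arange(0, 360, 10))
behavior = np.cos(3 * directions)
hpc = np.cos(3 * directions) + 0.1 * np.sin(5 * directions)

observed = coupling_strength(behavior, hpc)
null = circular_shift_null(behavior, hpc)
print(f"observed={observed:.3f}, null mean={null.mean():.3f}")
```

      In a group analysis, the observed value and the mean of the null distribution for each participant would then be the paired quantities entering the across-participant test.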

      (19) I could not follow why the authors equate 3-fold symmetry with vectorial representations. This includes statements such as “these empirical findings provide a potential explanation for the formation of vectorial representation observed in the HPC.” Please clarify.

      We thank the reviewer for raising this point. Please refer to our response to R2 for Reviewer #1 and the revised Introduction (paragraphs 2–4), where we explicitly explain why the three primary axes of the hexagonal grid cell code can manifest as vector fields.

      (20) It was unclear whether the sentence “The EC provides a foundation for the formation of periodic representations in the HPC” is based on the authors’ observations or on other findings. If based on the authors’ findings, this statement seems too strong, given that no other studies have reported periodic representations in the hippocampus to date (to the best of my knowledge).

      We thank the reviewer for this comment. We agree that the original wording lacked sufficient rigor. We have extensively revised the 3rd paragraph of the Discussion section with more cautious language by reducing overinterpretation and emphasizing the consistency of our findings with prior empirical evidence, as follows: “The EC–HPC PhaseSync model demonstrates how a vectorial representation may emerge in the HPC from the projections of populations of periodic grid codes in the EC. The model was motivated by two observations. First, the EC intrinsically serves as the major source of hippocampal input (Witter and Amaral, 1991; van Groen et al., 2003; Garcia and Buffalo, 2020), and grid codes exhibit nearly invariant spatial orientations (Hafting et al., 2005; Gardner et al., 2022). Second, mental planning, characterized by “forward replay” (Dragoi and Tonegawa, 2011; Pfeiffer, 2020), has the capacity to activate populations of grid cells that represent sequential experiences in the absence of actual physical movement (Nyberg et al., 2022). We hypothesize that an integrated path code of sequential experiences may eventually be generated in the HPC, providing a vectorial gradient toward the goal location. The path code exhibits regular, vector-like representations when the path direction aligns with the orientations of grid axes, and becomes irregular when they misalign. This explanation is consistent with the band-like representations observed in the dorsomedial EC (Krupic et al., 2012) and the irregular activity fields of trace cells in the HPC (Poulter et al., 2021). ”

    1. Author response:

      The following is the authors’ response to the original reviews

      A point-by-point response is included below. Before we turn to that, we want to note one change that we decided to introduce, related to generalization to unseen tissues/cell types (Figure 3a in the original submission and a related question by Reviewer #2 below). This analysis was based on adding a latent “RBP state” representation during learning of condition/tissue-specific splicing. The “RBP state” per condition is captured by a dedicated encoder. Our original plan was to have a paper describing a new RBP-AE model we developed in parallel, which also served as the base to capture this “RBP state”. However, we got delayed in getting this second paper finalized (it was led by other lab members, some of whom have already left the lab). This delay affected the TrASPr manuscript, as TrASPr’s code should be available and its analysis reproducible upon publication. After much deliberation, we decided that, in order to comply with reproducibility standards while not self-scooping the RBP-AE paper, we would take out the RBP-AE and replace it with a vanilla PCA-based embedding for the “RBP state”. The PCA approach is simpler and reproducible, based on a linear transformation of the RBP expression vector into a lower dimension. The qualitative results included in Figure 3a still hold, and we also produced the new results suggested by Reviewer #2 in other GTEx tissues with this PCA-based embedding (below).
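
      As an illustration of what a PCA-based embedding of an RBP expression vector involves, here is a minimal sketch. The matrix sizes, the number of components, and all function names are assumptions made for the example; the response does not specify them.

```python
import numpy as np

def pca_embed(expression, n_components):
    """Project a (conditions x RBPs) expression matrix onto its top
    principal components. Returns the embedding, the per-RBP mean,
    and the component matrix so new conditions can be mapped later."""
    mean = expression.mean(axis=0)
    centered = expression - mean
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

def embed_new_condition(vector, mean, components):
    """Apply the same linear map to an unseen condition's expression vector."""
    return (vector - mean) @ components.T

rng = np.random.default_rng(0)
rbp_expression = rng.normal(size=(6, 50))  # hypothetical: 6 tissues x 50 RBPs
rbp_state, mean, components = pca_embed(rbp_expression, n_components=3)
unseen_state = embed_new_condition(rng.normal(size=50), mean, components)
print(rbp_state.shape, unseen_state.shape)
```

      Because the map is linear and fitted once, an unseen tissue or cell line only needs its RBP expression vector to obtain a state representation.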

      We don’t believe the switch to PCA based embedding should have any bearing on the current manuscript evaluation but wanted to take this opportunity to explain the reasoning behind this additional change.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors propose a transformer-based model for the prediction of condition- or tissue-specific alternative splicing and demonstrate its utility in the design of RNAs with desired splicing outcomes, which is a novel application. The model is compared to relevant existing approaches (Pangolin and SpliceAI) and the authors clearly demonstrate its advantage. Overall, a compelling method that is well thought out and evaluated.

      Strengths:

      (1) The model is well thought out: rather than modeling a cassette exon using a single generic deep learning model as has been done e.g. in SpliceAI and related work, the authors propose a modular architecture that focuses on different regions around a potential exon skipping event, which enables the model to learn representations that are specific to those regions. Because each component in the model focuses on a fixed length short sequence segment, the model can learn position-specific features. Another difference compared to Pangolin and SpliceAI which are focused on modeling individual splice junctions is the focus on modeling a complete alternative splicing event.

      (2) The model is evaluated in a rigorous way - it is compared to the most relevant state-of-the-art models, uses machine learning best practices, and an ablation study demonstrates the contribution of each component of the architecture.

      (3) Experimental work supports the computational predictions.     

      (4) The authors use their model for sequence design to optimize splicing outcomes, which is a novel application.

      We wholeheartedly thank Reviewer #1 for these positive comments regarding the modeling approach we took to this task and the evaluations we performed. We have put a lot of work and thought into this and it is gratifying to see the results of that work acknowledged like this.

      Weaknesses:

      No weaknesses were identified by this reviewer, but I have the following comments:

      (1) I would be curious to see evidence that the model is learning position-specific representations.

      This is an excellent suggestion to further assess what the model is learning. To get a better sense of the position-specific representation we performed the following analyses:

      (1) Switching the transformers’ relative order: All transformers are pretrained on 3’ and 5’ splice site regions before fine-tuning for the PSI and dPSI prediction task. We hypothesized that if relative position is important, switching the order of the transformers would make a large difference in prediction accuracy. Indeed, if we switch the 3’ and 5’ transformers, we see, as expected, a severe drop in performance, with Pearson correlation on test data dropping from 0.82 to 0.11. Next, we switched the two 5’ and 3’ transformers, observing a drop to 0.65 and 0.78, respectively. When focusing only on changing events, the drop was from 0.66 to 0.54 (for 3’ SS transformers), 0.48 (for 5’ SS transformers), and 0.13 (when the 3’ and 5’ transformers flanking the alternative exon were switched).

      (2) Position-specific effect of RBPs: We wanted to test whether the model is able to learn position-specific effects for RBPs. For this we focused on two RBPs, FOX (a family of three highly related RBPs) and QKI, both of which have a relatively well-defined motif and known condition- and position-specific effects identified via RBP KD experiments combined with CLIP experiments (e.g. PMID: 23525800, PMID: 24637117, PMID: 32728246). For each, we randomly selected 40 highly and 40 lowly included cassette exon sequences. We then ran in-silico mutagenesis experiments where we replaced small windows of sequence with the RBP motifs (80 for RBFOX and 80 for QKI), then compared TrASPr’s predictions to the average predictions for 5 random sequences inserted in the same location. The results of this are now shown in Figure 4 Supp 3, where the y-axis represents the dPSI effect per position (x-axis), and the color represents the percentile of observed effects over inserting motifs in that position across all 80 sequences tested. We see that both RBPs have strong positional preferences for exerting a strong effect on the alternative exon. We also see differences between binding upstream and downstream of the alternative exon. These results, learned by the model from natural tissue-specific variations, recapitulate nicely the results derived from high-throughput experimental assays. However, we also note that effects were highly sequence-specific. For example, RBFOX is generally expected to increase inclusion when binding downstream of the alternative exon and decrease inclusion when binding upstream. While we do observe such a trend, we also see cases where the opposite effects are observed. These sequence-specific effects have been reported in the literature but may also represent cases where the model errs in the effect’s direction. We discuss these new results in the revised text.
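
      The motif-insertion scan described in (2) can be sketched as below. Everything here is illustrative: `toy_predict` is a hypothetical stand-in for a splicing predictor (it is not TrASPr), the function names are invented, and GCATG is used only as an example motif.

```python
import random

def insert_at(sequence, position, insert):
    """Replace a window of len(insert) bases starting at position."""
    return sequence[:position] + insert + sequence[position + len(insert):]

def motif_effect_by_position(sequence, motif, predict, n_random=5, seed=0):
    """For each position: prediction with the motif inserted, minus the mean
    prediction over n_random random inserts of the same length."""
    rng = random.Random(seed)
    effects = []
    for pos in range(len(sequence) - len(motif) + 1):
        with_motif = predict(insert_at(sequence, pos, motif))
        controls = []
        for _ in range(n_random):
            filler = "".join(rng.choice("ACGT") for _ in motif)
            controls.append(predict(insert_at(sequence, pos, filler)))
        effects.append(with_motif - sum(controls) / n_random)
    return effects

def toy_predict(seq):
    """Hypothetical predictor: density of GCATG occurrences in the sequence."""
    return seq.count("GCATG") / len(seq)

effects = motif_effect_by_position("A" * 30, "GCATG", toy_predict)
print(len(effects), max(effects))
```

      Using random inserts at the same location as the baseline separates the effect of the motif itself from the effect of simply disrupting the original window.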

      (3) Assessing BOS sequence edits to achieve tissue-specific splicing: Here we decided to test whether BOS edits in intronic regions (at least 8b away from the nearest splice site) are important for the tissue-specific effect. The results are now included in Figure 6 Supp 1, clearly demonstrating that most of the neuronal specific changes achieved by BOS were based on changing the introns, with a strong effect observed for both up and downstream intron edits.

      (2) The transformer encoders in TrASPr model sequences with a rather limited sequence size of 200 bp; therefore, for long introns, the model will not have good coverage of the intronic sequence. This is not expected to be an issue for exons.

      The reviewer is raising a good question here. On one hand, one may hypothesize that, as the reviewer seems to suggest, TrASPr may not do well on long introns as it lacks the full intronic sequence.

      Conversely, one may also hypothesize that for long introns, where the flanking exons are outside the window of SpliceAI/Pangolin, TrASPr may have an advantage.

      Given this good question and a related one by Reviewer #2, we divided prediction accuracy by intron length and the alternative exon length.

      For short exons (<100bp) we find TrASPr and Pangolin perform similarly, but for longer exons, especially those > 200bp, TrASPr results are better. When dividing samples by the total length of the upstream and downstream intron, we find TrASPr outperforms all other models for introns of combined length up to 6K, but Pangolin gets better results when the combined intron length is over 10K. This latter result is interesting as it means that, contrary to the second hypothesis laid out above, Pangolin’s performance did not degrade for events where the flanking exons were outside its field of view. We note that all of the above holds whether we assess all events or just cases of tissue-specific changes. It is interesting to think about the mechanistic causes for this. For example, it is possible that cassette exons involving very long introns evoke a different splicing mechanism where the flanking exons are not as critical and/or there is more signal in the introns which is missed by TrASPr. We now include these new results as Figure 2 - Supp 1,2 and discuss them in the main text.
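
      A breakdown of accuracy by length can be computed along the following lines. This is a sketch under assumptions: the bin edges mirror the exon thresholds quoted above, while the function name, the minimum bin size, and the toy values are hypothetical.

```python
import numpy as np

def pearson_by_length_bin(lengths, observed, predicted, edges, min_events=3):
    """Pearson correlation between observed and predicted PSI within each
    length bin. Bin i holds lengths with edges[i-1] <= length < edges[i]."""
    bins = np.digitize(lengths, edges)
    results = {}
    for b in np.unique(bins):
        mask = bins == b
        if mask.sum() >= min_events:  # skip bins too small to correlate
            results[int(b)] = float(np.corrcoef(observed[mask], predicted[mask])[0, 1])
    return results

# Toy values (hypothetical), three events per bin.
lengths = np.array([50, 60, 70, 150, 160, 170, 250, 300, 400])
observed = np.array([0.1, 0.5, 0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7])
predicted = 0.5 * observed + 0.1  # perfectly linear toy predictions
per_bin = pearson_by_length_bin(lengths, observed, predicted, edges=[100, 200])
print(per_bin)
```

      The same function applies to combined intron length by passing different edges, for example `[6000, 10000]`.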

      (3) In the context of sequence design, creating a desired tissue- or condition-specific effect would likely require disrupting or creating motifs for splicing regulatory proteins. In your experiments for neuronal-specific Daam1 exon 16, have you seen evidence for that? Most of the edits are close to splice junctions, but a few are further away.

      That is another good question. Regarding Daam1 exon 16, in the original paper describing the mutation locations some motif similarities were noted to PTB (CU) and CUG/Mbnl-like elements (Barash et al Nature 2010). In order to explore this question beyond this specific case we assessed the importance of intronic edits by BOS to achieve a tissue specific splicing profile - see above.

      (4) For sequence design of a tissue- or condition-specific effect in neuronal-specific Daam1 exon 16, the upstream exonic splice junction had the most sequence edits. Is that a general observation? How about the relative importance of the four transformer regions in TrASPr prediction performance?

      This is another excellent question. Please see new experiments described above for RBP positional effect and BOS edits in intronic regions which attempt to give at least partial answers to these questions. We believe a much more systematic analysis can be done to explore these questions but such evaluation is beyond the scope of this work.

      (5) The idea of lightweight transformer models is compelling, and is widely applicable. It has been used elsewhere. One paper that came to mind in the protein realm:

      Singh, Rohit, et al. "Learning the language of antibody hypervariability." Proceedings of the National Academy of Sciences 122.1 (2025): e2418918121.

      We definitely do not claim that this approach of using lighter, dedicated models instead of a large ‘foundation’ model has not been taken before. We believe Singh et al., mentioned above, represents a somewhat different approach, where their model (AbMAP) fine-tunes large general protein foundational models (PLM) for antibody-sequence inputs by supervising on antibody structure and binding specificity examples. We added a description of this modeling approach citing the above work and another one which specifically handles RNA splicing (intron retention, PMID: 39792954).

      Reviewer #2 (Public review):

      Summary:

      The authors present a transformer-based model, TrASPr, for the task of tissue-specific splicing prediction (with experiments primarily focused on the case of cassette exon inclusion) as well as an optimization framework (BOS) for the task of designing RNA sequences for desired splicing outcomes.

      For the first task, the main methodological contribution is to train four transformer-based models on the 400bp regions surrounding each splice site, the rationale being that this is where most splicing regulatory information is. In contrast, previous work trained one model on a long genomic region. This new design should help the model capture more easily interactions between splice sites. It should also help in cases of very long introns, which are relatively common in the human genome.

      TrASPr's performance is evaluated in comparison to previous models (SpliceAI, Pangolin, and SpliceTransformer) on numerous tasks including splicing predictions on GTEx tissues, ENCODE cell lines, RBP KD data, and mutagenesis data. The scope of these evaluations is ambitious; however, significant details on most of the analyses are missing, making it difficult to evaluate the strength of the evidence. Additionally, state-of-the-art models (SpliceAI and Pangolin) are reported to perform extremely poorly in some tasks, which is surprising in light of previous reports of their overall good prediction accuracy; the reasoning for this lack of performance compared to TrASPr is not explored.

      In the second task, the authors combine Latent Space Bayesian Optimization (LSBO) with a Transformer-based variational autoencoder to optimize RNA sequences for a given splicing-related objective function. This method (BOS) appears to be a novel application of LSBO, with promising results on several computational evaluations and the potential to be impactful on sequence design for both splicing-related objectives and other tasks.

      We thank Reviewer #2 for this detailed summary and positive view of our work. It seems the main issue raised in this summary regards the evaluations: the reviewer finds details of the evaluations missing and finds it surprising that SpliceAI and Pangolin perform poorly on some of the tasks. We made a concerted effort to include the required details, including code and data tables. In short, some of the concerns were addressed by adding additional evaluations, some by clarifying missing details, and some by better explaining where Pangolin and SpliceAI may excel vs. settings where these may not do as well. More details are given below.

      Strengths:

      (1) A novel machine learning model for an important problem in RNA biology with excellent prediction accuracy.

      (2) Instead of being based on a generic design as in previous work, the proposed model incorporates biological domain knowledge (that regulatory information is concentrated around splice sites). This way of using inductive bias can be important to future work on other sequence-based prediction tasks.

      Weaknesses:

      (1) Most of the analyses presented in the manuscript are described in broad strokes and are often confusing. As a result, it is difficult to assess the significance of the contribution.

      We made an effort to make the tasks specific and detailed, including making the code and data for those available. We believe this helped improve clarity in the revised version.

      (2) As more and more models are being proposed for splicing prediction (SpliceAI, Pangolin, SpliceTransformer, TrASPr), there is a need for establishing standard benchmarks, similar to those in computer vision (ImageNet). Without such benchmarks, it is exceedingly difficult to compare models. For instance, Pangolin was apparently trained on a different dataset (Cardoso-Moreira et al. 2019), and using a different processing pipeline (based on SpliSER) than the ones used in this submission. As a result, the inferior performance of Pangolin reported here could potentially be due to subtle distribution shifts. The authors should add a discussion of the differences in the training set, and whether they affect your comparisons (e.g., in Figure 2). They should also consider adding a table summarizing the various datasets used in their previous work for training and testing. Publishing their training and testing datasets in an easy-to-use format would be a fantastic contribution to the community, establishing a common benchmark to be used by others.

      There are several good points to unpack here. Starting from the last one, we very much agree that a standard benchmark will be useful to include. For tissue specific splicing quantification we used the GTEx dataset from which we select six representative human tissues (heart, cerebellum, lung, liver, spleen, and EBV-transformed lymphocytes). In total, we collected 38394 cassette exon events quantified across 15 samples (here a ‘sample’ is a cassette exon quantified in two tissues) from the GTEx dataset with high-confidence quantification for their PSIs based on MAJIQ. A detailed description of how this data was derived is now included in the Methods section, and the data itself is made available via the bitbucket repository with the code.
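
      The count of 15 “samples” is consistent with taking every unordered pair of the six tissues, which is an assumption on our reading rather than something stated outright in the response. A quick check:

```python
from itertools import combinations

tissues = [
    "heart", "cerebellum", "lung", "liver", "spleen",
    "EBV-transformed lymphocytes",
]
# Assumption: a "sample" pairs two distinct tissues, order ignored.
tissue_pairs = list(combinations(tissues, 2))
print(len(tissue_pairs))  # 15
```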

      Next, regarding the usage of different data and distribution shifts for Pangolin: The reviewer is right to note there are many differences between how Pangolin and TrASPr were trained. This makes it hard to determine whether the improvements we saw are not just a result of different training data/labels. To address this issue, we first tried to fine-tune the pre-trained Pangolin with MAJIQ’s PSI dataset: we used the subset of the GTEx dataset described above, focusing on the three tissues analyzed in Pangolin’s paper (heart, cerebellum, and liver) for a fair comparison. In total, we obtained 17,218 events, and we followed the same training and test split as reported in the Pangolin paper. We obtained a Pearson correlation of 0.78 and a Spearman correlation of 0.68, values similar to what we got without this extra fine-tuning. Next, we retrained Pangolin from scratch, with the full set of tissues and training set used for TrASPr, which was derived from MAJIQ’s quantifications. Since our model was only trained on human data with 6 tissues at the same time, we modified Pangolin from its original 4 splice site usage outputs to 6 PSI outputs. We tried taking the sequence centered on either the first or the second splice site of the middle exon. This test resulted in low performance (3’ SS: Pearson 0.21; 5’ SS: Pearson 0.26).

      The above tests are obviously not exhaustive but their results suggest that the differences we observe are unlikely to be driven by distribution shifts. Notably, the original Pangolin was trained on much more data (four species, four tissues each, and sliding windows across the entire genome). This training seems to be important for performance while the fact we switched from Pangolin’s splice site usage to MAJIQ’s PSI was not a major contributor. Other potential reasons for the improvements we observed include the architecture, target function, and side information (see below) but a complete delineation of those is beyond the scope of this work. 

      (3) Related to the previous point, as discussed in the manuscript, SpliceAI, and Pangolin are not designed to predict PSI of cassette exons. Instead, they assign a "splice site probability" to each nucleotide. Converting this to a PSI prediction is not obvious, and the method chosen by the authors (averaging the two probabilities (?)) is likely not optimal. It would be interesting to see what happens if an MLP is used on top of the four predictions (or the outputs of the top layers) from SpliceAI/Pangolin. This could also indicate where the improvement in TrASPr comes from: is it because TrASPr combines information from all four splice sites? Also, consider fine-tuning Pangolin on cassette exons only (as you do for your model).

      Please see the above response. We did not investigate more sophisticated models that adjust Pangolin’s architecture further as such modifications constitute new models which are beyond the scope of this work.

      (4) L141, "TrASPr can handle cassette exons spanning a wide range of window sizes from 181 to 329,227 bases - thanks to its multi-transformer architecture." This is reported to be one of the primary advantages compared to existing models. Additional analysis should be included on how TrASPr performs across varying exon and intron sizes, with comparison to SpliceAI, etc.

      This was a good suggestion, related to another comment made by Reviewer #1. Please see above our response to them with a breakdown by exon/intron length.

      (5) L171, "training it on cassette exons". This seems like an important point: previous models were trained mostly on constitutive exons, whereas here the model is trained specifically on cassette exons. This should be discussed in more detail.

      Previous models were not trained exclusively on constitutive exons and Pangolin specifically was trained with their version of junction usage across tissues. That said, the reviewer’s point is valid (and similar to ones made above) about a need to have a matched training/testing and potential distribution shifts. Please see response and evaluations described above. 

      (6) L214, ablations of individual features are missing.

      These were now added to the table which we moved to the main text (see table also below).

      (7) L230, "ENCODE cell lines", it is not clear why other tissues from GTEx were not included.

      Good question. The task here was to assess predictions in unseen conditions, hence we opted to test on completely different data of human cell lines rather than additional tissue samples. Following the reviewer’s suggestion, we also evaluated predictions on two additional GTEx tissues, Cortex and Adrenal Gland. These new results, as well as the previous ones for ENCODE, were updated to use the PCA-based embedding of the “RBP state” as described above. We also compared the predictions using the PCA-based embedding of the “RBP state” to training directly on data (not the test data, of course) from these tissues. See updated Figure 3a,b and Figure 3 Supp 1,2.

      (8) L239, it is surprising that SpliceAI performs so badly, and might suggest a mistake in the analysis. Additional analysis and possible explanations should be provided to support these claims. Similarly, the complete failure of SpliceAI and Pangolin is shown in Figure 4d.

      Line 239 refers to predicting relative inclusion levels between competing 3’ and 5’ splice sites. We admit we too expected this to be better for SpliceAI and Pangolin but we were not able to find bugs in our analysis (which is all made available for readers and reviewers alike). Regarding this expectation to perform better, first we note that we are not aware of a similar assessment being done for either of those algorithms (i.e. relative inclusion for 3’ and 5’ alternative splice site events). Instead, our initial expectation, and likely the reviewer’s as well, was based on their detection of splice site strengthening/weakening due to mutations, including cryptic splice site activation. More generally though, it is worth noting in this context that given how SpliceAI, Pangolin and other algorithms have been presented in papers/media/scientific discussions, we believe there is a potential misperception regarding tasks that SpliceAI and Pangolin excel at vs other tasks where they should not necessarily be expected to excel. Both algorithms focus on cryptic splice site creation/disruption. This has been the focus of those papers and subsequent applications.  While Pangolin added tissue specificity to SpliceAI training, the authors themselves admit “...predicting differential splicing across tissues from sequence alone is possible but remains a considerable challenge and requires further investigation”. The actual performance on this task is not included in Pangolin’s main text, but we refer Reviewer #2 to supplementary figure S4 in the Pangolin manuscript to get a sense of Pangolin’s reported performance on this task. Similar to that, Figure 4d in our manuscript is for predicting ‘tissue specific’ regulators. We do not think it is surprising that SpliceAI (tissue agnostic) and Pangolin (slight improvement compared to SpliceAI in tissue specific predictions) do not perform well on this task. Similarly, we do not find the results in Figure 4C surprising either. 
These are for mutations that slightly alter the inclusion level of an exon, not something SpliceAI was trained on; SpliceAI was trained on genomic splice sites with yes/no labels across the genome. As noted elsewhere in our response, re-training Pangolin on this mutagenesis dataset results in performance much closer to that of TrASPr. That is to be expected as well: Pangolin is constructed to capture changes in PSI (or splice site usage as defined by the authors), those changes are not even tissue-specific for the CD19 data, and the model has no problem or lack of capacity generalizing from the training set, just as TrASPr does. In fact, if you only use combinations of known mutations seen during training, a simple regression model gives a correlation of ~92-95% (Cortés-López et al 2022). In summary, we believe that a better understanding of what one can realistically expect from models such as SpliceAI, Pangolin, and TrASPr will go a long way toward having them better understood and used effectively. We have tried to make this clearer in the revision.

      (9) BOS seems like a separate contribution that belongs in a separate publication. Instead, consider providing more details on TrASPr.

      We thank the reviewer for the suggestion. We agree those are two distinct contributions/algorithms and we indeed considered having them as two separate papers. However, there is strong coupling between the design algorithm (BOS) and the predictor that enables it (TrASPr). This coupling is both conceptual (TrASPr as a “teacher”) and practical in terms of evaluations. While we use experimental data (experiments done involving Daam1 exon 16, CD19 exon 2) we still rely heavily on evaluations by TrASPr itself. A completely independent evaluation would have required a high-throughput experimental system to assess designs, which is beyond the scope of the current paper. For those reasons we eventually decided to make it into what we hope is a more compelling combined story about generative models for prediction and design of RNA splicing.

      (10) The authors should consider evaluating BOS using Pangolin or SpliceTransformer as the oracle, in order to measure the contribution to the sequence generation task provided by BOS vs TrASPr.

      We can definitely see the logic behind trying BOS with different predictors. That said, as we note above most of BOS evaluations are based on the “teacher”. As such, it is unclear what value replacing the teacher would bring. We also note that given this limitation we focus mostly on evaluations in comparison to existing approaches (genetic algorithm or random mutations as a strawman). 

      Recommendations for the authors: 

      Reviewer #1 (Recommendations for the authors):

      Additional comments:

      (1) Is your model picking up transcription factor binding sites in addition to RBPs? TFs have been recently shown to have a role in splicing regulation:

      Daoud, Ahmed, and Asa Ben-Hur. "The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models." PLOS Computational Biology 21.1 (2025): e1012755.

      We agree this is an interesting point to explore, especially given the series of works from the Ben-Hur group. We note though that these works focus on intron retention (IR), which we haven’t focused on here, and we only cover short intronic regions flanking the exons. We leave this as a future direction as we believe the scope of this paper is already quite extensive.

      (2) SpliceNouveau is a recently published algorithm for the splicing design problem:

      Wilkins, Oscar G., et al. "Creation of de novo cryptic splicing for ALS and FTD precision medicine." Science 386.6717 (2024): 61-69.

      Thank you for pointing out the recent publication by Wilkins et al.; we now refer to it as well.

      (3) Please discuss the relationship between your model and this deep learning model. You will also need to change the following sentence: "Since the splicing sequence design task is novel, there are no prior implementations to reference."

      We revised this statement and now refer to several recent publications that propose similar design tasks.  

      (4) I would suggest adding a histogram of PSI values - they appear to be mostly close to 1 or 0.

      PSI values are indeed typically close to either 0 or 1. This is a known phenomenon illustrated in previous studies of splicing (e.g. Shen et al NAR 2012 ). We are not sure what is meant by the comment to add a histogram but we made sure to point this out in the main text: 

      “...Still, those statistics are dominated by extreme values, such that 33.2% are smaller than 0.15 and 56.0% are higher than 0.85. Furthermore, most cassette exons do not change between a given tissue pair (only 14.0% of the samples in the dataset, i.e., a cassette exon measured across two tissues, exhibit |ΔΨ| ≥ 0.15).”

      (5) Part of the improvement of TrASPr over Pangolin could be the result of a more extensive dataset.

      Please see above responses and new analysis.

      (6) In the discussion of the roles of alternative splicing, protein diversity is mentioned, but I suggest you also mention the importance of alternative splicing as a regulatory mechanism:

      Lewis, Benjamin P., Richard E. Green, and Steven E. Brenner. "Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans." Proceedings of the National Academy of Sciences 100.1 (2003): 189-192.

      Thank you for the suggestion. We added that point and citation. 

      (7) Line 96: You use dPSI without defining it (although quite clear that it should be Delta PSI).

      Fixed.

      (8) Pretrained transformers: Have you trained separate transformers on acceptor and donor sites, or a single splice junction transformer?

      Single splice junction pre-training.

      (9) "TrASPr measures the probability that the splice site in the center of Se is included in some tissue" - that's not my understanding of what TrASPr is designed to do.

      We revised the above sentence to make it more precise: “Given a genomic sequence context S<sub>e</sub> = (s<sub>e</sub>,...,s<sub>e</sub>), made of  a cassette exon e and flanking intronic/exonic regions, TrASPr predicts for tissue c the fraction of transcripts where exon e is included or skipped over, ΔΨ-<sub>e,c,c’</sub>.”

      (10) Please include the version of the human genome annotations that you used. 

      We used GENCODE v40 with the human genome assembly hg38; this is now included in the Data section.

      (11) I did not see a description of the RBP-AE component in the methods section. A bit more detail on the model would be useful as well.

      Please see above details about replacing RBP-AE with a simpler linear PCA “RBP-State” encoding. We added details about how the PCA was performed to the Methods section.

      (12) Typos, grammar:

      -   Fix the following sentence: ATP13A2, a lysosomal transmembrane cation transporter, linked to an early-onset form of Parkinson's Disease (PD) when 306 loss-of-function mutations disrupt its function.

      Sentence was fixed to now read: “The first example is of a brain cerebellum-specific cassette exon skipping event predicted by TrASPr in the ATP13A2 gene (aka PARK9). ATP13A2 is a lysosomal transmembrane cation transporter, for which loss of function mutation has been linked to early-onset of Parkinson’s Disease (PD)”.

      -   Line 501: "was set to 4e−4"(the - is a superscript). 

      Fixed

      -   A couple of citations are missing in lines 580 and 581.

      Thank you for catching this error. Citations in line 580, 581 were fixed.

      (13) Paper title: Generative modeling for RNA splicing predictions and design - it would read better as "Generative modeling for RNA splicing prediction and design", as you are solving the problems of splicing prediction and splicing design.  

      Thank you for the suggestion. We updated the title and removed the plural form.

      Reviewer #2 (Recommendations for the authors):

      (1) Appendices are not very common in biology journals. It is also not clear what purpose the appendix serves exactly - it seems to repeat some of the things said earlier. Consider merging it into the methods or the main text. 

      We merged the appendices into the Methods section and removed redundancy.

      (2) L112, "For instance, the model could be tasked with designing a new version of the cassette exon, restricted to no more than N edit locations and M total base changes." How are N and M different? Is there a difference between an edit location and a base change? 

      Yes, N is the number of locations (one can think of it as a start position) of various lengths (e.g. a SNP is of length 1) and the total number of positions edited is M. The text now reads “For instance, the model could be tasked with designing a new version of the cassette exon, restricted to no more than N edit locations (i.e., start position of one or more consecutive bases) and M total base changes.”
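
      The distinction between edit locations and base changes can be illustrated with a minimal sketch. It assumes substitutions only (equal-length sequences) and treats each run of consecutive changed bases as one location; the function names `count_edits` and `within_budget` are hypothetical and are not taken from the BOS code.

```python
def count_edits(original: str, designed: str) -> tuple[int, int]:
    """Return (edit locations, total base changes) for two equal-length sequences.

    An edit location is a run of one or more consecutive changed bases;
    a base change is any single position that differs.
    """
    if len(original) != len(designed):
        raise ValueError("this sketch only handles substitutions (equal lengths)")
    locations = 0
    changes = 0
    in_run = False
    for a, b in zip(original, designed):
        if a != b:
            changes += 1
            if not in_run:
                locations += 1  # a new run of edited bases starts here
            in_run = True
        else:
            in_run = False
    return locations, changes


def within_budget(original: str, designed: str, max_locations: int, max_changes: int) -> bool:
    """Check a design against the N (locations) and M (base changes) limits."""
    n, m = count_edits(original, designed)
    return n <= max_locations and m <= max_changes
```

      For example, changing two adjacent bases and one distant base counts as two edit locations and three base changes.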

      (3) L122: "DEN was developed for a distinct problem". What prevents one from adapting DEN to your sequence design task? The method should be generic. I do not see what "differs substantially" means here. (Finally, wasn't DEN developed for the task you later refer to as "alternative splice site" (as opposed to "splice site selection")? Use consistent terminology. And in L236 you use "splice site variation" - is that also the same?).

      Indeed, our original description was not clear/precise enough. DEN was designed and trained for two tasks: APA, and 5’ alternative splice site usage. The terms “selection”, “usage”, and “variation” were indeed used interchangeably in different locations and the reviewer was right, noting the lack of precision. We have now revised the text to make sure the term “relative usage” is used. 

      Nonetheless, we hold that DEN was indeed defined for different tasks. See Figures 2A and 6A of Linder et al 2020 (the reference was also incorrect, as we cited the preprint and not the final paper):

      In both cases DEN is trying to optimize a short region for selecting an alternative PA site (left) or a 5’ splice site (right). This work focused on an MPRA dataset of short synthetic sequences inserted in the designated region for train/test. We hold this is indeed a different type of data and task than the one we focus on here. Yes, one can potentially adapt DEN for our task, but this is beyond the scope of this paper. Finally, we note that a more closely related algorithm recently proposed is Ledidi (Schreiber et al 2025), which was posted as a pre-print. Similar to BOS, Ledidi tries to optimize a given sequence and adapt it with a few edits for a given task. Regardless, we updated the main text to make the differences between DEN and the task we defined here for BOS clearer, and we also added a reference to Ledidi and other recent works in the discussion section.

      (4) L203, exons with DeltaPSI very close to 0.15 are going to be nearly impossible to classify (or even impossible, considering that the DeltaPSI measurements are not perfect). Consider removing such exons to make the task more feasible.

      Yes, this is how it was done. As described in more detail below, we defined changing samples as ones where the change was >= 0.15 and non-changing as ones where the change in PSI was < 0.05 to avoid ambiguous cases affecting the classification task.

      (5) L230, RBP-AE is not explained in sufficient detail (and does not appear in the methods, apparently). It is not clear how exactly it is trained on each new cellular condition.

      Please see the response at the opening of this document and Q11 from Reviewer 1.

      (6) L230, "significantly improving": the r value actually got worse; it is therefore not clear you can claim any significant improvement. Please mention that fact in the text.

      This is a fair point. We note that we view the “a” statistic as potentially more interesting/relevant here as the Pearson “r” is dominated by points being generally close to 0/1.  Regardless, revisiting this we realized one can also make a point that the term “significant” is imprecise/misplaced since there is no statistical test done here (side note: given the amount of points, a simple null of same distribution yes/no would pass significance but we don’t think this is an interesting/relevant test here). Also, we note that with the transition to PCA instead of RBP-AE we actually get improvements in both a and r values, both for the ENCODE samples shown in Figure 3a and the two new GTEX tissues we tested (see above). We now changed the text to simply state: 

      “...As shown in Figure 3a, this latent space representation allows TrASPr to generalize from the six GTEX tissues to unseen conditions, including unseen GTEX tissues (top row), and ENCODE cell lines (bottom row). It improves prediction accuracy compared to TrASPr lacking PCA (e.g. a=88.5% vs a=82.3% for ENCODE cell lines), though naturally training on the additional GTEX and ENCODE conditions can lead to better performance (e.g. a=91.7%, for ENCODE, Figure 3a left column).”

      (7) L233, "Notably, previous splicing codes focused solely on cassette exons", Rosenberg et al. focused solely on alternative splice site choice.

      Right - we removed that sentence.

      (8) L236, "trained TrASPr on datasets for 3' and 5' splice site variations". Please provide more details on this task. What is the input to TrASPr and what is the prediction target (splice site usage, PSI of alternative isoforms)? What datasets are used for this task?

      The data for this task was the same processed GTEx tissue data, just for alternative 3’ and 5’ splice site events. We revised the description of this task in the main text and added information in the Methods section. The data is also included in the repo.

      (9) L243, "directly from genomic sequences", and conservation?

      Yes, we changed the sentence to read “...directly from genomic sequences combined with related features” 

      (10) L262, what is the threshold for significant splicing changes?

      The threshold is 0.15. We updated the main text to read the following:

      The total number of mutations hitting each of the 1198 genomic positions across the 6106 sequences is shown in Figure 4b (left), while the distribution of effects (|ΔΨ|) observed across those 6106 samples is shown in Figure 4b (right). To this data we applied three testing schemes. The first is a standard 5-fold CV where 20% of combinations of point mutations were hidden in every fold, while the second test involved 'unseen mutation' (UM) where we hide any sample that includes mutations in specific positions, for a total of 1480 test samples. As illustrated by the CDF in Figure 4b, most samples (each sample may involve multiple positions mutated) do not involve significant splicing changes. Thus, we also performed a third test using only the 883 samples where mutations cause significant changes (|ΔΨ| ≥ 0.15).

      (11) L266, Pangolin performance is only provided for one of the settings (and it is not clear which). Please provide details of its performance in all settings.

      The description was indeed not clear. Pangolin’s performance was similar to SpliceAI as mentioned above but retraining it on the CD19 data yielded much closer performance to TrASPr. We include all the matching tests for Pangolin after retraining in Figure 4 Supp Figure 1. 

      (12) Please specify "n=" in all relevant plots. 

      Fixed.

      (13) Figure 3a, "The tissues were first represented as tokens, and new cell line results were predicted based on the average over conditions during training." Please explain this procedure in more detail. What are these tokens and how are they provided to the model? Are the cell line predictions the average of the predictions for the training tissues?

      Yes, we compared to simply the average over the predictions for the training tissues for that specific event as baseline to assess improvements (see related work pointing for the need to have similar baselines in DL for genomics in https://pubmed.ncbi.nlm.nih.gov/33213499/). Regarding the tokens - we encode each tissue type as a possible value and feed the two tissues as two tokens to the transformer.
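
      The baseline described above amounts to a per-event average, which can be written as a short sketch. The function name `mean_tissue_baseline` and the dictionary layout are assumptions made for illustration; the tissue names in the usage example are placeholders, not the six GTEx tissues used for training.

```python
def mean_tissue_baseline(predictions_by_tissue: dict[str, float]) -> float:
    """Baseline PSI for an unseen condition: the mean of the model's
    predictions for one cassette event across the training tissues."""
    if not predictions_by_tissue:
        raise ValueError("need at least one training-tissue prediction")
    values = list(predictions_by_tissue.values())
    return sum(values) / len(values)


# Hypothetical predictions for a single event in three training tissues.
baseline_psi = mean_tissue_baseline({"heart": 0.2, "liver": 0.4, "lung": 0.6})
```

      A condition-aware model is only informative for unseen cell lines to the extent that it beats this event-level average.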

      (14) Figure 4b, the total count in the histogram is much greater than 6106. Please explain the dataset you're using in more detail, and what exactly is shown here.

      We updated the text to read: 

      “...we used 6106 sequence samples where each sample may have multiple positions mutated (\ie mutation combinations) in exon 2 of CD19 and its flanking introns and exons (Cortes et al 2022). The total number of mutations hitting each of the 1198 genomic positions across the 6106 sequences is shown in Figure 4b (left).”

      (15) Figure 5a, how are the prediction thresholds (TrASPr passed, TrASPr stringent, and TrASPr very stringent) defined?

      Passed: dpsi>0.1, Stringent: dpsi>0.15, Very stringent: dpsi>0.2. This is now included in the main text.

      (16) L417, please include more detail on the relative size of TrASPr compared to other models (e.g. number of parameters, required compute, etc.).

      SpliceAI is a general-purpose splicing predictor with a 32-layer deep residual neural network to capture long-range dependencies in genomic sequences. Pangolin is a deep learning model specifically designed for predicting tissue-specific splicing, with a similar architecture to SpliceAI. The implementation of SpliceAI that can be found here https://huggingface.co/multimolecule/spliceai involves an ensemble of 5 such models for a total of ~3.5M parameters. TrASPr has 4 BERT transformers (each with 6 layers and 12 heads) and an MLP on top of those, for a total of ~189M parameters. Evo 2, a genomic ‘foundation’ model, has 40B parameters, DNABERT has ~86M (a single BERT with 12 layers and 12 heads), and Borzoi has 186M parameters (as stated in https://www.biorxiv.org/content/10.1101/2025.05.26.656171v2). We note that the difference here is not just in model size but also in the amount of data used to train the model. We edited the original L417 to reflect that.

      (17) L546, please provide more detail on the VAE. What is the dimension of the latent representation?

      We added more details in the Methods section like the missing dimension (256) and definitions for P(Z) and P(S). 

      (18) Consider citing (and possibly comparing BOS to) Ghari et al., NeurIPS 2024 ("GFlowNet Assisted Biological Sequence Editing").

      Added.

      (19) Appendix Figure 2, and corresponding main text: it is not clear what is shown here. What is dPSI+ and dPSI-? What pairs of tissues are you comparing? Spearman correlation is reported instead of Pearson, which is the primary metric used throughout the text.

      The dPSI+ and dPSI- sets were indeed not well defined in the original submission. Moreover, we found our own code lacked consistency due to different tests executed at different times/by different people. We apologize for this lack of consistency and clarity, which we worked to remedy in the revised version. To answer the reviewer’s question, given two tissues ($c1,c2$), dPSI+ and dPSI- are tests for correctly classifying the exons that are significantly differentially included or excluded. Specifically, differentially included exons are those for which $\Delta \Psi_{e,c1,c2} = \Psi_{e,c1} - \Psi_{e,c2} \geq 0.15$, compared to those that are not ($\Delta \Psi_{e,c1,c2} < 0.05$). Similarly, dPSI- is for correctly classifying the exons that are significantly differentially excluded in the first tissue or included in the second tissue ($\Delta \Psi_{e,c1,c2} = \Psi_{e,c1} - \Psi_{e,c2} \leq -0.15$) compared to those that are not ($\Delta \Psi_{e,c1,c2} > -0.05$). This means dPSI+ and dPSI- are dependent on the order of c1, c2. In addition, we also define a direction/order agnostic test for changing vs non-changing events, i.e. $|\Delta \Psi_{e,c1,c2}| \geq 0.15$ vs $|\Delta \Psi_{e,c1,c2}| < 0.05$. These test definitions are consistent with previous publications (e.g. Barash et al Nature 2010, Jha et al 2017) and also answer different biological questions: for example, “Exons that go up in brain” and “Exons that go up in Liver” can reflect distinct mechanisms, while changing exons capture a model’s ability to identify regulated exons even if the direction of prediction may be wrong. The updated Appendix Figure 2 is now in the main text as Figure 2d and uses Pearson, while AUPRC and AUROC refer to the changing vs non-changing classification task described above, such that we avoid dPSI+ and dPSI- when summarizing in this table over 3 pairs of tissues.
Finally, we note that making sure all tests comply with the above definition also resulted in an update to Figure 2b/c labels and values, where TrASPr’s improvement over Pangolin reaches up to 1.8-fold in AUPRC compared to 2.4-fold in the earlier version. We again apologize for the lack of clarity and consistent evaluations in the original submission.
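
      The three test definitions above can be summarized in a minimal labeling sketch. The thresholds (0.15 and 0.05) come from the response; the function name `label_sample`, the dictionary keys, and the use of `None` for ambiguous samples excluded from a test are assumptions made for illustration and do not reflect the authors' code.

```python
from typing import Optional

CHANGING = 0.15      # |dPSI| at or above this counts as a change
NON_CHANGING = 0.05  # |dPSI| below this counts as no change


def _label(is_positive: bool, is_negative: bool) -> Optional[bool]:
    """True/False for positives/negatives, None for ambiguous samples."""
    if is_positive:
        return True
    if is_negative:
        return False
    return None


def label_sample(psi_c1: float, psi_c2: float) -> dict[str, Optional[bool]]:
    """Label one sample (a cassette exon measured in tissues c1 and c2)
    for the dPSI+, dPSI- and order-agnostic 'changing' tests."""
    dpsi = psi_c1 - psi_c2  # order matters for dPSI+ and dPSI-
    return {
        "dPSI+": _label(dpsi >= CHANGING, dpsi < NON_CHANGING),
        "dPSI-": _label(dpsi <= -CHANGING, dpsi > -NON_CHANGING),
        "changing": _label(abs(dpsi) >= CHANGING, abs(dpsi) < NON_CHANGING),
    }
```

      Swapping the tissue order flips the dPSI+ and dPSI- labels but leaves the changing label untouched, which is why only the latter is order agnostic.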

      (20) Minor typographical comments:

      -   Some plots could use more polishing (e.g., thicker stroke, bigger font size, consistent style (compare 4a to the other plots)...).

      Agreed. While not critical for the science itself we worked to improve figure polishing in the revision to make those more readable and pleasant. 

      -   Consider using 2-dimensional histograms instead of the current kernel density plots, which tend to over-smooth the data and hide potentially important details. 

      We were not sure what the exact suggestion is here and opted to leave the plots as is.

      -   L53: dPSI_{e, c, c'} is never formally defined. Is it PSI_{e, c} - PSI_{e, c'} or vice versa?  

      Definition now included (see above).

      -   L91: Define/explain "transformer" and provide reference. 

      We added the explanation and related reference of the transformer in the introduction section and BERT in the method section.  

      -   L94: exons are short. Are you referring here to the flanking introns? Please explain. 

      We apologize for the lack of clarity. We are referring to a cassette exon alternative splicing event as is commonly defined by the splice junctions involved, that is, from the 5’ SS of the upstream exon to the 3’ SS of the downstream exon. The text now reads:

      “...In contrast, 24% of the cassette exons analyzed in this study span a region between the flanking exons' upstream 3' and downstream 5' splice sites that are larger than 10 kb.”

      -   L132: It's unclear whether a single, shared transformer or four different transformers (one for each splice site) are being pre-trained. One would at least expect 5' and 3' splice sites to have a different transformer. In Methods, L506, it seems that each transformer is pre-trained separately. 

      We updated the text to read:

      “We then center a dedicated transformer around each of the splice sites of the cassette exon and its upstream and downstream (competing) exons (four separate transformers for four splice sites in total).”

      -   L471: You explain here that it is unclear what tasks 'foundation' models are good for. Also in L128, you explain that you are not using a 'foundation' model. But then in L492, you describe the BERT model you're using as a foundation model! 

      Line 492 was simply a poor choice of wording as “foundation” is meant here simply as the “base component”. We changed it accordingly.

      -   L169, "pre-training ... BERT", explain what exactly this means. Is it using masking? Is it self-supervised learning? How many splice sites do you provide? Also explain more about the BERT architecture and provide references. 

      We added more details about the BERT architecture and training in the Methods section.

      -   L186 and later, the values for a and r provided here and in the below do not correspond to what is shown in Figure 2. 

      Fixed, thank you for noticing this.

      -   L187,188: What exactly do you mean by "events" and "samples"? Are they the same thing? If so, are they (exon, tissue) pairs? Please use consistent terminology. Moreover, when you say "changing between two conditions": do you take all six tissues whenever there is a 0.15 spread in PSI among them? Or do you take just the smallest PSI tissue and the largest PSI tissue when there is a 0.15 spread between them? Or something else altogether?

      Reviewer #2 is yet again correct that the definitions were not precise. A “sample” involves a specific exon skipping “event” measured in two tissues.  The text now reads: 

      “....most cassette exons do not change between a given tissue pair (only 14.0% of the samples in the dataset, i.e., a cassette exon measured across two tissues, exhibit |∆Ψ| ≥ 0.15). Thus, when we repeat this analysis only for samples involving exons that exhibited a change in inclusion (|∆Ψ| ≥ 0.15) between at least two tissues, performance degrades for all three models, but the differences between them become more striking (Figure 2a, right column).”

      -   Figure 1a, explain the colors in the figure legend. The 3D effect is not needed and is confusing (ditto in panel C).

      Color explanation is now added: “exons and introns are shown as blue rectangles and black lines. The blue dashed line indicates the inclusive pattern and the red junction indicates an alternative splicing pattern.” 

      These are not 3D effects but stacks to indicate multiple events/cases. We agree these are not needed in Fig1a to illustrate types of AS and removed those. However, in Fig1c and matching caption we use the stacks to  indicate HT data captures many such LSVs over which ML algorithms can be trained. 

      -   Figure 1b, this cartoon seems unnecessary and gives the wrong impression that this paper explores mechanistic aspects of splicing. The only relevant fact (RBPs serving as splicing factors) can be explained in the text (and is anyway not really shown in this figure).

      We removed Figure 1b cartoon.

      -   Figure 1c, what is being shown by the exon label "8"? 

      This was meant to convey exon ID, now removed to simplify the figure. 

      -   Figure 1e, left, write "Intron Len" in one line. What features are included under "..."? Based on the text, I did not expect more features.

      Also, the arrows emanating from the features do not make sense. Is "Embedding" a layer? I don't think so. Do not show it as a thin stripe. Finally, what are dPSI'+ and dPSI'-? are those separate outputs? are those logits of a classification task?

      We agree this description was not good and have updated it in the revised version. 

      -   Figure 1e, the right-hand side should go to a separate figure much later, when you introduce BOS.

      We appreciate the suggestion. However, we feel that Figure 1e serves as a visual representation of the entire framework. Just like we opted to not turn this work into two separate papers (though we fully agree it is a valid option that would also increase our publication count), we also prefer to leave this unified visual representation as is.

      -   Figure 2, does the n=2456 refer to the number of (exons, tissues) pairs? So each exon contributes potentially six times to this plot? Typo "approximately". 

      The “n” refers to the number of samples which is a cassette event measured in two tissues. The same cassette event may appear in multiple samples if it was confidently quantified in more than two tissues. We updated the caption to reflect this and corrected the typo.

      -   Figure 2b, typo "differentially included (dPSI+) or excluded" .

      Fixed.

      -   L221, "the DNABERT" => "DNABERT".

      Fixed.

      -   L232, missing percent sign.

      Fixed.

      -   L246, "see Appendix Section 2 for details" seems to instead refer to the third section of the appendix.

      We no longer have this as an Appendix; the reference has been updated.

      -   Figure 3, bottom panels, PSI should be "splice site usage"? 

      PSI is correct here - we hope the revised text/definitions make it more clear now.

      -   Figure 3b: typo: "when applied to alternative alternative 3'".

      Fixed.

      -   p252, "polypyrimidine" (no capitalization).

      Fixed.

      -   Strange capitalization of tissue names (e.g., "Brain-Cerebellum"). The tissue is called "cerebellum" without capitalization.

      We used EBV (capital) for the abbreviation and lower case for the rest.

      -   Figure 4c: "predicted usage" on the left but "predicted PSI" on the right. 

      Right. We opted to leave it as is since Pangolin and SpliceAI do predict their definition of “usage” and not directly PSI, we just measure correlations to observed PSI as many works have done in the past. 

      -   Figure 4 legend typo: "two three".

      Fixed.

      -   L351, typo: "an (unsupervised)" (and no need to capitalize Transformer).

      Fixed.

      -   L384, "compared to other tissues at least" => "compared to other tissues of at least".

      Fixed.

      -   L549, P(Z) and P(S) are not defined in the text.

      Fixed.

      -   L572, remove "Subsequently". Add missing citations at the end of the paragraph.

      Fixed.

      -   L580-581, citations missing.

      Fixed.

      -   L584-585, typo: "high confidince predictions"

      Fixed.

      -   L659-660, BW-M and B-WM are both used. Typo?

      Fixed.

      -   L895, "calculating the average of these two", not clear; please rewrite.

      Fixed.

      -   L897, "Transformer" and "BERT", do these refer to the same thing? Be consistent.  

      BOS is a transformer and not a BERT but TrASPr uses the BERT architecture. BERT is a type of transformer as the reviewer is surely well aware so the sentence is correct. Still, to follow the reviewer’s recommendation for consistency/clarity we changed it here to state BERT.

      -   Appendix Figure 5: The term dPSI appears to be overloaded to also represent the difference between predicted PSI and measured PSI, which is inconsistent with previous definitions. 

      Indeed! We thank the reviewer again for their sharp eye and attention to details that we missed. We changed Supp Figure 5, now Figure 4 Supplementary Figure 2, to |PSI’-PSI| and defined those as the difference between TrASPr’s predictions (PSI’) and MAJIQ based PSI quantifications.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      We thank the reviewers and editors for this peer review. Following the editorial assessment and specific review comments, in this revision we have included new analysis to support the validity of the behavioral task (Reviewer #2). We have improved data presentation by including 1) data points from individual animals (Reviewer #1, #3), 2) updated histology showing the expression of hM4Di in LC neurons as well as LC terminals in the mPFC (Reviewer #3), and 3) more detailed descriptions of methodology and data analysis (Reviewer #1, #2, #3).

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Planned t-tests should be performed in both control and experimental animals to determine if the number of trials needed to reach criterion on the ID is lower than on the ED. Based on the data analyses showing no difference among the control group, the data could be pooled to demonstrate that the task is valid. Reporting all p-values using 2 decimal points and standard language e.g., p < 0.001 would greatly improve the readability of the data. 

      Thank you for this suggestion. As pointed out by this reviewer, more trials to reach performance criterion in EDS than IDS is indicative of successful acquisition and switching of the attentional sets. Upon closer examination of the behavioral data, we exclude several sessions where more trials were taken in IDS than in EDS, and our conclusions that DREADD inhibition of the LC or LC input to the mPFC impaired rule switching in EDS remain robust (e.g., new Fig. 1e, 1h). We also pool control and test data (Fig. 1e, 1h, new Supp. Fig. 1a, 1b) to demonstrate the validity of this task (new Supp. Fig. 1c, IDS vs. EDS in the control group, 10 ± 1 trials vs. 16 ± 1 trials, P < 1e-3). The validity of set shifting is also supported by the new Fig. 1c.  

      We report p values using 2 decimal points and standard language as suggested by this reviewer.

      Relevant to the comments from Reviewer #1 in the public review, we now show individual data points on the bar charts (new Fig. 1e, 1h).  

      (2) It may also be helpful to provide the average time between CNO infusion and onset of the ED as well as information about when maximal effects are expected after these treatments.

      Systemic CNO injections were administered immediately after IDS, and we waited approximately one hour before proceeding to EDS. Maximal effects of systemic CNO activation were reported to occur after 30 minutes and last for at least 4-6 hours. Both control and test groups received the CNO injections in the same manner. This is now better described in Methods.  

      Reviewer #3 (Recommendations for the authors):

      (1) Add better histology images showing colocalization of TH and HM4Di. Quantification of colocalization would be optimal.

      We now include better histology images (new Fig. 1d) and have quantified the colocalization of TH and HM4Di in the main text (line 115-116).  

      (2) If possible, images showing HM4Di expression in mPFC axon terminals would be useful. If these are colocalized with TH immunostaining, that would increase confidence in their identity. This would be much more useful than the images provided in Figure 1C.

      We now include a new image to show hM4Di expression (mCherry) in LC terminals in the mPFC (new Fig. 1f). However, due to technical limitations (species of the primary antibody), we did not co-stain with TH.

      (3) Include behavior of mice from the miniscope experiment in Figure 2 to show they are similar to those from Figure 1.

      This is now included in Supp. Fig. 1b.

      (4) More details about the processing and segmentation of miniscope data would be helpful (e.g., how many neurons were identified from each animal?). 

      We use the standard preprocessing and segmentation pipelines in the Inscopix data processing software (version 1.6), which includes modules for motion correction and signal extraction. Briefly, raw imaging videos underwent preprocessing, including 4x spatial downsampling to reduce file size and processing time. No temporal downsampling was performed. The images were then cropped to eliminate post-registration borders and areas where cells were not visible. Prior to the calculation of the dF/F0 traces, lateral movement was corrected. For ROI identification, we used a constrained non-negative matrix factorization algorithm optimized for endoscopic data (CNMF-E) to extract fluorescence traces from ROIs. We identified 128 ± 31 neurons after manual selection, depending on recording quality and field of view. The number of neurons acquired from each animal is now included in Methods, where these steps are further elaborated (lines 405-415).

      (5) Add more methodological detail for how cell tuning was analyzed, including how z-scoring was performed (across the entire session?), and how neurons in each category were classified. 

      We have expanded the Methods section to clarify how cell tuning was analyzed (line 419-430). Calcium traces were z-scored on a per-neuron basis across the entire session. For each neuron, we computed trial-averaged activity aligned to specific task events (e.g., digging in one of the two ramekins available). A neuron was classified as responsive if its activity showed a significant difference (p < 0.05) between two conditions within the defined time window in the ROC analysis.
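
      The classification logic described above can be sketched as follows. This is a minimal illustration, not the authors' code: the response does not say how significance was obtained from the ROC analysis, so the permutation test, the two-sided criterion, and the function names `zscore_session`, `auroc` and `is_responsive` are assumptions.

```python
import numpy as np

def zscore_session(traces):
    """Z-score each neuron (row) across the entire session (columns)."""
    mu = traces.mean(axis=1, keepdims=True)
    sd = traces.std(axis=1, keepdims=True)
    return (traces - mu) / sd

def auroc(a, b):
    """Area under the ROC curve for separating condition a from condition b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Fraction of (a, b) pairs in which a exceeds b; ties count as one half.
    greater = (a[:, None] > b[None, :]).sum()
    ties = (a[:, None] == b[None, :]).sum()
    return (greater + 0.5 * ties) / (a.size * b.size)

def is_responsive(a, b, n_perm=1000, alpha=0.05, seed=0):
    """Permutation test on |auROC - 0.5| (assumed significance procedure)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(auroc(a, b) - 0.5)
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # shuffle condition labels
        if abs(auroc(perm[:a.size], perm[a.size:]) - 0.5) >= observed:
            count += 1
    p = (count + 1) / (n_perm + 1)
    return p < alpha, p
```

      Here `a` and `b` would hold one value per trial for a single neuron, for example its mean z-scored activity within the analysis window for each of the two conditions.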

      (6) For data from Figure 2F it would be very useful to plot data from individual mice in addition to this aggregated representation.

      We now include data from individual mice in Supp. Table 1.

      (7) I think it would be helpful to move some parts of Figure S1 to the main Figure 1, in particular the table from S1A. 

      Fig. S1 is now part of the new Fig. 1.

      (8) Clarify whether Figure S2 is an independent replication, as implied, or whether the same test data is shown twice in two separate figures (In Figure 1b and Supplementary Figure 2).

      The test group in Fig. S2 (new Fig. S1) is the same as the test group in Fig. 1b (new Fig. 1e), but the control group is a separate cohort. This is now clarified in the figure legends.  

      (9) The authors should add a limitations section to the discussion where they specifically discuss the caveats involved in relating their results specifically to NE. This should include the possible involvement of co-transmitters and off-target expression of Cre in other populations.

      Thank you for this comment. Previous pharmacology and lesion studies showed that LC input or NE content in the mPFC was specifically required for EDS-type switching processes (Lapiz, M.D. et al., 2006; Tait, D.S. et al. 2007; McGaughy, J. et al. 2008), in light of which we interpret our mPFC neurophysiological effects with LC inhibition as at least partially mediated by the direct LC-NE input.  When discussing the limitations of our study, we now explicitly acknowledge the potential involvement of co-transmitters released by LC neurons (line 253-256).  

      (10) The authors should provide details about the TH antibody used for IHC.

      We now include more details in immunohistochemistry (line 384-388).

      (11) Throughout, it would be helpful to include datapoints from individual animals - these are included in some supplementary figures, but are missing in a number of the main plots.

      Reviewer #1 made a similar comment, and we now include individual data points in the figures (e.g., Fig. 1e, 1h).

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This study introduces a novel method for estimating spatial spectra from irregularly sampled intracranial EEG data, revealing cortical activity across all spatial frequencies, which supports the global and integrated nature of cortical dynamics. The study showcases important technical innovations and rigorous analyses, including tests to rule out potential confounds; however, the lack of comprehensive theoretical justification and assumptions about phase consistency across time points renders the strength of evidence incomplete. The dominance of low spatial frequencies in cortical phase dynamics continues to be of importance, and further elaboration on the interpretation and justification of the results would strengthen the link between evidence and conclusions.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The paper uses rigorous methods to determine phase dynamics from human cortical stereotactic EEGs. It finds that the power of the phase is higher at the lowest spatial phase.

      Strengths:

      Rigorous and advanced analysis methods.

      Weaknesses:

      The novelty and significance of the results are difficult to appreciate from the current version of the paper.

      (1) It is very difficult to understand which experiments were analysed, and from where they were taken, reading the abstract. This is a problem both for clarity with regard to the reader and for attribution of merit to the people who collected the data.

      We now explicitly state the experiments that were used, lines 715-716.

      (2) The finding that the power is higher at the lowest spatial phase seems in tune with a lot of previous studies. The novelty here is unclear and it should be elaborated better.

      It is not generally accepted in neuroscience that power is higher at the lowest spatial frequencies, and recent research concludes that traveling waves at this scale may be the result of artefactual measurement (Orczyk et al., 2022; Hindriks et al., 2014; Zhigalov & Jensen, 2023). The question we answer is therefore timely and a source of controversy to researchers analysing TWs in cortex. While, in our view, the previous literature points in the direction of our conclusions (notably the work of Freeman et al. 2003; 2000; Barrie et al. 1996), it is not conclusive at the scale we are interested in, specifically >8cm, and certainly not convincing to the proponents of ‘artefactual measurement’.

      We have added a sentence to make this explicit in the abstract, lines 20-22. Please also note previous text at the end of the introduction, lines 140-148 and in the first paragraph of the discussion, lines 563-569.

      I could not understand reading the paper the advantage I would have if I used such a technique on my data. I think that this should be clear to every reader.

      We have made the core part of the code available on GitHub (line 1154), which should simplify adoption of the technique. We have argued, in the Discussion (lines 653-663), that habitual measurement of SF spectra is desirable, since the same task measured with EEG, sEEG or ECoG does not encompass the same spatial scales, and researchers may be comparing signals with different functional properties. Until reliable methods for estimating SF are available that do not depend on the layout of the recording array, data cannot be analysed to resolve this question. Publication of our results and methods will help this process along.

      (3) It seems problematic to trust in a strong conclusion that they show low spatial frequency dynamics of up to 15-20 cm given the sparsity of the arrays. The authors seem to agree with this concern in the last paragraph of page 12. 

      The new surrogate testing supports our conclusions. The sEEG arrays would not normally be a first choice to estimate SF spectra, for reasons of their sparsity, which may be why such estimates have not been done before. Yet, this is the research challenge that we sought to solve, and a problem for which there was no ready method to hand. Nevertheless, it is a problem that urgently needed to be solved given the current debate on the origin of large-scale TWs. We have now included detailed surrogate testing of real data plus varying strength model waves (Figure 6A and Supplementary Figure 4). We believe this should convince the reader that we are measuring the spatial frequency spectrum with sufficient accuracy to answer the central research question.

      They also say that it would be informative to repeat the analyses presented here after the selection of more participants from all available datasets. It begs the question of why this was not done. It should be done if possible.

      We have now doubled the number of participants in the main analyses. Since each participant comprises a test of the central hypothesis, the hypothesis test now has 23 replications (Supplementary Figures 2 and 3). There were four failures to reach significance due to under-powered tests, i.e., not enough contacts. This is a sufficient test of the hypothesis and, in our opinion, the number of participants is not the primary obstacle to scientific acceptance of our results. The main obstacle is providing convincing tests that the method is accurate, and this is what we have focussed on. Publication of Python code and the detailed methods described here enable any interested researcher to extend our method to other datasets.

      (4) Some of the analyses seem not to exploit in full the power of the dataset. Usually, a figure starts with an example participant but then the analysis of the entire dataset is not as exhaustive. For example, in Figure 6 we have a first row with the single participants and then an average over participants. One would expect quantifications of results from each participant (i.e. from the top rows of Fig. 6) extracting some relevant features of results from each participant and then showing the distribution of these features across participants. This would complement the subject average analysis.

      The results are now clearly split into sections, where we first deal with all the single participant analyses, then the surrogate testing to confirm the basic results, then the participant aggregate results (Figure 7 and Supplementary Figure 7). The participant aggregate results reiterate the basic findings for the single participants. The key finding is straightforward (SF power decreases with SF) and required only one statistical analysis per subject.

      (5) The function of brain phase dynamics at different frequencies and scales has been examined in previous papers at frequencies and scales relevant to what the authors treat. The authors may want to be more extensive with citing relevant studies and elaborating on the implications for them. Some examples below:

      Womelsdorf T, et al. Science. 2007

      Besserve M et al. PloS Biology 2015

      Nauhaus I et al Nat Neurosci 2009

      We have added two paragraphs to the discussion, in response to the reviewer suggestion (lines 606-623). These paragraphs place our high TF findings in the context of previous research.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors analyze the organization of phases across different spatial scales. The authors analyze intracranial, stereo-electroencephalogram (sEEG) recordings from human clinical patients. The authors estimate the phase at each sEEG electrode at discrete temporal frequencies. They then use higher-order SVD (HOSVD) to estimate the spatial frequency spectrum of the organization of phase in a data-driven manner. Based on this analysis, the authors conclude that most of the variance explained is due to spatially extended organizations of phase, suggesting that the best description of brain activity in space and time is in fact a globally organized process. The authors' analysis is also able to rule out several important potential confounds for the analysis of spatiotemporal dynamics in EEG.

      Strengths:

      There are many strengths in the manuscript, including the authors' use of SVD to address the limitation of irregular sampling and their analyses ruling out potential confounds for these signals in the EEG.

      Weaknesses:

      Some important weaknesses are not properly acknowledged, and some conclusions are overinterpreted given the evidence presented.

      The central weakness is that the analyses estimate phase from all signal time points using wavelets with a narrow frequency band (see Methods - "Numerical methods"). This step makes the assumption that phase at a particular frequency band is meaningful at all times; however, this is not necessarily the case. Take, for example, the analysis in Figure 3, which focuses on a temporal frequency of 9.2 Hz. If we compare the corresponding wavelet to the raw sEEG signal across multiple points in time, this will look like an amplitude-modulated 9.2 Hz sinusoid to which the raw sEEG signal will not correspond at all. While the authors may argue that analyzing the spatial organization of phase across many temporal frequencies will provide insight into the system, there is no guarantee that the spatial organization of phase at many individual temporal frequencies converges to the correct description of the full sEEG signal. This is a critical point for the analysis because while this analysis of the spatial organization of phase could provide some interesting results, this analysis also requires a very strong assumption about oscillations, specifically that the phase at a particular frequency (e.g. 9.2 Hz in Figure 3, or 8.0 Hz in Figure 5) is meaningful at all points in time. If this is not true, then the foundation of the analysis may not be precisely clear. This has an impact on the results presented here, specifically where the authors assert that "phase measured at a single contact in the grey matter is more strongly a function of global phase organization than local". Finally, the phase examples given in Supplementary Figure 5 are not strongly convincing to support this point.

      “using wavelets with a narrow frequency band … this analysis also requires a very strong assumption about oscillations, specifically that the phase at a particular frequency (e.g. 9.2 Hz in Figure 3, or 8.0 Hz in Figure 5) is meaningful at all points in time”

      Our method uses very short time-window Morlet wavelets to avoid the assumption of oscillations, i.e., long-lasting sinusoids in the signal, in the sense of sinusoidal waveforms, or limit cycles extending in time. Cortical TWs may last only one or two cycles (Alexander et al., 2006), requiring methods that are compact in the time domain to avoid underreporting the desired phenomena. Additionally, the short time-window Morlet wavelets have low frequency resolution, so they are robust with respect to shifts in frequency between sites. We now discuss this issue explicitly in the Methods (lines 658-674). This means the phase estimation methods used in the manuscript do not have the problem of assuming narrow-band oscillations in the signal. The methods are also robust to the exact shape of the waveforms; the signal need only be approximately sinusoidal, that is, rise and fall. This means the Fourier variant we use does not introduce the ringing artefacts that can be introduced by longer time-series methods, such as FFT.
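
      The trade-off described above (short wavelets are compact in time and therefore broad in frequency) can be illustrated with a short NumPy sketch. It is not the authors' implementation: the definition of "cycles" used here (Gaussian SD in time equal to n_cycles / (2πf)), the truncation at four SDs, and the function name `morlet_phase` are assumptions, and conventions for parameterizing Morlet wavelets differ between toolboxes.

```python
import numpy as np

def morlet_phase(signal, fs, freq, n_cycles=2.0):
    """Phase of `signal` at `freq` Hz from a short complex Morlet wavelet.

    Assumed convention: the Gaussian envelope has SD
    sigma_t = n_cycles / (2 * pi * freq) seconds.
    """
    sigma_t = n_cycles / (2.0 * np.pi * freq)
    half = int(np.ceil(4.0 * sigma_t * fs))        # truncate at +/- 4 SD
    t = np.arange(-half, half + 1) / fs
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
    wavelet /= np.abs(wavelet).sum()               # scale only; phase is unaffected
    analytic = np.convolve(signal, wavelet, mode="same")
    return np.angle(analytic)

fs, freq = 1000.0, 10.0
t = np.arange(0.0, 2.0, 1.0 / fs)
phase = morlet_phase(np.cos(2.0 * np.pi * freq * t), fs, freq)
```

      Under this convention the spectral SD of the wavelet is freq / n_cycles, so a two-cycle wavelet centred on 10 Hz has a spectral SD of 5 Hz. That breadth is what makes the phase estimate tolerant of frequency shifts between sites.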

      “This step makes the assumption that phase at a particular frequency band is meaningful at all times”

      This important consideration is entrenched in our choice of methods. By way of explanatory background, we point out that this step is not the final step. Aggregation methods can be used to distinguish between signal and noise. In the simple case, event-locked time-series of phase can be averaged. This would allow consistent (non-noise) phase relations to be preserved, while the inconsistent (including noise) phase relations would be washed out. This is part of the logic behind all such aggregation procedures, e.g., phase-locking, coherence. SVD has the advantage of capturing consistent relations in this sense, but without loss of information as occurs in averaging (up to the choice of number of singular vectors in the final model). Specifically, maps of the spatial covariances in phase are captured in the order of the variance explained. Noise (in the sense conveyed by the reviewer) in the phase measurements will not contribute to highest rank singular vectors. SVD is commonly used to remove noise, and that is one of its purposes here. This point can be seen by considering the very smooth singular vectors derived from MEG (Figure 3F) in this new version of the manuscript. These maps of phase gradients pull out only the non-noisy relations, even as their weighted sums reproduce any individual sample to any desired accuracy.

      To summarize, the next step (of incorporating the phase measure into the SVD) neatly bypasses the issue of non-meaningful phase quantification. This is one of the reasons why we do not undertake the spatial frequency estimates on the raw matrices of estimated phase.
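
      A toy example may make the aggregation argument concrete. The sketch below builds a samples-by-contacts matrix of unit phasors containing one consistent spatial phase gradient plus independent phase noise, then applies an ordinary complex SVD. It is an illustration under stated assumptions, not the manuscript's procedure: the synthetic data, the one-dimensional contact layout, and the function name `phase_svd` are hypothetical.

```python
import numpy as np

def phase_svd(phasors, n_keep):
    """SVD of a (samples x contacts) matrix of unit phasors exp(i * phase)."""
    u, s, vh = np.linalg.svd(phasors, full_matrices=False)
    explained = s**2 / np.sum(s**2)                 # variance explained per component
    low_rank = (u[:, :n_keep] * s[:n_keep]) @ vh[:n_keep]
    return vh[:n_keep], explained, low_rank

rng = np.random.default_rng(0)
n_samples, n_contacts = 500, 40
x = np.linspace(0.0, 1.0, n_contacts)
gradient = 2.0 * np.pi * x                          # one spatial cycle across the array
offsets = rng.uniform(0.0, 2.0 * np.pi, n_samples)  # arbitrary global phase per sample
noise = rng.normal(0.0, 0.5, (n_samples, n_contacts))
phasors = np.exp(1j * (offsets[:, None] + gradient[None, :] + noise))

maps, explained, low_rank = phase_svd(phasors, n_keep=1)
```

      In this toy case the first spatial map matches the embedded gradient up to a global phase rotation, while the inconsistent phase noise is spread thinly over the remaining components. Keeping all components reproduces every sample exactly.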

      We now include a new sub-paragraph on this topic in the methods, lines 831-838.

      In addition, we have reworded the first description of the methods with a new paragraph at the end of the introduction, which better balances the description of the steps involved. The two sentences (lines 162-166) highlight the issue of concern to the reviewer.

      “there is no guarantee that the spatial organization of phase at many individual temporal frequencies converges to the correct description of the full sEEG signal.”

      The correct description of the full sEEG signal is beyond the scope of the present research. Our main goal, as stated, is to show that the hypothesis that ‘extra-cranial measurements of TWs are the result of projection from localized activity’ is not supported by the evidence of spatial patterns of activity in the cortex. Since this activity can be accessed as a single frequency band (especially if localized sources create the large-scale patterns), analysis of SF on a TF-by-TF basis is sufficient.

      “This has an impact on the results presented here, specifically where the authors assert that "phase measured at a single contact in the grey matter is more strongly a function of global phase organization than local".”

      We agree with the reviewer, even though we expect that the strongest influences on local phase are due to other cortical signals in the same band. The implicit assumption of the focus on bands of the same temporal frequency is now made explicit in the abstract (lines 31-34).

      A sentence addressing this issue has been added to the first paragraph of the discussion (lines 579-582).

      Inclusion of cross-frequency interactions would likely require a highly regular measurement array over the scales of interest here, i.e., the noise levels inherent in the spatial organization of sEEG contacts would not support such analyses.

      “Finally, the phase examples given in Supplementary Figure 5 are not strongly convincing to support this point.”

      We have removed the phase examples that were previously in Supplementary Figure 5 (and Figure 5 in the previous version of the main text), since further surrogate testing and modelling (Supplementary Figure 11) shows the LSVs from irregular arrays will inevitably capture mixtures of low and high SF signals. The final section of the Methods explains this effect in some detail. Instead, the new version of the manuscript relies on new surrogate testing to validate our methods.

      Another weakness is in the discussion on spatial scale. In the analyses, the authors separate contributions at (approximately) > 15 cm as macroscopic and < 15 cm as mesoscopic. The problem with the "macroscopic" here is that 15 cm is essentially on the scale of the whole brain, without accounting for the fact that organization in sub-systems may occur. For example, if a specific set of cortical regions, spanning over a 10 cm range, were to exhibit a consistent organization of phase at a particular temporal frequency (required by the analysis technique, as noted above), it is not clear why that would not be considered a "macroscopic" organization of phase, since it comprises multiple areas of the brain acting in coordination. Further, while this point could be considered as mostly semantic in nature, there is also an important technical consideration here: would spatial phase organizations occurring in varying subsets of electrodes and with somewhat variable temporal frequency reliably be detected? If this is not the case, then could it be possible that the lowest spatial frequencies are detected more often simply because it would be difficult to detect variable organizations in subsets of electrodes?

      The motivation for our study was to show that large-scale TWs measured outside the cortex cannot be the result of more localized activity being ‘projected up’. In this case, the temporal frequency of the artefactual waves would be the same as the localized sources, so the criticism does not apply.

      “while this point could be considered as mostly semantic in nature”

      We have changed the terminology in the paper to better coincide with standard usage. Macroscopic now refers to >1cm, while we refer to >8cm as large-scale.

      “15 cm is essentially on the scale of the whole brain, without accounting for the fact that organization in sub-systems may occur.”

      We can assume that subtle frequency variation (e.g., within an alpha phase binding) is greatest at the largest scales of cortex, or at least not less varying than measurements within regions. This means that not considering frequency-drift effects will not inflate low spatial frequency power over high spatial frequency power. Even so, the power spectrum we estimated is approximately 1/SF, so that unmeasured cross-frequency effects in binding (causal influences on local phase) would have to overcome the strength of this relation for this criticism to apply, which seems unlikely.

      “would spatial phase organizations occurring in varying subsets of electrodes and with somewhat variable temporal frequency reliably be detected?”

      See our previous comments about the low temporal frequency resolution of two-cycle Morlet wavelets. The answer is yes, up to the range approximated by the half-power bandwidth, which is large in the case of this method (see lines 760-764).

      Another weakness is disregarding the potential spike waveform artifact in the sEEG signal in the context of these analyses. Specifically, Zanos et al. (J Neurophysiol, 2011) showed that spike waveform artifacts can contaminate electrode recordings down to approximately 60 Hz. This point is important to consider in the context of the manuscript's results on spatial organization at temporal frequencies up to 100 Hz. Because the spike waveform artifact might affect signal phase at frequencies above 60 Hz, caution may be important in interpreting this point as evidence that there is significant phase organization across the cortex at these temporal frequencies.

      We have now added a sentence on this issue to the discussion (lines 600-602).

      However, our reading of the Zanos et al. paper is that the low temporal frequency (60-100Hz) contribution of spikes and spike patterns is negligible compared to genuine post-synaptic membrane fluctuations (see their Figure 3). These considerations come more strongly into play when correlations between LFP and spikes are calculated or spike triggered averaging is undertaken, since then a signal is being partly correlated with itself, or, partly averaged over the supposedly distinct signal with which it was detected.

      A last point is that, even though the present results provide some insight into the organization of phase across the human brain, the analyses do not directly link this to spiking activity. The predictive power that these spatial organizations of phase could provide for spiking activity - even if the analyses were not affected by the distortion due to the narrow-frequency assumption - remains unknown. This is important because relating back to spiking activity is the key factor in assessing whether these specific analyses of phase can provide insight into neural circuit dynamics. This type of analysis may be possible to do with the sEEG recordings, as well, by analyzing high-gamma power (Ray and Maunsell, PLoS Biology, 2011), which can provide an index of multi-unit spiking activity around the electrodes.

      “even if the analyses were not affected by the distortion due to the narrow-frequency assumption”

      See our earlier comment about narrow TFs; this is not the case in the present work.

      The spiking activity analysis would be an interesting avenue for future research. It appears the 1000Hz sampling frequency in the present data is not sufficient for the method described in Ray & Maunsell (2011). On a related topic, we have shown that large-scale traveling waves in the MEG and 8cm waves in ECoG can both be used to predict future localized phase at a single sensor/contact, two cycles into the future (Alexander et al., 2019). This approach could be used to predict spiking activity, by combining it with the reviewer’s suggestion. However, the current manuscript is motivated by the argument that measured large-scale extra-cranial TWs are merely projections of localized cortical activity. Since spikes do not arise in this argument, we feel it is outside the scope of the present research. We have added this suggestion to the discussion as a potential line of future research (lines 686-688).

      Reviewer #3 (Public review):

      Summary:

      The authors propose a method for estimation of the spatial spectra of cortical activity from irregularly sampled data and apply it to publicly available intracranial EEG data from human patients during a delayed free recall task. The authors' main findings are that the spatial spectra of cortical activity peak at low spatial frequencies and decrease with increasing spatial frequency. This is observed over a broad range of temporal frequencies (2-100 Hz).

      Strengths:

      A strength of the study is the type of data that is used. As pointed out by the authors, spatial spectra of cortical activity are difficult to estimate from non-invasive measurements (EEG and MEG) due to signal mixing and from commonly used intracranial measurements (i.e. electrocorticography or Utah arrays) due to their limited spatial extent. In contrast, iEEG measurements are easier to interpret than EEG/MEG measurements and typically have larger spatial coverage than Utah arrays. However, iEEG is irregularly sampled within the three-dimensional brain volume and this poses a methodological problem that the proposed method aims to address.

      Weaknesses:

      The used method for estimating spatial spectra from irregularly sampled data is weak in several respects.

      First, the proposed method is ad hoc, whereas there exist well-developed (Fourier-based) methods for this. The authors don't clarify why no standard methods are used, nor do they carry out a comparative evaluation.

      We disagree that the method is ad hoc, though the specific combination of SVD and multiscale differencing is novel in its application to sEEG. The SVD method has been used to isolate both ~30cm TWs in MEG and EEG (Alexander et al., 2013; 2016), as well as 8cm waves in ECoG (Alexander et al., 2013; 2019). Our opening examples in the Results now reiterate these previous related findings, by way of an example analysis of MEG data (Figure 3). This will better inform the reader on the extent of continuity of the method from previous research.

      Standard FFT has been used after interpolating between EEG electrodes to produce a uniform array (Alamia et al., 2023). There exist well-developed Fourier methods for nonuniform grids, such as simple interpolation, the butterfly algorithm, wavefield extrapolation and multi-scale vector field techniques. However, the problems for which these methods are designed require non-sparse sampling or less irregular arrays. The sEEG contacts (reduced in number to grey matter contacts) are well outside the spatial irregularity range of any Fourier-related methods that we are aware of, particularly at the broad range of spatial scales of interest here (2cm up to 24cm). This would make direct comparison of these specialized Fourier methods to our novel methods, in the sEEG, something of a straw-man comparison.

      We now include a summary paragraph in the introduction, which is a brief review of Fourier methods designed to deal with non-uniform sampling (lines 159-162).

      Second, the proposed method lacks a theoretical foundation and hinges on a qualitative resemblance between Fourier analysis and singular value decomposition.

      We have improved our description of the theoretical relation between Fourier analysis and SVD (additional material at lines 839-861 and 910-922). In fact, there are very strong links between the two methods, and now it should be clearer that our method does not rely on a mere qualitative resemblance.

      Third, the proposed method is not thoroughly tested using simulated data. Hence it remains unclear how accurate the estimated power spectra actually are.

      We now include a new surrogate testing procedure, which takes as inputs the empirical data and a model signal (of known spatial frequency) in various proportions. Thus, we test both the impact of a small amount of surrogate signal on the empirical signal, and the impact of ‘noise’ (in the form of a small amount of empirical signal) added to the well-defined surrogate signal.
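
      The mixing step of such a surrogate test can be sketched as below. This is only an illustration of the idea of blending data with a model wave of known spatial frequency; it is not the authors' procedure. The linear blend of unit phasors followed by renormalization, the one-dimensional contact positions, the random stand-in for empirical data, and the names `model_wave`, `mix_surrogate` and `similarity` are all assumptions.

```python
import numpy as np

def model_wave(positions_cm, sf_cycles_per_cm, offsets):
    """Unit phasors of a plane wave with a known spatial frequency."""
    phase = 2.0 * np.pi * sf_cycles_per_cm * positions_cm[None, :] + offsets[:, None]
    return np.exp(1j * phase)

def mix_surrogate(empirical, model, weight):
    """Blend empirical and model phasors, then project back to unit magnitude."""
    mixed = (1.0 - weight) * empirical + weight * model
    mag = np.abs(mixed)
    mag[mag == 0] = 1.0                               # guard against exact cancellation
    return mixed / mag

def similarity(a, b):
    """Magnitude of the mean phase-difference phasor (1 = identical phase maps)."""
    return float(np.abs(np.mean(a * np.conj(b))))

rng = np.random.default_rng(1)
positions_cm = np.sort(rng.uniform(0.0, 20.0, 30))    # hypothetical irregular contacts
offsets = rng.uniform(0.0, 2.0 * np.pi, 200)          # one phase offset per sample
model = model_wave(positions_cm, 0.05, offsets)       # 0.05 cycles/cm, i.e. 20 cm wavelength
empirical = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, model.shape))  # stand-in for real data

for weight in (0.0, 0.1, 0.5, 0.9, 1.0):
    mixed = mix_surrogate(empirical, model, weight)
    print(weight, round(similarity(mixed, model), 3))
```

      In a full test, each mixed matrix would then be passed through the spatial frequency estimation itself, so that the recovered spectrum can be compared with the known spatial frequency of the model wave at each mixing proportion.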

      In addition, there are a number of technical issues and limitations that need to be addressed or clarified (see recommendations to the authors).

      My assessment is that the conclusions are not completely supported by the analyses. What would convince me, is if the method is tested on simulated cortical activity in a more realistic set-up. I do believe, however, that if the authors can convincingly show that the estimated spatial spectra are accurate, the study will have an impact on the field. Regarding the methodology, I don't think that it will become a standard method in the field due to its ad hoc nature and well-developed alternatives.

      Simulations of cortical activity do not seem the most direct way to achieve this goal. The first author has published in this area (Liley et al., 1999; Wright et al., 2001), and such simulations, for both bulk and neuronally based simulations, readily display traveling wave activity at low spatial frequencies (indeed, this was the origin of the present scientific journey). The manuscript outlines these results in the introduction, as well as theoretical treatments proposing the same. Several other recent studies have highlighted the appearance of large-scale travelling waves using connectome-based models (https://www.biorxiv.org/content/10.1101/2025.07.05.663278v1; https://www.nature.com/articles/s41467-024-47860-x), which we do not include in the manuscript for reasons of brevity. In short, the emergence of TW phenomena in models is partly a function of the assumptions put into them (i.e., spatial damping, boundary conditions, parameterization of connection fields) and would therefore be inconclusive in our view.

      Instead, we rely on the advantages provided by the way our central research question has been posed: that the spatial frequency distribution of grey matter signal can determine whether extra-cranial TWs are artefactual. The newly introduced surrogate methods reflect this advantage by directly adding ground truth spatial frequency components to individual sample measurements. This is a less expensive option than making cortical simulations to achieve the same goal.

      For the same reasons, we include testing of the methods using real cortical signals with MEG arrays (for which we could test the effects of increasing sparseness of contacts, test the effects of average referencing, and also construct surrogate time-series with alternative spectra).

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      Major points

      Methods, Page 18: "... using notch filters to remove the 50Hz line signal and its harmonics ...": The sEEG data appear to have been recorded in North America, where the line frequency is 60 Hz. Is this perhaps a typo, or was a 50 Hz notch filter in fact applied here (which would be a mistake)?

      This has now been fixed in the text to read 60Hz. This is the notch filter that was applied.

      Minor points

      (1) While the authors do state that they are analyzing the "spatial frequency spectrum of phase dynamics" in the abstract, this could be more clearly emphasized. Specifically, the difference between signal power at different spatial frequencies (as analyzed by a standard Fourier analysis) and the organization of phase in space (as done here) could be more clearly distinguished.

      We now address this point explicitly on lines 167-172. We now include at the end of the results additional analyses where the TF power is included. This means that the effects of including signal power at different temporal frequencies can be directly compared to our main analysis of the SF spectrum of the phase dynamics.

      (2) Figure 1A-C: It was not immediately clear what the lengths provided in these panels (e.g."> 40 cm cortex", "< 10 cm", "< 30 cm") were meant to indicate. This could be made clearer.

      Now fixed in the caption.

      (3) Figure 2A: If this is surrogate data to explain the analysis technique, it would be helpful to note explicitly at this point.

      This Figure has been completely reworked, and now the status of the examples (from illustrative toy models to actual MEG data) should be clearer.

      (4) Figure 4A: Why change from "% explained variance" for the example data in Figure 2C to arbitrary units at this point?

      This has now been explicitly stated in the methods (lines 1033-1036).

      (5) Page 15: "This means either the results were biased by a low pass filter, or had a maximum measurable...": If the authors mean that the low-pass filter is due to spatial blurring of neural activity in the EEG signal, it would be helpful to state that more directly at this point.

      Now stated directly, lines 567-568.

      (6) Page 23: "...where |X| is the complex magnitude of X...": The modulus operation is defined on a complex number, yet here is applied to a vector of complex numbers. If the operation is elementwise, it should be defined explicitly.

      ‘Elementwise’ is now stated explicitly (line 1020).

      Reviewer #3 (Recommendations for the authors):

      In the submitted manuscript, the authors propose a method to estimate spatial (phase) spectra from irregularly sampled oscillatory cortical activity. They apply the method to intracranial (iEEG) data and argue that cortical activity is organized into global waves up to the size of the entire cortex. If true, this finding is certainly of interest, and I can imagine that it has profound implications for how we think about the functional organization of cortical activity.

      We have added a section to the discussion outlining the most radical of these implications: what does it mean to do source localization when non-local signals dominate? Lines 670-681.

      The manuscript is well-written, with comprehensive introduction and discussion sections, detailed descriptions of the results, and clear figures. However, the proposed method comprised several ad hoc elements and is not well-founded mathematically, its performance is not adequately assessed, and its limitations are not sufficiently discussed. As such, the study failed to convince (me) of the correctness of the main conclusions.

      We now include direct surrogate testing of the method. We have also improved the mathematical explanation to show that the link between Fourier analysis and SVD is not ad hoc, but well understood in both literatures. We have explicitly addressed in the text all of the limitations raised by the reviewers.

      Major comments

      (1) The main methodological contribution of the study is summarized in the introduction section:

      "The irregular sampling of cortical spatial coordinates via stereotactic EEG was partly overcome by the resampling of the phase data into triplets corresponding to the vertices of approximately equilateral triangles within the cortical sheet."

      There exist well-established Fourier methods for handling irregularly sampled data so it is unclear why the authors did not resort to these and instead proposed a rather ad hoc method without theoretical justification (see next comment).

      We have re-reviewed the literature on non-uniform Fourier analysis. We now briefly review the Fourier methods for handling irregularly sampled data (lines 155-162) and conclude that none of the existing methods can deal with the degree of irregularity, and especially sparsity, found for the grey-matter sEEG contacts.

      (2) In the Appendix, the authors write:

      "For appropriate signals, i.e., those with power that decreases monotonically with frequency, each of the first few singular vectors, v_k, is an approximate complex sinusoid with wavenumber equal to k."

      I don't think this is true in general and if it is, there must be a formal argument that proves it. Furthermore, is it also true for irregularly sampled data? And in more than one spatial dimension? Moreover, it is also unclear exactly how the spatial Fourier spectrum is estimated from the SVD.

      In response to these reviewer queries, we now spend considerably more time in the conceptual set-up of the manuscript, giving examples of where SVD can be used to estimate the Fourier spectrum. We have now unpacked the word ‘appropriate’ and we are now more exact in our phrasing. This is laid out in lines 843-850 of the manuscript. In addition, the methods now describe the mathematical links between Fourier analysis and SVD (lines 851-861 and 910-922).

      The authors write:

      "The spatial frequency spectrum can therefore be estimated using SVD by summing over the singular values assigned to each set of singular vectors with unique (or by binning over a limited range of) spatial frequencies. This procedure is illustrated in Figure 1A-C."

      First, the singular vectors are ordered to decreasing values of the corresponding singular values. Hence, if the singular values are used to estimate spectral power, the estimated spectrum will necessarily decrease with increasing spatial frequency (as can be seen in Figure 2C). Then how can traveling waves be detected by looking for local maxima of the estimated power spectra?

      TWs are not detected by looking for local maxima in the spectra. Our work has focussed on the global wave maps derived from the SVD of phase (i.e., k=1-3), which also explain most of the variance in phase. This is now mentioned in the caption to Figure 3 (lines 291-294).

      Second, how are spatial frequencies assigned to the different singular vectors? The proposed method for estimating spatial power spectra from irregularly sampled data seems rather ad hoc and it is not at all clear if, and under what conditions, it works and how accurate it is.

      The new version of the manuscript uses a combination of the method previously presented (the multi-scale differencing) and the method previously outlined in the supplementary materials (doing complex-valued SVD on the spatial vectors of phase). We hope that along with the additional expository material in the methods the new version is clearer and seems less ad hoc to the reviewer. Certainly, there are deep and well-understood links between Fourier analysis and SVD, and we hope we have brought these into focus now.

      (3) The authors define spatial power spectra in three-dimensional Euclidean space, whereas the actual cortical activity occurs on a two-dimensional sheet (the union of two topological 2-spheres). As such, it is not at all clear how the estimated wavelengths in three-dimensional space relate to the actual wavelengths of the cortical activity.

      We define spatial power spectra on the folded cortical sheet, rather than Cartesian coordinates. We use geodesic distances in all cases where a distance measurement is required. We have included two new figures (Figure 5 and Supplementary Figure1) showing the mapping of the triangles onto the cortical sheet, which should bring this point home.

      (4) The authors' analysis of the iEEG data is subject to a caveat that is not mentioned in the manuscript: As a reference for the local field potentials, the average white-matter signal was used and this can lead to artifactual power at low spatial frequencies. This is because fluctuations in the reference signal are visible as standing waves in the recording array. This might also explain the observation that

      "A surprising finding was that the shape of the spatial frequency spectrum did not vary much with temporal frequency."

      because fluctuations in the reference signal are expected to have power at all temporal frequencies (1/f spectrum). When superposed with local activity at the recording electrodes, this leads to spurious power at low spatial frequencies. Can the authors exclude this interpretation of the results?

      The new version of the manuscript deals explicitly with this potential confound (lines 454-467). First, the artefactual global synchrony due to the reference signal (the DC component in our spatial frequency spectra of phase) is at a distinct frequency from the lowest SF of interest here. The lowest spatial frequency is a function of the maximum spatial range of the recording array and not overlapping in our method with the DC component, despite the loss of SF resolution due to the noise of the spatial irregularity of the recording array. This can be seen from consideration of the SF tuning (Figure 4) for the MEG wave maps shown in Figure 3, and the spectra generated for sparse MEG arrays in Supplementary Figure 5. Additionally, this question led us to a series of surrogate tests which are now included in the manuscript. We used MEG to test for the effects of average reference, since in this modality the reference-free case is available. The results show that even after imposing a strong and artefactual global synchrony, the method is highly robust to inflation of the DC component, which either way does not strongly influence the SF estimates in the range of interest (4c/m to 12c/m for the case of MEG).

      (5) Related to the previous comment: Contrary to the authors' claims, local field potentials are susceptible to volume conduction, particularly when average references are used (see e.g. https://www.cell.com/neuron/fulltext/S0896-6273(11)00883-X)

      Methods exist to mitigate these effects (e.g. taking first- or second-order spatial differences of the signals). I think this issue deserves to be discussed.

      We have reviewed this research and do not find it to be a problem. The authors cited by the reviewer were concerned with unacknowledged volume conduction up to 1 cm for LFP. The maximum spatial frequency we report here is 50c/m, equivalent to a wavelength of 2cm. While the intercontact distance on the sEEG electrodes was 0.5cm, in practice the smallest equilateral triangles (i.e., between two electrodes) to be found in the grey matter were around 2cm in linear size. We make no statements about SF in the 1cm range. We do now cite this paper and mention this short-range volume conduction (lines 602-605). The method of taking derivatives has the same problems as source localization methods: both remove artefactual correlations (volume conduction) as well as real correlations (the low SF interactions of interest here). We mention this now at lines 667-669. In addition, our method to remove negative SF components from the LSVs ameliorates the effects of average referencing. There are now more details in the Methods about this step (lines 924-947), as well as a new supplementary figure illustrating its effects on signal with a known SF spectrum (MEG, supplementary Figure 6).
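
      The spatial frequencies quoted in this reply can be converted to wavelengths with a one-line calculation. The sketch below is only a units check; the function name is hypothetical and not taken from the manuscript.

```python
def wavelength_cm(cycles_per_metre: float) -> float:
    """Convert a spatial frequency in cycles/metre to a wavelength in centimetres."""
    if cycles_per_metre <= 0:
        raise ValueError("spatial frequency must be positive")
    return 100.0 / cycles_per_metre


# 50 c/m is the maximum spatial frequency quoted for sEEG;
# 4 c/m and 12 c/m bound the MEG range of interest quoted earlier.
for sf in (50, 12, 4):
    print(sf, "c/m ->", round(wavelength_cm(sf), 2), "cm")
```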

      (6) Could the authors add an analysis that excludes the possibility that the observed local maxima in the spectra are a necessary consequence of the analysis method, rather than reflecting true maxima in the spectra? A (possibly) similar effect can be observed in ordinary Fourier spectra that are estimated from zero-mean signals: Because the signals have zero mean, the power spectrum at frequency zero is close to zero and this leads to an artificial local maximum at low frequencies.

      We acknowledge the reviewer’s mathematical point. We do not agree that it could be an issue, though it is important to rule it out definitively. First, removing the DC component will only produce an artefactual low SF peak if the power at low SF is high. This may occur in the reviewer’s example only because temporal frequency has a ~1/f spectrum. If the true spectrum is flat, or increasing in power with f, no such artificial low SF will be produced (see Supplementary Figure 5G). Additionally,

      (1) The DC component is well separated from the low SF components in our method;

      (2) We now include several surrogate methods which show that our method finds the correct spectral distribution and is not just finding a maximum at low SFs due to the suggested effect (subtraction of the DC component). Analysis of separated wave maps in MEG (Figures 3 & 4) shows the expected peaks in SF, increasing in peak SF for each family of maps when wavenumber increases (roughly three k=1 maps, three k=2 etc.). A specific surrogate test for this query was also undertaken by creating a reverse SF spectrum in MEG phase data, in which the spectrum goes linearly with f over the SF range of interest, rather than the usual 1/f. Our method correctly finds the former spectrum (Supplementary Figure 5). Additionally, we tested for the effects of introducing the average reference and the effects of our method to remove the DC component of the phase SF spectrum (Supplementary Figure 6). We can definitively rule out the reviewer’s concern.
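
      The mathematical point under discussion, that zeroing the DC component yields a low-frequency peak only when power falls with frequency, can be illustrated with a toy sketch. The spectra below are made-up numbers, not the authors' data, and the function name is hypothetical.

```python
def low_frequency_peak_after_dc_removal(power):
    """Return True if zeroing the DC bin leaves a local maximum at the first bin.

    `power` is a list of spectral power values indexed by frequency bin,
    with power[0] the DC (zero-frequency) component.
    """
    spectrum = list(power)
    spectrum[0] = 0.0  # removing the mean sets the DC power to zero
    return spectrum[1] > spectrum[0] and spectrum[1] > spectrum[2]


bins = range(1, 9)
one_over_f = [10.0] + [1.0 / k for k in bins]  # power falls with frequency
flat = [10.0] + [1.0 for _ in bins]            # equal power in every bin
rising = [10.0] + [float(k) for k in bins]     # power grows with frequency

for name, spectrum in (("1/f", one_over_f), ("flat", flat), ("rising", rising)):
    print(name, low_frequency_peak_after_dc_removal(spectrum))
```

      Only the 1/f case shows a peak at the first bin, which matches the argument that the artefact requires high power at low frequencies.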

      A related issue (perhaps) is the observation that the location of the maximum (i.e. the peak spatial frequency of cortical activity) depends on array size: If cortical activity indeed has a characteristic wavelength (in the sense of its spectrum having a local maximum) would one not expect it to be independent of array size?

      This is only true when making estimates for relatively clean sinusoidal signals, and not from broad-band signals. Fourier analysis and our related SVD methods are very much dependent on maximum array size used to measure cortical signals. This is why the first frequency band (after the DC component) in Fourier analysis is always at a frequency equivalent to 1/array_size, even if the signal is known to contain lower frequency components. We now include a further illustration of this in Figure 3, a more detailed exposition of this point in the methods, and in Supplementary Figure 11 we provide a more detailed example of the relation between Fourier analysis and SVD when grids with two distinct scales are used.

      In short, it is not possible, mathematically, to measure wavelengths greater than the array size in broad-band data. This is now stated explicitly in the manuscript (lines 143-144). A common approach in Neuroscience research is to first do narrowband filtering, then use a method that can accurately estimate ‘instantaneous’ phase change, such as the Hilbert transform. This is not possible for highly irregular sEEG arrays.
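
      The dependence of the first Fourier band on array size can be checked with a short sketch. It assumes a regularly spaced one-dimensional array, which is simpler than the irregular sEEG case, and takes the array size to be the sampled extent (number of samples times spacing). The function name and the numbers are illustrative only.

```python
def dft_bin_frequencies(n_samples: int, spacing_m: float):
    """Spatial frequencies (cycles/metre) of the non-negative DFT bins
    for n_samples points spaced spacing_m metres apart."""
    array_size_m = n_samples * spacing_m
    return [k / array_size_m for k in range(n_samples // 2 + 1)]


small = dft_bin_frequencies(n_samples=10, spacing_m=0.02)  # 0.2 m extent
large = dft_bin_frequencies(n_samples=20, spacing_m=0.02)  # 0.4 m extent

# The first bin after DC sits at 1/array_size, so it moves when the array grows.
print(small[1], large[1])
```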

      (7) The proposed method of estimating wavelength from irregularly sampled three-dimensional iEEG data involves several steps (phase-extraction, singular value decomposition, triangle definition, dimension reduction, etc.) and it is not at all clear that the concatenation of all these steps actually yields accurate estimates.

      Did the authors use more realistic simulations of cortical activity (i.e. on the convoluted cortical sheet) to verify that the method indeed yields accurate estimates of phase spectra?

      We now include detailed surrogate testing, in which varying combinations of sEEG phase data and veridical surrogate wavelengths are added together.

      See our reply to the public review comments. We assess that real neurophysiological data (here, sEEG plus surrogate and MEG manipulated in various ways) is a more accurate way to address these issues. In our experience, large scale TWs appear spontaneously in realistic cortical simulations, and we now cite the relevant papers in the manuscript (line 53).

      Minor comments

      (1) Perhaps move the first paragraph of the results section to the Introduction (it does not describe any results).

      So moved.

      (2) The authors write:

      "The stereotactic EEG contacts in the grey matter were re-referenced using the average of low-amplitude white matter contacts"

      Does this mean that the average is taken over a subset of white-matter contacts (namely those with low amplitude)? Or do the authors refer to all white-matter contacts as "low-amplitude"? And did contacts on different needles have different references? Or were the contacts from all needles pooled?

      A subset of white-matter contacts was used for re-referencing, namely those 50% with lowest amplitude signals. This subset was used to construct a pooled, single, average reference. We have rephrased the sentences referring to this procedure to improve clarity (lines 202 and 743-745).
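
      The re-referencing procedure described in this reply can be sketched as follows. This is a minimal illustration, not the authors' code: the use of RMS as the amplitude measure is an assumption (the response does not say how amplitude was quantified), and all function names and data are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square amplitude of one contact's time series (assumed measure)."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def pooled_white_matter_reference(white_matter, fraction=0.5):
    """Average, sample by sample, the lowest-amplitude fraction of white-matter contacts."""
    ranked = sorted(white_matter, key=rms)
    n_keep = max(1, int(len(ranked) * fraction))
    kept = ranked[:n_keep]
    n_samples = len(kept[0])
    return [sum(sig[t] for sig in kept) / n_keep for t in range(n_samples)]


def rereference(grey_matter, reference):
    """Subtract the single pooled reference from every grey-matter contact."""
    return [[x - r for x, r in zip(sig, reference)] for sig in grey_matter]


# Toy data: four white-matter contacts and two grey-matter contacts, three samples each.
white = [[1.0, -1.0, 1.0], [0.5, -0.5, 0.5], [4.0, -4.0, 4.0], [3.0, -3.0, 3.0]]
grey = [[2.0, 0.0, 2.0], [1.0, 1.0, 1.0]]

ref = pooled_white_matter_reference(white)
print(ref)
print(rereference(grey, ref))
```

      One reference is shared by all grey-matter contacts, regardless of which needle they sit on, which is the pooled case the reviewer asked about.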

    1. Reviewer #2 (Public review):

      I have completed a thorough review of this paper, which seeks to use the large datasets of species occurrences available through GBIF to estimate variation in how large numbers of plant and animal species are associated with urbanization throughout the world, describing what they call the "species urbanness distribution" or SUD. They explore how these SUDs differ between regions and different taxonomic levels. They then calculate a measure of urban tolerance and seek to explore whether organism size predicts variation in tolerance among species and across regions.

      The study is impressive in many respects. Over the course of several papers, Callaghan and coauthors have been leaders in using "big [biodiversity] data" to create metrics of how species' occurrence data are associated with urban environments, and in describing variation in urban tolerance among taxa and regions. This work has been creative, novel, and it has pushed the boundaries of understanding how urbanization affects a wide diversity of taxa. The current paper takes this to a new level by performing analyses on over 94000 observations from >30,000 species of plants and animals, across more than 370 plant and animal taxonomic families. All of these analyses were focused on answering two main questions:

      (1) What is the shape of species' urban tolerance distributions within regional communities?

      (2) Does body size consistently correlate with species' urban tolerance across taxonomic groups and biogeographic contexts?

      Overall, I think the questions are interesting and important, the size and scope of the data and analyses are impressive, and this paper has a potentially large contribution to make in pushing forward urban macroecology specifically and urban ecology and evolution more generally.

      Despite my enthusiasm for this paper and its potential impact, there are aspects that could be improved, and I believe the paper requires major revision.

      Some of these revisions ideally involve being clearer about the methodology or arguments being made. In other cases, I think their metrics of urban tolerance are flawed and need to be rethought and recalculated, and some of the conclusions are inaccurate. I hope the authors will address these comments carefully and thoroughly. I recognize that there is no obligation for authors to make revisions. However, revising the paper along the lines of the comments made below would increase the impact of the paper and its clarity to a broad readership.

      Major Comments:

      (1) Subrealms

      Where does the concept of "subrealms" come from? No citation is given, and it could be said that this sounds like an idea straight out of Middle Earth. How do subrealms relate to known bioclimatic designations like Köppen Climate classifications, which would arguably be more appropriate? Or are subrealms more socio-ecologically oriented? From what I can tell, each subrealm lumps together climatically diverse areas. It might be better and more tractable to break things in terms of continents, as the rationale for subrealms is unclear, and it makes the analyses and results more confusing. The authors rationalized the use of subrealms to account for potential intraspecific differences in species' response to urbanization, but that is never a core part of the questions or interpretation in the paper, and averaging across subrealms also accounts for intraspecific variation. Another issue with using the subrealm approach is that the authors only included a species if it had 100 observations in a given subrealm, leading to a focus on only the most common species, which may be biased in their SUD distribution. How many more species would be included if they did their analysis at the continental or global scale, and would this change the shape of SUDs?

      (2) Methods - urban score

      The authors describe their "urban score" as being calculated as "the mean of the distribution of VIIRS values as a relative species specific measure of a response to urban land cover."

      I don't understand how this is a "relative species-specific measure". What is it relative to? Figures S4 and S5 show the mean distribution of VIIRS for various taxa, and this mean looks to be an absolute measure. Mean VIIRS for a given species would be fine and appropriate as an "urban score", but the authors then state in the next sentence: "this urban score represents the relative ranking of that species to other species in response to urban land cover".

      That doesn't follow from the description of how this is calculated. Something is missing here. Please clarify and add an explicit equation for how the urban score is calculated because the text is unclear and confusing.

      (3) Methods - urban tolerance

      How the authors are defining and calculating tolerance is unclear, confusing, and flawed in my opinion.

      Tolerance is a common concept in ecology, evolution, and physiology, typically defined as the ability for an organism to maintain some measure of performance (e.g., fitness, growth, physiological homeostasis) in the presence versus absence of some stressor. As one example, in the herbivory literature, tolerance is often measured as the absolute or relative difference in fitness of plants that are damaged versus undamaged (e.g., https://academic.oup.com/evolut/article/62/9/2429/6853425?login=true).

      On line 309, after describing the calculation of urban scores across subrealms, they write: "Therefore, a species could be represented across multiple subrealms with differing measures of urban tolerance (Fig. S4). Importantly, this continuous metric of urban tolerance is a relative measure of a species' preference, or affinity, to urban areas: it should be interpreted only within each subrealm".

      This is problematic on several fronts. First, the authors never define what they mean by the term "tolerance". Second, they refer to urban tolerance throughout the paper, but don't describe the calculation until lines 315-319, where they write (text in [ ] is from the reviewer):

      "Within each subrealm, we further accounted for the potential of different levels of urbanization by scaling each species' urban score by subtracting the mean VIIRS of all observations in the subrealm (this value is hereafter referred to as urban tolerance). This 'urban tolerance' (Fig. S5) value can be negative - when species under-occupy urban areas [relative to the average across all species] suggesting they actively avoid them - or positive - when species over-occupy urban areas [relative to the average across all species] suggesting they prefer them (i.e., ranging from urban avoiders to urban exploiters, respectively)."

      They are taking a relativized urban score and then subtracting the mean VIIRS of all observations across species in a subrealm. How exactly one interprets the magnitude isn't clear and they admit this metric is "not interpretative across subrealms".

      This is not a true measure of tolerance, at least not in the conventional sense of how tolerance is typically defined. The problem is that a species distribution isn't being compared to some metric of urbanness, but instead it is relative to other species' urban scores, where species may, on average, be highly urban or highly nonurban in their distribution, and this may vary from subrealm to subrealm. A measure of urban tolerance should be independent of how other species are responding, and should be interpretable across subrealms, continents, and the globe.
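
      The calculation as quoted above can be written out explicitly to show why the result depends on the pool of observations in each subrealm. The sketch follows the quoted wording only; the VIIRS values and function names are hypothetical.

```python
from statistics import mean


def urban_score(species_viirs):
    """Mean VIIRS value over one species' observations (the quoted 'urban score')."""
    return mean(species_viirs)


def urban_tolerance(species_viirs, subrealm_viirs):
    """Urban score minus the mean VIIRS of all observations in the subrealm."""
    return urban_score(species_viirs) - mean(subrealm_viirs)


species = [4.0, 6.0, 8.0]  # hypothetical VIIRS values for one species

# The same species observations pooled with two different sets of other observations.
rural_subrealm = [1.0, 2.0, 3.0] + species
urban_subrealm = [20.0, 30.0, 40.0] + species

print(urban_tolerance(species, rural_subrealm))  # positive: looks like an exploiter
print(urban_tolerance(species, urban_subrealm))  # negative: looks like an avoider
```

      Identical species observations change sign between the two pools, which is the dependence on other species that this comment objects to.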

      I propose the authors use one of two metrics of urban tolerance:

      (i) Absolute Urban Tolerance = Mean VIIRS of species_i - Mean VIIRS of city centers

      Here, the mean VIIRS of city centers could be taken from the center of multiple cities throughout a subrealm, across a continent, or across the world. Here, the units are in the original VIIRS units where 0 would correspond to species being centered on the most extreme urban habitats, and the most extreme negative values would correspond to species that occupy the most non-urban habitats (i.e., no artificial light at night). In essence, this measure of tolerance would quantify how far a species' distribution is shifted relative to the most highly urbanized habitat available.

      (ii) % Urban Tolerance = (Mean VIIRS of species_i - Mean VIIRS of city centers) / Mean VIIRS of city centers * 100%

      This metric provides a % change in species mean VIIRS distribution relative to the most urban habitats. This value could theoretically be negative or positive, but will typically be negative, with -100% being completely non-urban, and 0% being completely urban tolerant.

      Both of these metrics can be compared across the world, as it would provide either absolute (equation 1) or relative (equation 2) metrics of urban tolerance that are comparable and easily interpretable in any region.
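
      The two proposed metrics can be sketched directly from the formulas above. The VIIRS values are hypothetical, and how "city centers" would be sampled is left open, as in the text.

```python
from statistics import mean


def absolute_urban_tolerance(species_viirs, city_centre_viirs):
    """Metric (i): species mean VIIRS minus city-centre mean VIIRS, in VIIRS units."""
    return mean(species_viirs) - mean(city_centre_viirs)


def percent_urban_tolerance(species_viirs, city_centre_viirs):
    """Metric (ii): the same difference as a percentage of the city-centre mean."""
    centre = mean(city_centre_viirs)
    return (mean(species_viirs) - centre) / centre * 100.0


city_centres = [40.0, 60.0]     # hypothetical VIIRS values at city centres
species_obs = [4.0, 6.0, 8.0]   # hypothetical VIIRS values for one species
dark_sky_obs = [0.0, 0.0]       # a species seen only where there is no night light

print(absolute_urban_tolerance(species_obs, city_centres))
print(round(percent_urban_tolerance(species_obs, city_centres), 1))
print(percent_urban_tolerance(dark_sky_obs, city_centres))
```

      Because the reference is fixed at the city-centre mean rather than the regional pool of species, the same observations give the same value wherever that reference is shared.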

      In summary, the definition of tolerance should be clear, the metric should be a true measure of tolerance that is comparable across regions, and an equation should be given.

      (4) Figure 1: The figure does not stand alone. For example, what is the hypothesis for thermophily or the temperature-size rule? The authors should expand the legend slightly to make the hypotheses being illustrated clearer.

      (5) SUDs: I don't agree with the conclusion given on line 83 ("pattern was consistent across subrealms and several taxonomic levels") or in the legend of Figure 2 ("there were consistent patterns for kingdoms, classes, and orders, as shown by generally similar density histograms shapes for each of these").

      The shapes of the curves are quite different, especially for the two Kingdoms and the different classes. I agree they are relatively consistent for the different taxonomic Orders of insects.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      Zhou and colleagues developed a computational model of replay that heavily builds on cognitive models of memory in context (e.g., the context-maintenance and retrieval model), which have been successfully used to explain memory phenomena in the past. Their model produces results that mirror previous empirical findings in rodents and offers a new computational framework for thinking about replay.

      Strengths:

      The model is compelling and seems to explain a number of findings from the rodent literature. It is commendable that the authors implement commonly used algorithms from wakefulness to model sleep/rest, thereby linking wake and sleep phenomena in a parsimonious way. Additionally, the manuscript's comprehensive perspective on replay, bridging humans and non-human animals, enhanced its theoretical contribution.

      Weaknesses:

      This reviewer is not a computational neuroscientist by training, so some comments may stem from misunderstandings. I hope the authors would see those instances as opportunities to clarify their findings for broader audiences.

      (1) The model predicts that temporally close items will be co-reactivated, yet evidence from humans suggests that temporal context doesn't guide sleep benefits (instead, semantic connections seem to be of more importance; Liu and Ranganath 2021, Schechtman et al 2023). Could these findings be reconciled with the model or is this a limitation of the current framework?

      We appreciate the encouragement to discuss this connection. Our framework can accommodate semantic associations as determinants of sleep-dependent consolidation, which can in principle outweigh temporal associations. Indeed, prior models in this lineage have extensively simulated how semantic associations support encoding and retrieval alongside temporal associations. It would therefore be straightforward to extend our model to simulate how semantic associations guide sleep benefits, and to compare their contribution against that conferred by temporal associations across different experimental paradigms. In the revised manuscript, we have added a discussion of how our framework may simulate the role of semantic associations in sleep-dependent consolidation.

      “Several recent studies have argued for dominance of semantic associations over temporal associations in the process of human sleep-dependent consolidation (Schechtman et al., 2023; Liu and Ranganath 2021; Sherman et al., 2025), with one study observing no role at all for temporal associations (Schechtman et al., 2023). At first glance, these findings appear in tension with our model, where temporal associations drive offline consolidation. Indeed, prior models have accounted for these findings by suppressing temporal context during sleep (Liu and Ranganath 2024; Sherman et al., 2025). However, earlier models in the CMR lineage have successfully captured the joint contributions of semantic and temporal associations to encoding and retrieval (Polyn et al., 2009), and these processes could extend naturally to offline replay. In a paradigm where semantic associations are especially salient during awake learning, the model could weight these associations more and account for greater co-reactivation and sleep-dependent memory benefits for semantically related than temporally related items. Consistent with this idea, Schechtman et al. (2023) speculated that their null temporal effects likely reflected the task’s emphasis on semantic associations. When temporal associations are more salient and task-relevant, sleep-related benefits for temporally contiguous items are more likely to emerge (e.g., Drosopoulos et al., 2007; King et al., 2017).”

      The reviewer’s comment points to fruitful directions for future work that could employ our framework to dissect the relative contributions of semantic and temporal associations to memory consolidation.

      (2) During replay, the model is set so that the next reactivated item is sampled without replacement (i.e., the model cannot get "stuck" on a single item). I'm not sure what the biological backing behind this is and why the brain can't reactivate the same item consistently.

      Furthermore, I'm afraid that such a rule may artificially generate sequential reactivation of items regardless of wake training. Could the authors explain this better or show that this isn't the case?

      We appreciate the opportunity to clarify this aspect of the model. We first note that this mechanism has long been a fundamental component of this class of models (Howard & Kahana 2002). Many classic memory models (Brown et al., 2000; Burgess & Hitch, 1991; Lewandowsky & Murdock 1989) incorporate response suppression, in which activated items are temporarily inhibited. The simplest implementation, which we use here, removes activated items from the pool of candidate items. Alternative implementations achieve this through transient inhibition, often conceptualized as neuronal fatigue (Burgess & Hitch, 1991; Grossberg 1978). Our model adopts a similar perspective, interpreting this mechanism as mimicking a brief refractory period that renders reactivated neurons unlikely to fire again within a short physiological event such as a sharp-wave ripple. Importantly, this approach does not generate spurious sequences. Instead, the model’s ability to preserve the structure of wake experience during replay depends entirely on the learned associations between items (without these associations, item order would be random). Similar assumptions are also common in models of replay. For example, reinforcement learning models of replay incorporate mechanisms such as inhibition to prevent repeated reactivations (e.g., Diekmann & Cheng, 2023) or prioritize reactivation based on ranking to limit items to a single replay (e.g., Mattar & Daw, 2018). We now discuss these points in the section titled “A context model of memory replay”:

      “This mechanism of sampling without replacement, akin to response suppression in established context memory models (Howard & Kahana 2002), could be implemented by neuronal fatigue or refractory dynamics (Burgess & Hitch, 1991; Grossberg 1978). Non-repetition during reactivation is also a common assumption in replay models that regulate reactivation through inhibition or prioritization (Diekmann & Cheng 2023; Mattar & Daw 2018; Singh et al., 2022).”
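
      The claim that non-repetition alone does not produce ordered sequences can be checked with a toy simulation. The sketch below is an illustration under stated assumptions, not the CMR-replay implementation: the function names (`replay`, `forward_fraction`), the association matrices, and the softmax temperature are all hypothetical. It samples replay events without replacement, once with flat associations and once with associations that favor each item's successor.

```python
import math
import random

def replay(assoc, start, temperature=0.2, rng=random):
    """Sample one replay event without replacement.

    assoc[i][j] is a hypothetical association strength from item i to item j.
    Each step picks the next item by a softmax over items not yet replayed.
    """
    n = len(assoc)
    sequence = [start]
    remaining = [j for j in range(n) if j != start]
    while remaining:
        current = sequence[-1]
        scores = [math.exp(assoc[current][j] / temperature) for j in remaining]
        next_item = rng.choices(remaining, weights=scores, k=1)[0]
        sequence.append(next_item)
        remaining.remove(next_item)  # response suppression: no repeats
    return sequence

def forward_fraction(assoc, n_events=2000, seed=0):
    """Fraction of events starting at item 0 that replay items in exact order."""
    rng = random.Random(seed)
    target = list(range(len(assoc)))
    hits = sum(replay(assoc, 0, rng=rng) == target for _ in range(n_events))
    return hits / n_events

n_items = 5
# No learned structure: every association is equal.
flat = [[0.0] * n_items for _ in range(n_items)]
# Learned structure: each item is most strongly linked to its successor.
learned = [[1.0 if j == i + 1 else 0.0 for j in range(n_items)]
           for i in range(n_items)]

flat_rate = forward_fraction(flat)
learned_rate = forward_fraction(learned)
print(f"ordered replay, flat associations:    {flat_rate:.3f}")
print(f"ordered replay, learned associations: {learned_rate:.3f}")
```

      With flat associations, fully ordered replay occurs only at the chance level expected for a random permutation of the four remaining items (1/4!, about 0.04), even though no item is ever repeated. Under these toy settings, ordered replay becomes the dominant outcome only once the associations carry sequence structure.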

      (3) If I understand correctly, there are two ways in which novelty (i.e., less exposure) is accounted for in the model. The first and more talked about is the suppression mechanism (lines 639-646). The second is a change in learning rates (lines 593-595). It's unclear to me why both procedures are needed, how they differ, and whether these are two different mechanisms that the model implements. Also, since the authors controlled the extent to which each item was experienced during wakefulness, it's not entirely clear to me which of the simulations manipulated novelty on an individual item level, as described in lines 593-595 (if any).

      We agree that these mechanisms and their relationships would benefit from clarification. As noted, novelty influences learning through two distinct mechanisms. First, the suppression mechanism is essential for capturing the inverse relationship between the amount of wake experience and the frequency of replay, as observed in several studies. This mechanism ensures that items with high wake activity are less likely to dominate replay. Second, the decrease in learning rates with repetition is crucial for preserving the stochasticity of replay. Without this mechanism, the model would increase weights linearly, leading to an exponential increase in the probability of successive wake items being reactivated back-to-back due to the use of a softmax choice rule. This would result in deterministic replay patterns, which are inconsistent with experimental observations.

      We have revised the Methods section to explicitly distinguish these two mechanisms:

      “This experience-dependent suppression mechanism is distinct from the reduction of learning rates through repetition; it does not modulate the update of memory associations but exclusively governs which items are most likely to initiate replay.”

      We have also clarified our rationale for including a learning rate reduction mechanism:

      “The reduction in learning rates with repetition is important for maintaining a degree of stochasticity in the model’s replay during task repetition, since linearly increasing weights would, through the softmax choice rule, exponentially amplify differences in item reactivation probabilities, sharply reducing variability in replay.”

      Finally, we now specify exactly where the learning-rate reduction is applied, namely in simulations where sequences are repeated across multiple sessions:

      “In this simulation, the learning rates progressively decrease across sessions, as described above.”
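
      The interaction between linear weight growth and the softmax rule can be made concrete with a small numerical sketch. Under a softmax, the odds of choosing item i over item j are exp((w_i - w_j) / temperature), so a weight gap that grows by a fixed amount each session multiplies those odds by a fixed factor each session. The code below assumes four candidate items, a unit temperature, and two hypothetical learning-rate schedules (constant, and decaying as 1/s² over sessions s); none of these values are taken from the manuscript.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def successor_probability(n_sessions, learning_rate):
    """Probability of picking the trained successor among 4 candidates.

    learning_rate(session) returns the hypothetical weight increment added to
    the successor's association in that session; competitors stay at 0.
    """
    weight = 0.0
    history = []
    for session in range(1, n_sessions + 1):
        weight += learning_rate(session)
        history.append(softmax([weight, 0.0, 0.0, 0.0])[0])
    return history

constant = successor_probability(10, lambda s: 1.0)          # linear weight growth
decaying = successor_probability(10, lambda s: 1.0 / s ** 2)  # shrinking increments

for session, (p_const, p_decay) in enumerate(zip(constant, decaying), start=1):
    print(f"session {session:2d}: constant={p_const:.4f}  decaying={p_decay:.4f}")
```

      In this toy setting the constant schedule drives the successor's probability toward 1, which makes replay nearly deterministic, whereas the decaying schedule lets the weight saturate so the competitors keep a substantial share of the probability.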

      As to the first mechanism - experience-based suppression - I find it challenging to think of a biological mechanism that would achieve this and is selectively activated immediately before sleep (somehow anticipating its onset). In fact, the prominent synaptic homeostasis hypothesis suggests that such suppression, at least on a synaptic level, is exactly what sleep itself does (i.e., prune or weaken synapses that were enhanced due to learning during the day). This begs the question of whether certain sleep stages (or ultradian cycles) may be involved in pruning, whereas others leverage its results for reactivation (e.g., a sequential hypothesis; Rasch & Born, 2013). That could be a compelling synthesis of this literature. Regardless of whether the authors agree, I believe that this point is a major caveat to the current model. It is addressed in the discussion, but perhaps it would be beneficial to explicitly state to what extent the results rely on the assumption of a pre-sleep suppression mechanism.

      We appreciate the reviewer raising this important point. Unlike the mechanism proposed by the synaptic homeostasis hypothesis, the suppression mechanism in our model does not suppress items based on synapse strength, nor does it modify synaptic weights. Instead, it determines the level of suppression for each item based on activity during awake experience. The brain could implement such a mechanism by tagging each item according to its activity level during wakefulness. During subsequent consolidation, the initial reactivation of an item during replay would reflect this tag, influencing how easily it can be reactivated.

      A related hypothesis has been proposed in recent work, suggesting that replay avoids recently active trajectories due to spike frequency adaptation in neurons (Mallory et al., 2024). Similarly, the suppression mechanism in our model is critical for explaining the observed negative relationship between the amount of recent wake experience and the degree of replay.

      We discuss the biological plausibility of this mechanism and its relationship with existing models in the Introduction. In the section titled “The influence of experience”, we have added the following:

      “Our model implements an activity‑dependent suppression mechanism that, at the onset of each offline replay event, assigns each item a selection probability inversely proportional to its activation during preceding wakefulness. The brain could implement this by tagging each memory trace in proportion to its recent activation; during consolidation, that tag would then regulate starting replay probability, making highly active items less likely to be reactivated. A recent paper found that replay avoids recently traversed trajectories through awake spike‑frequency adaptation (Mallory et al., 2025), which could implement this kind of mechanism. In our simulations, this suppression is essential for capturing the inverse relationship between replay frequency and prior experience. Note that, unlike the synaptic homeostasis hypothesis (Tononi & Cirelli 2006), which proposes that the brain globally downscales synaptic weights during sleep, this mechanism leaves synaptic weights unchanged and instead biases the selection process during replay.”
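
      As a literal reading of the phrase "selection probability inversely proportional to its activation", the following sketch normalizes the reciprocal of each item's wake activity. It is only an assumption-laden illustration: the function name, the item labels, and the activity values are hypothetical, and the model's actual suppression equation may differ in form.

```python
def initial_replay_probabilities(wake_activity):
    """Selection probabilities inversely proportional to wake activity.

    wake_activity maps item -> positive activation accumulated during waking
    (hypothetical units). Returns item -> probability of initiating replay.
    Association weights are not touched; only the starting choice is biased.
    """
    if any(a <= 0 for a in wake_activity.values()):
        raise ValueError("activities must be positive")
    inverse = {item: 1.0 / a for item, a in wake_activity.items()}
    total = sum(inverse.values())
    return {item: v / total for item, v in inverse.items()}

probs = initial_replay_probabilities(
    {"familiar": 4.0, "moderate": 2.0, "novel": 1.0}
)
for item, p in probs.items():
    print(f"{item:9s} {p:.3f}")
```

      The least-experienced item receives the largest share of replay initiations, which is the inverse relationship between wake experience and replay described above.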

      (4) As the manuscript mentions, the only difference between sleep and wake in the model is the initial conditions (a0). This is an obvious simplification, especially given the last author's recent models discussing the very different roles of REM vs NREM. Could the authors suggest how different sleep stages may relate to the model or how it could be developed to interact with other successful models such as the ones the last author has developed (e.g., C-HORSE)? 

      We appreciate the encouragement to comment on the roles of different sleep stages in the manuscript, especially since, as noted, the lab is very interested in this and has explored it in other work. We chose to focus on NREM in this work because the vast majority of electrophysiological studies of sleep replay have identified these events during NREM. In addition, our lab’s theory of the role of REM (Singh et al., 2022, PNAS) is that it is a time for the neocortex to replay remote memories, in complement to the more recent memories replayed during NREM. The experiments we simulate all involve recent memories. Indeed, our view is that part of the reason that there is so little data on REM replay may be that experimenters are almost always looking for traces of recent memories (for good practical and technical reasons).

      Regarding the simplicity of the distinction between simulated wake and sleep replay, we view it as an asset of the model that it can account for many of the different characteristics of awake and NREM replay with very simple assumptions about differences in the initial conditions. There are of course many other differences between the states that could be relevant to the impact of replay, but the current target empirical data did not necessitate us taking those into account. This allows us to argue that differences in initial conditions should play a substantial role in an account of the differences between wake and sleep replay.

      We have added discussion of these ideas and how they might be incorporated into future versions of the model in the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      Finally, I wonder how the model would explain findings (including the authors') showing a preference for reactivation of weaker memories. The literature seems to suggest that it isn't just a matter of novelty or exposure, but encoding strength. Can the model explain this? Or would it require additional assumptions or some mechanism for selective endogenous reactivation during sleep and rest?

      We appreciate the encouragement to discuss this, as we do think the model could explain findings showing a preference for reactivation of weaker memories, as in Schapiro et al. (2018). In our framework, memory strength is reflected in the magnitude of each memory’s associated synaptic weights, so that stronger memories yield higher retrieved‑context activity during wake encoding than weaker ones. Because the model’s suppression mechanism reduces an item’s replay probability in proportion to its retrieved‑context activity, items with larger weights (strong memories) are more heavily suppressed at the onset of replay, while those with smaller weights (weaker memories) receive less suppression. When items have matched reward exposure, this dynamic would bias offline replay toward weaker memories.

      In the section titled “The influence of experience”, we updated a sentence to discuss this idea more explicitly: 

      “Such a suppression mechanism may be adaptive, allowing replay to benefit not only the most recently or strongly encoded items but also to provide opportunities for the consolidation of weaker or older memories, consistent with empirical evidence (e.g., Schapiro et al. 2018; Yu et al., 2024).”

      (5) Lines 186-200 - Perhaps I'm misunderstanding, but wouldn't it be trivial that an external cue at the end-item of Figure 7a would result in backward replay, simply because there is no potential for forward replay for sequences starting at the last item (there simply aren't any subsequent items)? The opposite is true, of course, for the first-item replay, which can't go backward. More generally, my understanding of the literature on forward vs backward replay is that neither is linked to the rodent's location. Both commonly happen at a resting station that is further away from the track. It seems as though the model's result may not hold if replay occurs away from the track (i.e. if a0 would be equal for both pre- and post-run).

      In studies where animals run back and forth on a linear track, replay events are decoded separately for left and right runs, identifying both forward and reverse sequences for each direction, for example using direction-specific place cell sequence templates. Accordingly, in our simulation of, e.g., Ambrose et al. (2016), we use two independent sequences, one for left runs and one for right runs (an approach that has been taken in prior replay modeling work). Crucially, our model assumes a context reset between running episodes, preventing the final item of one traversal from acquiring contextual associations with the first item of the next. As a result, learning in the two sequences remains independent, and when an external cue is presented at the track’s end, replay predominantly unfolds in the backward direction, only occasionally producing forward segments when the cue briefly reactivates an earlier sequence item before proceeding forward.

      We added a note to the section titled “The context-dependency of memory replay” to clarify this:

      “In our model, these patterns are identical to those in our simulation of Ambrose et al. (2016), which uses two independent sequences to mimic the two run directions. This is because the drifting context resets before each run sequence is encoded, with the pause between runs acting as an event boundary that prevents the final item of one traversal from associating with the first item of the next, thereby keeping learning in each direction independent.”

      To our knowledge, no study has observed a similar asymmetry when animals are fully removed from the track, although both types of replay can be observed when animals are away from the track. For example, Gupta et al. (2010) demonstrated that when animals replay trajectories far from their current location, the ratio of forward vs. backward replay appears more balanced. We now highlight this result in the manuscript and explain how it aligns with the predictions of our model:

      “For example, in tasks where the goal is positioned in the middle of an arm rather than at its end, CMR-replay predicts a more balanced ratio of forward and reverse replay, whereas the EVB model still predicts a dominance of reverse replay due to backward gain propagation from the reward. This contrast aligns with empirical findings showing that when the goal is located in the middle of an arm, replay events are more evenly split between forward and reverse directions (Gupta et al., 2010), whereas placing the goal at the end of a track produces a stronger bias toward reverse replay (Diba & Buzsaki 2007).” 

      Although no studies, to our knowledge, have observed a context-dependent asymmetry between forward and backward replay when the animal is away from the track, our model does posit conditions under which it could. Specifically, it predicts that deliberation on a specific memory, such as during planning, could generate an internal context input that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track.

      We now discuss this prediction in the section titled “The context-dependency of memory replay”:

      “Our model also predicts that deliberation on a specific memory, such as during planning, could serve to elicit an internal context cue that biases replay: actively recalling the first item of a sequence may favor forward replay, while thinking about the last item may promote backward replay, even when the individual is physically distant from the track. While not explored here, this mechanism presents a potential avenue for future modeling and empirical work.”

      (6) The manuscript describes a study by Bendor & Wilson (2012) and tightly mimics their results. However, notably, that study did not find triggered replay immediately following sound presentation, but rather a general bias toward reactivation of the cued sequence over longer stretches of time. In other words, it seems that the model's results don't fully mirror the empirical results. One idea that came to mind is that perhaps it is the R/L context - not the first R/L item - that is cued in this study. This is in line with other TMR studies showing what may be seen as contextual reactivation. If the authors think that such a simulation may better mirror the empirical results, I encourage them to try. If not, however, this limitation should be discussed.

      First, although our model predicts that replay is triggered immediately by the sound cue, it also predicts a sustained bias toward the cued sequence. Replay in our model unfolds across the rest phase as multiple successive events, so the bias observed in our sleep simulations indeed reflects a prolonged preference for the cued sequence.

      We now discuss this issue, acknowledging the discrepancy:

      “Bendor and Wilson (2012) found that sound cues during sleep did not trigger immediate replay, but instead biased reactivation toward the cued sequence over an extended period of time. While the model does exhibit some replay triggered immediately by the cue, it also captures the sustained bias toward the cued sequence over an extended period.”

      Second, within this framework, context is modeled as a weighted average of the features associated with items. As a result, cueing the model with the first R/L item produces qualitatively similar outcomes as cueing it with a more extended R/L cue that incorporates features of additional items. This is because both approaches ultimately use context features unique to the two sides.

      (7) There is some discussion about replay's benefit to memory. One point of interest could be whether this benefit changes between wake and sleep. Relatedly, it would be interesting to see whether the proportion of forward replay, backward replay, or both correlated with memory benefits. I encourage the authors to extend the section on the function of replay and explore these questions.

      We thank the reviewer for this suggestion. Regarding differences in the contribution of wake and sleep to memory, our current simulations predict that compared to rest in the task environment, sleep is less biased toward initiating replay at specific items, leading to a more uniform benefit across all memories. Regarding the contributions of forward and backward replay, our model predicts that both strengthen bidirectional associations between items and contexts, benefiting memory in qualitatively similar ways. Furthermore, we suggest that the offline learning captured by our teacher-student simulations reflects consolidation processes that are specific to sleep.

      We have expanded the section titled “The influence of experience” to discuss these predictions of the model:

      “The results outlined above arise from the model's assumption that replay strengthens bidirectional associations between items and contexts to benefit memory. This assumption leads to several predictions about differences across replay types. First, the model predicts that sleep yields different memory benefits compared to rest in the task environment: Sleep is less biased toward initiating replay at specific items, resulting in a more uniform benefit across all memories. Second, the model predicts that forward and backward replay contribute to memory in qualitatively similar ways but tend to benefit different memories. This divergence arises because forward and backward replay exhibit distinct item preferences, with backward replay being more likely to include rewarded items, thereby preferentially benefiting those memories.”

      We also updated the “The function of replay” section to include our teacher-student speculation:

      “We speculate that the offline learning observed in these simulations corresponds to consolidation processes that operate specifically during sleep, when hippocampal-neocortical dynamics are especially tightly coupled (Klinzing et al., 2019).”

      (8) Replay has been mostly studied in rodents, with few exceptions, whereas CMR and similar models have mostly been used in humans. Although replay is considered a good model of episodic memory, it is still limited due to limited findings of sequential replay in humans and its reliance on very structured and inherently autocorrelated items (i.e., place fields). I'm wondering if the authors could speak to the implications of those limitations on the generalizability of their model. Relatedly, I wonder if the model could or does lead to generalization to some extent in a way that would align with the complementary learning systems framework.

      We appreciate these insightful comments. Traditionally, replay studies have focused on spatial tasks with autocorrelated item representations (e.g., place fields). However, an increasing number of human studies have demonstrated sequential replay using stimuli with distinct, unrelated representations. Our model is designed to accommodate both scenarios. In our current simulations, we employ orthogonal item representations while leveraging a shared, temporally autocorrelated context to link successive items. We anticipate that incorporating autocorrelated item representations would further enhance sequence memory by increasing the similarity between successive contexts. Overall, we believe that the model generalizes across a broad range of experimental settings, regardless of the degree of autocorrelation between items. Moreover, the underlying framework has been successfully applied to explain sequential memory in both spatial domains, explaining place cell firing properties (e.g., Howard et al., 2004), and in non-spatial domains, such as free recall experiments where items are arbitrarily related. 

      In the section titled “A context model of memory replay”, we added this comment to address this point:

      “Its contiguity bias stems from its use of shared, temporally autocorrelated context to link successive items, despite the orthogonal nature of individual item representations. This bias would be even stronger if items had overlapping representations, as observed in place fields.”
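
      How a drifting context can link orthogonal items is easy to see in a stripped-down version of the temporal-context drift rule, c_t = rho * c_{t-1} + beta * f_t. The sketch below is a simplified illustration rather than the CMR-replay code: the function names, the value of `beta`, the one-hot item vectors, and the extra "pre-list" context unit are assumptions made for the example.

```python
import math

def encode_contexts(n_items, beta=0.6):
    """Return the context vector present after encoding each item.

    Items are one-hot (mutually orthogonal) vectors; one extra unit holds the
    pre-list context. Drift rule: c_t = rho * c_{t-1} + beta * f_t.
    """
    rho = math.sqrt(1.0 - beta ** 2)  # keeps the context at unit length here
    context = [0.0] * (n_items + 1)
    context[-1] = 1.0  # starting context, orthogonal to every item
    contexts = []
    for item in range(n_items):
        context = [rho * c for c in context]
        context[item] += beta
        contexts.append(list(context))
    return contexts

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

contexts = encode_contexts(6)
similarity_by_lag = [dot(contexts[0], contexts[lag]) for lag in range(6)]
for lag, sim in enumerate(similarity_by_lag):
    print(f"lag {lag}: context similarity = {sim:.3f}")
```

      Although the item vectors share no features, the similarity between the contexts of two items falls off geometrically with their separation in the list (by a factor of rho per step in this sketch), which is the source of the contiguity bias described in the quoted passage.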

      Since CMR-replay learns distributed context representations where overlap across context vectors captures associative structure, and replay helps strengthen that overlap, this could indeed be viewed as consonant with complementary learning systems integration processes. 

      Reviewer #2 (Public Review):

      This manuscript proposes a model of replay that focuses on the relation between an item and its context, without considering the value of the item. The model simulates awake learning, awake replay, and sleep replay, and demonstrates parallels between memory phenomena driven by encoding strength, replay of sequence learning, and activation of nearest neighbor to infer causality. There is some discussion of the importance of suppression/inhibition to reduce activation of only dominant memories to be replayed, potentially boosting memories that are weakly encoded. Very nice replications of several key replay findings including the effect of reward and remote replay, demonstrating the equally salient cue of context for offline memory consolidation.

      I have no suggestions for the main body of the study, including methods and simulations, as the work is comprehensive, transparent, and well-described. However, I would like to understand how the CMR-replay model fits with the current understanding of the importance of excitation vs inhibition, remembering vs forgetting, activation vs deactivation, strengthening vs elimination of synapses, and even NREM vs REM as Schapiro has modeled. There seems to be a strong association with the efforts of the model to instantiate a memory as well as how that reinstantiation changes across time. But that is not all there is to consolidation. The specific roles of different brain states and how they might change replay is also an important consideration.

      We are gratified that the reviewer appreciated the work, and we agree that the paper would benefit from comment on the connections to these other features of consolidation.

      Excitation vs. inhibition: CMR-replay does not model variations in the excitation-inhibition balance across brain states (as in other models, e.g., Chenkov et al., 2017), since it does not include inhibitory connections. However, we posit that the experience-dependent suppression mechanism in the model might, in the brain, involve inhibitory processes. Supporting this idea, studies have observed increased inhibition with task repetition (Berners-Lee et al., 2022). We hypothesize that such mechanisms may underlie the observed inverse relationship between task experience and replay frequency in many studies. We discuss this in the section titled “A context model of memory replay”:

      “The proposal that a suppression mechanism plays a role in replay aligns with models that regulate place cell reactivation via inhibition (Malerba et al., 2016) and with empirical observations of increased hippocampal inhibitory interneuron activity with experience (Berners-Lee et al., 2022). Our model assumes the presence of such inhibitory mechanisms but does not explicitly model them.”

      Remembering/forgetting, activation/deactivation, and strengthening/elimination of synapses: The model does not simulate synaptic weight reduction or pruning, so it does not forget memories through the weakening of associated weights. However, forgetting can occur when a memory is replayed less frequently than others, leading to reduced activation of that memory compared to its competitors during context-driven retrieval. In the Discussion section, we acknowledge that a biologically implausible aspect of our model is that it implements only synaptic strengthening: 

      “Aspects of the model, such as its lack of regulation of the cumulative positive weight changes that can accrue through repeated replay, are biologically implausible (as biological learning results in both increases and decreases in synaptic weights) and limit the ability to engage with certain forms of low level neural data (e.g., changes in spine density over sleep periods; de Vivo et al., 2017; Maret et al., 2011). It will be useful for future work to explore model variants with more elements of biological plausibility.”

      Different brain states and NREM vs REM: Reviewer 1 also raised this important issue (see above). We have added the following thoughts on differences between these states and the relationship to our prior work to the Discussion section:

      “Our current simulations have focused on NREM, since the vast majority of electrophysiological studies of sleep replay have identified replay events in this stage. We have proposed in other work that replay during REM sleep may provide a complementary role to NREM sleep, allowing neocortical areas to reinstate remote, already-consolidated memories that need to be integrated with the memories that were recently encoded in the hippocampus and replayed during NREM (Singh et al., 2022). An extension of our model could undertake this kind of continual learning setup, where the student but not teacher network retains remote memories, and the driver of replay alternates between hippocampus (NREM) and cortex (REM) over the course of a night of simulated sleep. Other differences between stages of sleep and between sleep and wake states are likely to become important for a full account of how replay impacts memory. Our current model parsimoniously explains a range of differences between awake and sleep replay by assuming simple differences in initial conditions, but we expect many more characteristics of these states (e.g., neural activity levels, oscillatory profiles, neurotransmitter levels, etc.) will be useful to incorporate in the future.”

      We hope these points clarify the model’s scope and its potential for future extensions.

      Do the authors suggest that these replay systems are more universal to offline processes beyond episodic memory? What about procedural memories and working memory?

      We thank the reviewer for raising this important question. We have clarified in the manuscript:

      “We focus on the model as a formulation of hippocampal replay, capturing how the hippocampus may replay past experiences through simple and interpretable mechanisms.”

      With respect to other forms of memory, we now note that:

      “This motor memory simulation using a model of hippocampal replay is consistent with evidence that hippocampal replay can contribute to consolidating memories that are not hippocampally dependent at encoding (Schapiro et al., 2019; Sawangjit et al., 2018). It is possible that replay in other, more domain-specific areas could also contribute (Eichenlaub et al., 2020).”

      Though this is not a biophysical model per se, can the authors speak to the neuromodulatory milieus that give rise to the different types of replay?

      Our work aligns with the perspective proposed by Hasselmo (1999), which suggests that waking and sleep states differ in the degree to which hippocampal activity is driven by external inputs. Specifically, high acetylcholine levels during waking bias activity to flow into the hippocampus, while low acetylcholine levels during sleep allow hippocampal activity to influence other brain regions. Consistent with this view, our model posits that wake replay is more biased toward items associated with the current resting location due to the presence of external input during waking states. In the Discussion section, we have added a comment on this point:

      “Our view aligns with the theory proposed by Hasselmo (1999), which suggests that the degree of hippocampal activity driven by external inputs differs between waking and sleep states: High acetylcholine levels during wakefulness bias activity into the hippocampus, while low acetylcholine levels during slow-wave sleep allow hippocampal activity to influence other brain regions.”

      Reviewer #3 (Public Review):

      In this manuscript, Zhou et al. present a computational model of memory replay. Their model (CMR-replay) draws from temporal context models of human memory (e.g., TCM, CMR) and claims replay may be another instance of a context-guided memory process. During awake learning, CMR-replay (like its predecessors) encodes items alongside a drifting mental context that maintains a recency-weighted history of recently encoded contexts/items. In this way, the presently encoded item becomes associated with other recently learned items via their shared context representation - giving rise to typical effects in recall such as primacy, recency, and contiguity. Unlike its predecessors, CMR-replay has built-in replay periods. These replay periods are designed to approximate sleep or wakeful quiescence, in which an item is spontaneously reactivated, causing a subsequent cascade of item-context reactivations that further update the model's item-context associations.

      Using this model of replay, Zhou et al. were able to reproduce a variety of empirical findings in the replay literature: e.g., greater forward replay at the beginning of a track and more backward replay at the end; more replay for rewarded events; the occurrence of remote replay; reduced replay for repeated items, etc. Furthermore, the model diverges considerably (in implementation and predictions) from other prominent models of replay that, instead, emphasize replay as a way of predicting value from a reinforcement learning framing (i.e., EVB, expected value backup).

      Overall, I found the manuscript clear and easy to follow, despite not being a computational modeller myself. (Which is pretty commendable, I'd say). The model also was effective at capturing several important empirical results from the replay literature while relying on a concise set of mechanisms - which will have implications for subsequent theory-building in the field.

      With respect to weaknesses, additional details for some of the methods and results would help the readers better evaluate the data presented here (e.g., explicitly defining how the various 'proportion of replay' DVs were calculated).

      For example, for many of the simulations, the y-axis scale differs from the empirical data despite using comparable units, like the proportion of replay events (e.g., Figures 1B and C). Presumably, this was done to emphasize the similarity between the empirical and model data. But, as a reader, I often found myself doing the mental manipulation myself anyway to better evaluate how the model compared to the empirical data. Please consider using comparable y-axis ranges across empirical and simulated data wherever possible.

      We appreciate this point. As in many replay modeling studies, our primary goal is to provide a qualitative fit that demonstrates the general direction of differences between our model and empirical data, without engaging in detailed parameter fitting for a precise quantitative fit. Still, we agree that where possible, it is useful to better match the axes. We have updated figures 2B and 2C so that the y-axis scales are more directly comparable between the empirical and simulated data. 

      In a similar vein to the above point, while the DVs in the simulations/empirical data made intuitive sense, I wasn't always sure precisely how they were calculated. Consider the "proportion of replay" in Figure 1A. In the Methods (perhaps under Task Simulations), it should specify exactly how this proportion was calculated (e.g., proportions of all replay events, both forwards and backwards, combining across all simulations from Pre- and Post-run rest periods). In many of the examples, the proportions seem to possibly sum to 1 (e.g., Figure 1A), but in other cases, this doesn't seem to be true (e.g., Figure 3A). More clarity here is critical to help readers evaluate these data. Furthermore, sometimes the labels themselves are not the most informative. For example, in Figure 1A, the y-axis is "Proportion of replay" and in 1C it is the "Proportion of events". I presumed those were the same thing - the proportion of replay events - but it would be best if the axis labels were consistent across figures in this manuscript when they reflect the same DV.

      We appreciate these useful suggestions. We have revised the Methods section to explain in detail how DVs are calculated for each simulation. The revisions clarify the differences between related measures, such as those shown in Figures 1A and 1C, so that readers can more easily see how the DVs are defined and interpreted in each case. 

      Reviewer #4/Reviewing Editor (Public Review):

      Summary:

      With their 'CMR-replay' model, Zhou et al. demonstrate that the use of spontaneous neural cascades in a context-maintenance and retrieval (CMR) model significantly expands the range of captured memory phenomena.

      Strengths:

      The proposed model compellingly outperforms its CMR predecessor and, thus, makes important strides towards understanding the empirical memory literature, as well as highlighting a cognitive function of replay.

      Weaknesses:

      Competing accounts of replay are acknowledged but there are no formal comparisons and only CMR-replay predictions are visualized. Indeed, other than the CMR model, only one alternative account is given serious consideration: A variant of the 'Dyna-replay' architecture, originally developed in the machine learning literature (Sutton, 1990; Moore & Atkeson, 1993) and modified by Mattar et al (2018) such that previously experienced event-sequences get replayed based on their relevance to future gain. Mattar et al acknowledged that a realistic Dyna-replay mechanism would require a learned representation of transitions between perceptual and motor events, i.e., a 'cognitive map'. While Zhou et al. note that the CMR-replay model might provide such a complementary mechanism, they emphasize that their account captures replay characteristics that Dyna-replay does not (though it is unclear to what extent the reverse is also true).

      We thank the reviewer for these thoughtful comments and appreciate the opportunity to clarify our approach. Our goal in this work is to contrast two dominant perspectives in replay research: replay as a mechanism for learning reward predictions and replay as a process for memory consolidation. These models were chosen as representatives of their classes of models because they use simple and interpretable mechanisms that can simulate a wide range of replay phenomena, making them ideal for contrasting these two perspectives.

      Although we implemented CMR-replay as a straightforward example of the memory-focused view, we believe the proposed mechanisms could be extended to other architectures, such as recurrent neural networks, to produce similar results. We now discuss this possibility in the revised manuscript (see below). However, given our primary goal of providing a broad, qualitative contrast of these two perspectives, we decided not to undertake simulations with additional individual models for this paper.

      Regarding the Mattar & Daw model, it is true that a mechanistic implementation would require a mechanism that avoids precomputing priorities before replay. However, the "need" component of their model already incorporates learned expectations of transitions between actions and events. Thus, the model's limitations are not due to the absence of a cognitive map.

      In contrast, while CMR-replay also accumulates memory associations that reflect experienced transitions among events, it generates several qualitatively distinct predictions compared to the Mattar & Daw model. As we note in the manuscript, these distinctions make CMR-replay a contrasting rather than complementary perspective.

      Another important consideration, however, is how CMR-replay compares to alternative mechanistic accounts of cognitive maps. For example, Recurrent Neural Networks are adept at detecting spatial and temporal dependencies in sequential input; these networks are being increasingly used to capture psychological and neuroscientific data (e.g., Zhang et al, 2020; Spoerer et al, 2020), including hippocampal replay specifically (Haga & Fukai, 2018). Another relevant framework is provided by Associative Learning Theory, in which bidirectional associations between static and transient stimulus elements are commonly used to explain contextual and cue-based phenomena, including associative retrieval of absent events (McLaren et al, 1989; Harris, 2006; Kokkola et al, 2019). Without proper integration with these modeling approaches, it is difficult to gauge the innovation and significance of CMR-replay, particularly since the model is applied post hoc to the relatively narrow domain of rodent maze navigation.

      First, we would like to clarify that our principal aim in this work is to characterize the nature of replay, rather than to model cognitive maps per se. Accordingly, CMR-replay is not designed to simulate head-direction signals, perform path integration, or explain the spatial firing properties of neurons during navigation. Instead, it focuses squarely on sequential replay phenomena, simulating classic rodent maze reactivation studies and human sequence-learning tasks. These simulations span a broad array of experimental replay paradigms to ensure extensive coverage of the replay findings reported across the literature. As such, the contribution of this work lies in explaining the mechanisms and functional roles of replay, and in demonstrating that a model that employs simple and interpretable memory mechanisms not only explains replay phenomena traditionally interpreted through a value-based lens but also accounts for findings not addressed by other memory-focused models.

      As the reviewer notes, CMR-replay shares features with other memory-focused models. However, to our knowledge, none of these related approaches have yet captured the full suite of empirical replay phenomena, suggesting the combination of mechanisms employed in CMR-replay is essential for explaining these phenomena. In the Discussion section, we now discuss the similarities between CMR-replay and related memory models and the possibility of integrating these approaches:

      “Our theory builds on a lineage of memory-focused models, demonstrating the power of this perspective in explaining phenomena that have often been attributed to the optimization of value-based predictions. In this work, we focus on CMR-replay, which exemplifies the memory-centric approach through a set of simple and interpretable mechanisms that we believe are broadly applicable across memory domains. Elements of CMR-replay share similarities with other models that adopt a memory-focused perspective. The model learns distributed context representations whose overlaps encode associations among items, echoing associative learning theories in which overlapping patterns capture stimulus similarity and learned associations (McLaren & Mackintosh 2002). Context evolves through bidirectional interactions between items and their contextual representations, mirroring the dynamics found in recurrent neural networks (Haga & Fukai 2018; Levenstein et al., 2024). However, these related approaches have not been shown to account for the present set of replay findings and lack mechanisms, such as reward-modulated encoding and experience-dependent suppression, that our simulations suggest are essential for capturing these phenomena. While not explored here, we believe these mechanisms could be integrated into architectures like recurrent neural networks (Levenstein et al., 2024) to support a broader range of replay dynamics.”

      Recommendations For The Authors

      Reviewer #1 (Recommendations For The Authors):

      (1) Lines 94-96: These lines may be better positioned earlier in the paragraph.

      We now introduce these lines earlier in the paragraph.

      (2) Line 103 - It's unclear to me what is meant by the statement that "the current context contains contexts associated with previous items". I understand why a slowly drifting context will coincide and therefore link with multiple items that progress rapidly in time, so multiple items will be linked to the same context and each item will be linked to multiple contexts. Is that the idea conveyed here or am I missing something? I'm similarly confused by line 129, which mentions that a context is updated by incorporating other items' contexts. How could a context contain other contexts?

      In the model, each item has an associated context that can be retrieved via Mfc. This is true even before learning, since Mfc is initialized as an identity matrix. During learning and replay, we have a drifting context c that is updated each time an item is presented. At each timestep, the model first retrieves the current item’s associated context cf by Mfc, and incorporates it into c. Equation #2 in the Methods section illustrates this procedure in detail. Because of this procedure, the drifting context c is a weighted sum of past items’ associated contexts. 

      We recognize that these descriptions can be confusing. We have updated the Results section to better distinguish the drifting context from items’ associated context. For example, we note that:

      “We represent the drifting context during learning and replay with c and an item's associated context with cf.”

      We have also updated our description of the context drift procedure to distinguish these two quantities: 

      “During awake encoding of a sequence of items, for each item f, the model retrieves its associated context cf via Mfc. The drifting context c incorporates the item's associated context cf and downweights its representation of previous items' associated contexts (Figure 1c). Thus, the context layer maintains a recency weighted sum of past and present items' associated contexts.”
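      The drift procedure described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the normalisation used in standard temporal context models (rho chosen so the context keeps unit length), an identity Mfc (so each item's associated context cf is a one-hot vector), a hypothetical extra "start" unit for the initial context, and an arbitrary drift rate beta = 0.6.

```python
import math

def one_hot(index, size):
    """Unit vector with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def drift(context, item_context, beta):
    """One drift step: c <- rho * c + beta * c_f, with rho keeping ||c|| = 1.

    Assumes both input vectors already have unit length.
    """
    dot = sum(c * f for c, f in zip(context, item_context))
    rho = math.sqrt(1.0 + beta ** 2 * (dot ** 2 - 1.0)) - beta * dot
    return [rho * c + beta * f for c, f in zip(context, item_context)]

n_items = 4
size = n_items + 1                   # last unit is a hypothetical start state
context = one_hot(n_items, size)     # drifting context c before any item
for item in range(n_items):
    c_f = one_hot(item, size)        # identity Mfc: associated context is one-hot
    context = drift(context, c_f, beta=0.6)

# Weight of each item's associated context in the final drifting context.
weights = context[:n_items]
```

      Under these assumptions, `weights` grows from the first to the last item, which is the recency-weighted sum of associated contexts that the response describes.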

      (3) Figure 1b and 1d - please clarify which axis in the association matrices represents the item and the context.

      We have added labels to show what the axes represent in Figure 1.

      (4) The terms "experience" and "item" are used interchangeably and it may be best to stick to one term.

      We now use the term “item” wherever we describe the model results. 

      (5) The manuscript describes Figure 6 ahead of earlier figures - the authors may want to reorder their figures to improve readability.

      We appreciate this suggestion. We decided to keep the current figure organization since it allows us to group results into different themes and avoid redundancy. 

      (6) Lines 662-664 are repeated with a different ending, this is likely an error.

      We have fixed this error.

      Reviewer #3 (Recommendations For The Authors):

      Below, I have outlined some additional points that came to mind in reviewing the manuscript - in no particular order.

      (1) Figure 1: I found the ordering of panels a bit confusing in this figure, as the reading direction changes a couple of times in going from A to F. Would perhaps putting panel C in the bottom left corner and then D at the top right, with E and F below (also on the right) work?

      We agree that this improves the figure. We have restructured the ordering of panels in this figure. 

      (2) Simulation 1: When reading the intro/results for the first simulation (Figure 2a; Diba & Buzsáki, 2007; "When animals traverse a linear track...", page 6, line 186), it wasn't clear to me why pre-run rest would have any forward replay, particularly if pre-run implied that the animal had no experience with the track yet. But in the Methods this becomes clearer, as the model encodes the track eight times prior to the rest periods. Making this explicit in the text would make it easier to follow. Also, was there any reason why specifically eight sessions of awake learning were used?

      We now make more explicit that the animals have experience with the track before pre-run rest recording:

      “Animals first acquire experience with a linear track by traversing it to collect a reward. Then, during the pre-run rest recording, forward replay predominates.”

      We included eight sessions of awake learning to match the number of sessions in Shin et al. (2019), since this simulation attempts to explain data from that study. After each repetition, the model engages in rest. We have revised the Methods section to indicate the motivation for this choice:

      “In the simulation that examines context-dependent forward and backward replay through experience (Figs. 2a and 5a), CMR-replay encodes an input sequence shown in Fig. 7a, which simulates a linear track run with no ambiguity in the direction of inputs, over eight awake learning sessions (as in Shin et al. 2019)”

      (3) Frequency of remote replay events: In the simulation based on Gupta et al, how frequently overall does remote replay occur? In the main text, the authors mention the mean frequency with which shortcut replay occurs (i.e., the mean proportion of replay events that contain a shortcut sequence = 0.0046), which was helpful. But, it also made me wonder about the likelihood of remote replay events. I would imagine that remote replay events are infrequent as well - given that it is considerably more likely to replay sequences from the local track, given the recency-weighted mental context. Reporting the above mean proportion for remote and local replay events would be helpful context for the reader.

      In Figure 4c, we report the proportion of remote replay in the two experimental conditions of Gupta et al. that we simulate. 

      (4) Point of clarification re: backwards replay: Is backwards replay less likely to occur than forward replay overall because of the forward asymmetry associated with these models? For example, for a backwards replay event to occur, the context would need to drift backwards at least five times in a row, in spite of a higher probability of moving one step forward at each of those steps. Am I getting that right?

      The reviewer’s interpretation is correct: CMR-replay is more likely to produce forward than backward replay in sleep because of its forward asymmetry. We note that this forward asymmetry leads to a high likelihood of forward replay in the section titled “The context-dependency of memory replay”:

      “As with prior retrieved context models (Howard & Kahana 2002; Polyn et al., 2009), CMR-replay encodes stronger forward than backward associations. This asymmetry exists because, during the first encoding of a sequence, an item's associated context contributes only to its ensuing items' encoding contexts. Therefore, after encoding, bringing back an item's associated context is more likely to reactivate its ensuing than preceding items, leading to forward asymmetric replay (Fig. 6d left).”
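      The asymmetry argument in the quoted passage can be checked with a toy simulation. The sketch below is a simplification under stated assumptions, not the published model: item-to-context associations are fixed at identity (one-hot associated contexts), each item is bound to the drifting context active at its encoding by a single Hebbian step, the sequence is encoded once, and the drift rate (0.6) and the extra "start" unit are hypothetical choices.

```python
import math

def one_hot(index, size):
    """Unit vector with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def encode_sequence(n_items, beta):
    """Encode items 0..n_items-1 once; return context-to-item weights.

    m_cf[item][unit] stores the drifting context that was active when
    `item` was encoded (one Hebbian binding per item).
    """
    size = n_items + 1                       # last unit: hypothetical start state
    context = one_hot(n_items, size)
    # On a first pass each new one-hot context is orthogonal to the current
    # drifting context, so rho = sqrt(1 - beta^2) keeps unit length.
    rho = math.sqrt(1.0 - beta ** 2)
    m_cf = []
    for item in range(n_items):
        c_f = one_hot(item, size)
        context = [rho * c + beta * f for c, f in zip(context, c_f)]
        m_cf.append(list(context))
    return m_cf

def support(m_cf, cue_context):
    """Activation each item receives when `cue_context` is reinstated."""
    return [sum(w * c for w, c in zip(row, cue_context)) for row in m_cf]

m_cf = encode_sequence(n_items=5, beta=0.6)
# Bring back the associated context of the middle item (index 2).
activation = support(m_cf, one_hot(2, 6))
```

      In this toy setting the cue gives no support to items 0 and 1 and decreasing support to items 3 and 4, because item 2's associated context entered only the encoding contexts of the items that followed it.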

      (5) On terminating a replay period: "At any t, the replay period ends with a probability of 0.1 or if a task-irrelevant item is reactivated." (Figure 1 caption; see also pg 18, line 635). How was the 0.1 decided upon? Also, could you please add some detail as to what a 'task-irrelevant item' would be? From what I understood, the model only learns sequences that represent the points in a track - wouldn't all the points in the track be task-relevant?

      This value was arbitrarily chosen as a small value that allows probabilistic stopping. It was not motivated by prior modeling or a systematic search. We have added: “At each timestep, the replay period ends either with a stop probability of 0.1 or if a task-irrelevant item becomes reactivated. (The choice of the value 0.1 was arbitrary; future work could explore the implications of varying this parameter).” 

      In addition, we now explain in the paper that task irrelevant items “do not appear as inputs during awake encoding, but compete with task-relevant items for reactivation during replay, simulating the idea that other experiences likely compete with current experiences during periods of retrieval and reactivation.”
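      To give a feel for what a stop probability of 0.1 implies, the stopping rule can be simulated directly. This is a rough sketch under assumptions that are not in the manuscript: the chance of reactivating a task-irrelevant item is replaced by a fixed hypothetical probability `p_irrelevant` (in the model it would depend on item activations), the terminating reactivation is counted as part of the replay period, and the order of the two checks within a timestep is arbitrary.

```python
import random

def replay_length(stop_prob, p_irrelevant, rng):
    """Count reactivations in one replay period under a simple stopping rule."""
    steps = 0
    while True:
        steps += 1
        # Hypothetical stand-in: fixed chance that the reactivated item
        # is task-irrelevant, which ends the replay period.
        if rng.random() < p_irrelevant:
            return steps
        # Probabilistic stopping with the stated stop probability.
        if rng.random() < stop_prob:
            return steps

rng = random.Random(0)
n_runs = 20000

# Stopping driven by the stop probability alone.
lengths = [replay_length(0.1, 0.0, rng) for _ in range(n_runs)]
mean_length = sum(lengths) / n_runs

# Same rule with competition from task-irrelevant items (hypothetical 0.2).
lengths_competition = [replay_length(0.1, 0.2, rng) for _ in range(n_runs)]
mean_length_competition = sum(lengths_competition) / n_runs
```

      With the stop probability alone, period lengths are geometrically distributed, so `mean_length` should come out close to 1/0.1 = 10 reactivations; any competition from task-irrelevant items shortens replay further.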

      (6) Minor typos:

      Turn all instances of "nonlocal" into "non-local", or vice versa

      "For rest at the end of a run, cexternal is the context associated with the final item in the sequence. For rest at the end of a run, cexternal is the context associated with the start item." (pg 20, line 663) - I believe this is a typo and that the second sentence should begin with "For rest at the START of a run".

      We have updated the manuscript to correct these typos. 

      (7) Code availability: I may have missed it, but it doesn't seem like the code is currently available for these simulations. Including the commented code in a public repository (Github, OSF) would be very useful in this case.

      We now include a Github link to our simulation code: https://github.com/schapirolab/CMR-replay.

    1. early detection

      Regarding the decline in age-standardized incidence rates, we expect that as diagnostic tools improve and early detection advances, more cases will be identified, which may lead to an increase in this indicator. I think it might be better to relate this factor to the improvement of preventive strategies.

    1. Reviewer #2 (Public review):

      Okabe and colleagues build on a super-resolution-based technique that they have previously developed in cultured hippocampal neurons, improving the pipeline and using it to analyze spine nanostructure differences across 8 different mouse lines with mutations in autism or schizophrenia (Sz) risk genes/pathways. It is a worthy goal to try to use multiple models to examine potential convergent (or not) phenotypes, and the authors have made a good selection of models. They identify some key differences between the autism versus the Sz risk gene models, primarily that dendritic spines are smaller in Sz models and (mostly) larger in autism risk gene models. They then focus on three models (2 Sz - 22q11.2 deletion, Setd1a; 1 ASD - Nlgn3) for time-lapse imaging of spine dynamics, and together with computational modelling provide a mechanistic rationale for the smaller spines in Sz risk models. Bulk RNA sequencing of all 8 model cultures identifies several differentially expressed genes, which they go on to test in cultures, finding that ecgr4 is upregulated in several Sz models and its misexpression recapitulates spine dynamics changes seen in the Sz mutants, while knockdown rescues spine dynamics changes in the Sz mutants. Overall, these have the potential to be very interesting findings and useful for the field. However, I do have a number of major concerns.

      (1) The main finding of spine nanostructure changes is done by carrying out a PCA on various structural parameters, creating spine density plots across PC1 and PC2, and then subtracting the WT density plot from the mutant. Then, spines in the areas with obvious differences only are analyzed, from which they derive the finding that, for example, spine sizes are smaller. However, this seems a circular approach. It is like first identifying where there might be a difference in the data, then only analyzing that part of the data. I welcome input from a statistician, but to me, this is at best unconventional and potentially misleading. I assume the overall means are not different (although this should be included), but could they look at the distribution of sizes and see if these are shifted?

      (2) Despite extracting 64 parameters describing spine structure, only 5 of these seemed to be used for the PCA. It should be possible to use all parameters and show the same results. More information on PC1 and PC2 would be helpful, given that the rest of the paper is based on these - what features are they related to? These specific features could then be analyzed in the full dataset, without doing the cherry picking above. It would also be helpful to demonstrate whether PC1 and 2 differ across groups - for example, the authors could break their WT data into 2 subsets and repeat the analysis.

      (3) Throughout the paper, the 'n' used for statistical analysis is often spine, which is not appropriate. At a minimum, cell should be used, but ideally a nested mixed model, which would take into account factors like cell, culture, and animal, would be preferable. Also, all of these factors should be listed, with sufficient independent cultures.

      (4) The authors should confirm that all mutants are also on the C57BL/6J background, and clarify whether control cultures are from littermates (this would be important). Also, are control versus mutant cultures done simultaneously? There can be significant batch effects with cultures.

      (5) The spine analysis uses cultures from 18-22 DIV - this is quite a large range. It would be worth checking whether age is a confounder or correlated with any parameters / principal components.

      (6) The computational modelling is interesting, but again, I am concerned about some circularity. Parameter optimization was used to identify the best fit model that replicated the spine turnover rates, so it is somewhat circular to say that this matched the observations when one of these is the turnover rate. It is more convincing for spine density and size, but why not go back and test whether parameter differences are actually seen - for example, it would be possible to extract the probability of nascent spine loss, etc. More compelling would be to repeat the experiments and see if the model still fits the data. In the interpretation (line 314-318) it is stated that '... reduced spine maturation rate can account for the three key properties of schizophrenia-related spines...', which is interesting if true, but it has just been stated that the probability of spine destabilization is also higher in mutants (line 303) - the authors should test whether all the findings are still replicated when the latter is set to be the same as in controls.

      (7) No validation for overexpression or knockdown is shown, although it is mentioned in the methods - please include. Also, for the knockdown, a scrambled shRNA control would be preferable.

      (8) The finding regarding ecgr4 is interesting, but showing that some ecgr4 is expressed at boutons and spines and some in DCVs is not enough evidence to suggest that it is actively involved in the regulation of synapse formation and maturation (line 356).

      (9) The same caveats that apply to the analysis also apply to the ecgr4 rescue. In addition, while for 22q the control shRNA mutant vs WT looks vaguely like Figure 2, setd1a looks completely different. And if rescued, surely shRNA in the mutant should now resemble control in WT, so there shouldn't be big differences, but in fact, there are just as many differences as comparing mutant vs wildtype? Plus, for spine features, they only compare mutant rescue with mutant control, but this is not ideal - something more like a 2-way ANOVA is really needed. Maybe input from a statistician might be useful here?

      (10) Although this is a study entirely focused on spine changes in mouse models for Sz, there is no discussion (or citation) of the various studies that have examined this in the literature. For example, for Setd1a, smaller spines or reduced spine densities have been described in various papers (Mukai et al, Neuron 2019; Chen et al, Sci Adv 2022; Nagahama et al, Cell Rep 2020).

      (11) There is a conceptual problem with the models if being used to differentiate autism risk from Sz risk genes. It is difficult to find good mouse models for Sz, so the choice of 22q11.2del and Setd1a haploinsufficiency is completely reasonable. However, these are both syndromic. 22qdel syndrome involves multiple issues, including hearing loss, delayed development, and learning disabilities, and is associated with autism (20% have autism, as compared to 25% with Sz). Similarly, Setd1a is also strongly associated with autism as well as Sz (and also involves global developmental delay and intellectual disability). While I think this is still the best we can do, and it is reasonable to say that these models show biased risk for these developmental disorders, it definitely can't be used as an explanation for the higher variability seen in the autism risk models.

      (12) I am not convinced that using dissociated cultures is 'more likely to reflect the direct impact of schizophrenia-related gene mutations on synaptic properties' - first, cultures do have non-neuronal cells, although here glial proliferation was arrested at 2 days, glia will be present with the protocol used (or if not, this needs demonstrating). Second, activity levels will affect spine size, and activity patterns are very abnormal in dissociated cultures, so it is very possible that spine changes may not translate into in vivo scenarios. Overall, it is a weakness that the dissociated culture system has been used, which is not to say that it is not useful, and from a technical and practical perspective, there are good justifications.

      (13) As a minor comment, the spine time-lapse imaging is a strength of the paper. I wonder about the interpretation of Figure 5. For example, the results in Figure 5G and J look as if they may be more that the spines grow to a smaller size and start from a smaller size, rather than necessarily the rate of growth.
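      As an illustration of point (3) above on the unit of analysis, the sketch below collapses spine-level measurements to one value per cell before any group comparison. It is a minimal example with made-up numbers and hypothetical field names, not a substitute for the nested mixed model recommended in that point, which would additionally account for culture and animal.

```python
from collections import defaultdict
from statistics import mean

def cell_means(records):
    """Collapse spine-level values to one mean per (genotype, cell)."""
    by_cell = defaultdict(list)
    for genotype, cell_id, value in records:
        by_cell[(genotype, cell_id)].append(value)
    return {key: mean(values) for key, values in by_cell.items()}

# Hypothetical spine volumes: (genotype, cell identifier, volume).
records = [
    ("WT", "cell1", 0.12), ("WT", "cell1", 0.10), ("WT", "cell2", 0.15),
    ("mutant", "cell3", 0.08), ("mutant", "cell3", 0.09), ("mutant", "cell4", 0.11),
]

per_cell = cell_means(records)
n_spines = len(records)   # n if every spine is treated as independent
n_cells = len(per_cell)   # n if the cell is the unit of analysis
```

      Here six spines reduce to four cells, so a test run on `per_cell` uses a smaller and more defensible n than one run on the raw spine values.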

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      *The authors have a longstanding focus on, and reputation in, single-cell sequencing technology development and application. In the current study, the authors developed a novel single-cell multi-omic assay termed "T-ChIC" to jointly profile histone modifications along with the full-length transcriptome from the same single cells, and analyzed the dynamic relationship between chromatin state and gene expression during zebrafish development and cell fate determination. In general, the assay works well, the data look convincing and the conclusions are beneficial to the community. *

      Thank you for your positive feedback.

      *There are several single-cell methodologies that all claim to co-profile chromatin modifications and gene expression from the same individual cell, such as CoTECH, Paired-tag and others. Although T-ChIC differs in employing pA-MNase and IVT to obtain these modalities from single cells, could the authors provide some direct comparisons among all these technologies to see whether T-ChIC outperforms them? *

      In a separate technical manuscript describing the application of T-ChIC in mouse cells (Zeller, Blotenburg et al 2024, bioRxiv, 2024.05.09.593364), we have provided a direct comparison of data quality between T-ChIC and other single-cell methods for chromatin-RNA co-profiling (please refer to Fig. 1C,D and Fig. S1D,E of the preprint). We show that compared to other methods, T-ChIC is able to better preserve the expected biological relationship between the histone modifications and gene expression in single cells.

      *In the current study, T-ChIC profiled H3K27me3 and H3K4me1 modifications, and these data look great. How about other histone modifications (e.g., H3K9me3 and H3K36me3) and transcription factors? *

      While we haven't profiled these other modifications using T-ChIC in Zebrafish, we have previously published high quality data on these histone modifications using the sortChIC method, on which T-ChIC is based (Zeller, Yeung et al 2023). In our comparison, we find that histone modification profiles between T-ChIC and sortChIC are very similar (Fig. S1C in Zeller, Blotenburg et al 2024). Therefore the method is expected to work as well for the other histone marks.

      *T-ChIC can detect full length transcription from the same single cells, but in FigS3, the authors still used other published single cell transcriptomics to annotate the cell types, this seems unnecessary? *

      We used the published scRNA-seq dataset with a larger number of cells to homogenize our cell type labels with these datasets, but we also cross-referenced our cluster-specific marker genes with ZFIN and homogenized the cell type labels with ZFIN ontology. This way our annotation is in line with previous datasets but not biased by them. Due to the relatively small size of our data, we didn't expect to identify unique, rare cell types, but our full-length total RNA assay helps us identify non-coding RNAs such as miRNA previously undetected in scRNA assays, which we have now highlighted in new figure S1c.

      *Throughout the manuscript, the authors found some interesting dynamics between chromatin state and gene expression during embryogenesis, independent approaches should be used to validate these findings, such as IHC staining or RNA ISH? *

      We appreciate that the ISH staining could be useful to validate the expression pattern of genes identified in this study. But to validate the relationships between the histone marks and gene expression, we need to combine these stainings with functional genomics experiments, such as PRC2-related knockouts. Due to their complexity, such experiments are beyond the scope of this manuscript (see also reply to reviewer #3, comment #4 for details).

      *In Fig2 and FigS4, the authors showed H3K27me3 cis spreading during development, which looks really interesting. Is this zebrafish specific? H3K27me3 ChIP-seq or CUT&Tag data from mouse and/or human embryos should be reanalyzed and used for comparison. Could the authors speculate on possible mechanisms to explain this spreading pattern? *

      Thanks for the suggestion. In this revision, we have reanalysed a mouse H3K27me3 ChIP-seq dataset covering embryonic development by Xiang et al (Nature Genetics 2019) and find similar evidence of spreading of H3K27me3 signal from pre-marked promoter regions in the E5.5 epiblast upon differentiation (new Figure S4i). This observation, combined with the fact that the mechanism of pre-marking of promoters by PRC1-PRC2 interaction seems to be conserved between the two species (see Hickey et al., 2022; Mei et al., 2021; Chen et al., 2021), suggests that the dynamics of H3K27me3 pattern establishment are conserved across vertebrates. But we think high-resolution profiling via a method like T-ChIC would be more useful to demonstrate the dynamics of signal spreading during mouse embryonic development in the future. We have discussed this further in our revised manuscript.

      Reviewer #1 (Significance (Required)):

      *The authors have a longstanding focus and reputation on single cell sequencing technology development and application. In this current study, the authors developed a novel single-cell multi-omic assay termed "T-ChIC" so that to jointly profile the histone modifications along with the full-length transcriptome from the same single cells, analyzed the dynamic relationship between chromatin state and gene expression during zebrafish development and cell fate determination. In general, the assay works well, the data look convincing and conclusions are beneficial to the community. *

      Thank you very much for your supportive remarks.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      *Joint analysis of multiple modalities in single cells will provide a comprehensive view of cell fate states. In this manuscript, Bhardwaj et al developed a single-cell multi-omics assay, T-ChIC, to simultaneously capture histone modifications and full-length transcriptome and applied the method on early embryos of zebrafish. The authors observed a decoupled relationship between the chromatin modifications and gene expression at early developmental stages. The correlation becomes stronger as development proceeds, as genes are silenced by the cis-spreading of the repressive marker H3k27me3. Overall, the work is well performed, and the results are meaningful and interesting to readers in the epigenomic and embryonic development fields. There are some concerns before the manuscript is considered for publication. *

      We thank the reviewer for appreciating the quality of our study.

      *Major concerns: *

        • A major point of this study is to understand embryo development, especially gastrulation, with the power of scMulti-Omics assay. However, the current analysis didn't focus on deciphering the biology of gastrulation, i.e., lineage-specific pioneer factors that help to reform the chromatin landscape. The majority of the data analysis is based on the temporal dimension, but not the cell-type-specific dimension, which reduces the value of the single-cell assay. *

      We focused on the lineage-specific transcription factor activity during gastrulation in Figure 4 and S8 of the manuscript and discovered several interesting regulators active at this stage. During our analysis of the temporal dimension for the rest of the manuscript, we also classified the cells by their germ layer and "latent" developmental time by taking the full advantage of the single-cell nature of our data. Additionally, we have now added the cell-type-specific H3K27-demethylation results for 24hpf in response to your comment below. We hope that these results, together with our openly available dataset would demonstrate the advantage of the single-cell aspect of our dataset.

      1. *The cis-spreading of H3K27me3 with developmental time is interesting. Considering H3k27me3 could mark bivalent regions, especially in pluripotent cells, there must be some regions that have lost H3k27me3 signals during development. Therefore, it's confusing that the authors didn't find these regions (30% spreading, 70% stable). The authors should explain and discuss this issue. *

      Indeed, we see that ~30% of the bins enriched in the pluripotent stage spread, while 70% do not seem to spread. In line with earlier observations (Hickey et al., 2022; Vastenhouw et al., 2010), we find that H3K27me3 is almost absent in the zygote and is still being accumulated until 24hpf and beyond. Therefore the majority of the sites in the genome still seem to be in the process of gaining H3K27me3 until 24hpf, explaining why we see mostly "spreading" and "stable" states. Considering most of these sites are at promoters and show signs of bivalency, we think that these sites are marked for activation or silencing at later stages. We have discussed this in the manuscript ("discussion"). However, in response to this and the earlier comment, we went back and searched for genes that show H3K27-demethylation in the most mature cell types (at 24 hpf) in our data, and found a subset of genes that lose H3K27 methylation after acquiring it earlier. Interestingly, most of the top genes in this list are well-known as developmentally important for their corresponding cell types. We have added this new result and discussed it further in the manuscript (Fig. 2d,e, Supplementary table 3).

      *Minors: *

        • The authors cited two scMulti-omics studies in the introduction, but there have been lots of single-cell multi-omics studies published recently. The authors should cite and consider them. *

      We have cited more single-cell chromatin and multiome studies focussed on early embryogenesis in the introduction now.

      *2. T-ChIC seems to have been presented in a previous paper (ref 15). Therefore, Fig. 1a is unnecessary to show. *

      Figure 1a. shows a summary of our Zebrafish TChIC workflow, which contains the unique sample multiplexing and sorting strategy to reduce batch effects, which was not applied in the original TChIC workflow. We have now clarified this in "Results".

      1. *It's better to show the percentage of cell numbers (30% vs 70%) for each heatmap in Figure 2C. *

      We have added the numbers to the corresponding legends.

      1. *Please double-check the citation of Fig. S4C, which may not relate to the conclusion of signal differences between lineages. *

      The citation seems to be correct (Fig. S4C supplements Fig. 2C, but shows mesodermal lineage cells) but the description of the legend was a bit misleading. We have clarified this now.

      *5. Figure 4C has not been cited or mentioned in the main text. Please check. *

      Thanks for pointing it out. We have cited it in Results now.

      Reviewer #2 (Significance (Required)):

      *Strengths: This work utilized a new single-cell multi-omics method and generated abundant epigenomics and transcriptomics datasets for cells covering multiple key developmental stages of zebrafish. *

      *Limitations: The data analysis was superficial and mainly focused on the correspondence between the two modalities. The discussion of developmental biology was limited. *

      *Advance: The zebrafish single-cell datasets are valuable. The T-ChIC method is new and interesting. *

      *The audience will be specialized and from basic research fields, such as developmental biology, epigenomics, bioinformatics, etc. *

      *I'm more specialized in the direction of single-cell epigenomics, gene regulation, 3D genomics, etc. *

      Thank you for your remarks.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      *This manuscript introduces T‑ChIC, a single‑cell multi‑omics workflow that jointly profiles full‑length transcripts and histone modifications (H3K27me3 and H3K4me1) and applies it to early zebrafish embryos (4-24 hpf). The study convincingly demonstrates that chromatin-transcription coupling strengthens during gastrulation and somitogenesis, that promoter‑anchored H3K27me3 spreads in cis to enforce developmental gene silencing, and that integrating TF chromatin status with expression can predict lineage‑specific activators and repressors. *

      *Major concerns *

      1. *Independent biological replicates are absent, so the authors should process at least one additional clutch of embryos for key stages (e.g., 6 hpf and 12 hpf) with T‑ChIC and demonstrate that the resulting data match the current dataset. *

      Thanks for pointing this out. We had, in fact, performed T-ChIC experiments in four rounds of biological replicates (independent clutches of embryos) and merged the data to create our resource. Although not all timepoints were profiled in each replicate, two timepoints (10 and 24hpf) are present in all four, and the cell type composition of the replicates at these two timepoints is very similar. We have added new plots in figure S2f and a new supplementary table (#1) to highlight the presence of biological replicates.

      2. *The TF‑activity regression model uses an arbitrary R² ≥ 0.6 threshold; cross‑validated R² distributions, permutation‑based FDR control, and effect‑size confidence intervals are needed to justify this cut‑off. *

      Thank you for this suggestion. We did use 10-fold cross validation during training and obtained the R2 values of TF motifs from the independent test set as an unbiased estimate. However, the cutoff of R2 > 0.6 to select the TFs for classification was indeed arbitrary. In the revised version, we now report the FDR-adjusted p-values for these R2 estimates based on permutation tests, and select TFs with a cutoff of padj supplementary table #4 to include the p-values for all tested TFs. However, we see that our arbitrary cutoff of 0.6 was in fact, too stringent, and we can classify many more TFs based on the FDR cutoffs. We also updated our reported numbers in Fig. 4c to reflect this. Moreover, supplementary table #4 contains the complete list of TFs used in the analysis to allow others to choose their own cutoff.
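      The permutation-based FDR control described in this reply can be illustrated with a minimal sketch. It assumes that, for each TF motif, held-out observed values and model predictions are available, and that the null distribution is built by shuffling the observed values against fixed predictions. The authors' actual permutation scheme and cutoff are not given here, so the function names and the shuffling strategy below are illustrative assumptions only.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def permutation_p_value(y_true, y_pred, n_perm=1000, seed=0):
    """One-sided p-value: how often a shuffled response reaches the observed R2.

    Assumption: shuffling y_true against fixed predictions is used as the null;
    the manuscript may permute at a different step (e.g. before training).
    """
    rng = random.Random(seed)
    observed = r_squared(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_squared(shuffled, y_pred) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

      In such a scheme, TFs would be selected by thresholding the adjusted p-values rather than R² itself, which is the change the reply describes.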

      3. *Predicted TF functions lack empirical support, making it essential to test representative activators (e.g., Tbx16) and repressors (e.g., Zbtb16a) via CRISPRi or morpholino knock‑down and to measure target‑gene expression and H3K4me1 changes. *

      We agree that independent validation of the functions of our predicted TFs on target gene activity would be important. During this revision, we analysed recently published scRNA-seq data from Saunders et al. (2023), which include CRISPR-mediated F0 knockouts of a couple of our predicted TFs, but the scRNA-seq was performed at later stages (24hpf onward) compared to our H3K4me1 analysis (which was 4-12 hpf). Therefore, we saw off-target genes being affected in lineages where these TFs are clearly not expressed (attached Fig 1). We therefore didn't include these results in the manuscript. In the future, we aim to systematically test the TFs predicted in our study with CRISPRi or similar experiments.

      4. *The study does not prove that H3K27me3 spreading causes silencing; embryos treated with an Ezh2 inhibitor or prc2 mutants should be re‑profiled by T‑ChIC to show loss of spreading along with gene re‑expression. *

      We appreciate the suggestion that PRC2 disruption followed by T-ChIC or other forms of validation would be needed to confirm whether the H3K27me3 spreading is indeed causally linked to the silencing of the identified target genes. But performing this validation is complicated for multiple reasons: 1) due to the EZH2 contribution from maternal RNA and the contradicting effects of various EZH2 zygotic mutations (depending on where the mutation occurs), the only properly validated PRC2-related mutant seems to be the maternal-zygotic mutant MZezh2, which requires germ cell transplantation (see Rougeot et al., 2019, and San et al., 2019, for details). The use of inhibitors has been described in other studies (den Broeder et al., 2020; Huang et al., 2021), but they do not show a validation of the H3K27me3 loss or a similar phenotype as the MZezh2 mutants, and can present unwanted side effects and toxicity at a high dose, affecting gene expression results. Moreover, in an attempt to validate, we performed our own trials with the EZH2 inhibitor (GSK123) and saw that this time window might be too short to see the effect within 24hpf (attached Fig. 2). Therefore, this validation is a more complex endeavor beyond the scope of this study. Nevertheless, our further analysis of H3K27me3 de-methylation on developmentally important genes (new Fig. 2e-f, Sup. table 3) adds more confidence that the polycomb repression plays an important role, and provides enough ground for future follow-up studies.

      *Minor concerns *

      1. *Repressive chromatin coverage is limited, so profiling an additional silencing mark such as H3K9me3 or DNA methylation would clarify cooperation with H3K27me3 during development. *

      We agree that H3K27me3 alone would not be sufficient to fully understand the repressive chromatin state. Extension to other chromatin marks and DNA methylation would be the focus of our follow up works.

      *2. Computational transparency is incomplete; a supplementary table listing all trimming, mapping, and peak‑calling parameters (cutadapt, STAR/hisat2, MACS2, histoneHMM, etc.) should be provided. *

      As mentioned in the manuscript, we provide an open-source pre-processing pipeline "scChICflow" to perform all these steps (github.com/bhardwaj-lab/scChICflow). We have now also provided the configuration files on our zenodo repository (see below), which can simply be plugged into this pipeline together with the fastq files from GEO to obtain the processed dataset that we describe in the manuscript. Additionally, we have also clarified the peak calling and post-processing steps in the manuscript now.

      *3. Data‑ and code‑availability statements lack detail; the exact GEO accession release date, loom‑file contents, and a DOI‑tagged Zenodo archive of analysis scripts should be added. *

      We have now publicly released the .h5ad files with raw counts, normalized counts, and complete gene and cell-level metadata, along with signal tracks (bigwigs) and peaks on GEO. Additionally, we now also released the source datasets and notebooks (.Rmarkdown format) on Zenodo that can be used to replicate the figures in the manuscript, and updated our statements on "Data and code availability".

      *4. Minor editorial issues remain, such as replacing "critical" with "crucial" in the Abstract, adding software version numbers to figure legends, and correcting the SAMtools reference. *

      Thank you for spotting them. We have fixed these issues.

      Reviewer #3 (Significance (Required)):

      The method is technically innovative and the biological insights are valuable; however, several issues-mainly concerning experimental design, statistical rigor, and functional validation-must be addressed to solidify the conclusions.

      Thank you for your comments. We hope to have addressed your concerns in this revised version of our manuscript.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary

      This is a strong paper that presents a clear advance in multi-animal tracking. The authors introduce an updated version of idtracker.ai that reframes identity assignment as a contrastive learning problem rather than a classification task requiring global fragments. This change leads to gains in speed and accuracy. The method eliminates a known bottleneck in the original system, and the benchmarking across species is comprehensive and well executed. I think the results are convincing and the work is significant.

      Strengths

      The main strengths are the conceptual shift from classification to representation learning, the clear performance gains, and the fact that the new version is more robust. Removing the need for global fragments makes the software more flexible in practice, and the accuracy and speed improvements are well demonstrated. The software appears thoughtfully implemented, with GUI updates and integration with pose estimators.

      Weaknesses

      I don't have any major criticisms, but I have identified a few points that should be addressed to improve the clarity and accuracy of the claims made in the paper.

      (1) The title begins with "New idtracker.ai," which may not age well and sounds more promotional than scientific. The strength of the work is the conceptual shift to contrastive representation learning, and it might be more helpful to emphasize that in the title rather than branding it as "new."

      We considered using “Contrastive idtracker.ai”. However, we thought that readers could then think that we believe they could use both the old idtracker.ai or this contrastive version. But we want to say that the new version is the one to use as it is better in both accuracy and tracking times. We think “New idtracker.ai” communicates better that this version is the version we recommend.

      (2) Several technical points regarding the comparison between TRex (a system evaluated in the paper) and idtracker.ai should be addressed to ensure the evaluation is fair and readers are fully informed.

      (2.1) Lines 158-160: The description of TRex as based on "Protocol 2 of idtracker.ai" overlooks several key additions in TRex, such as posture image normalization, tracklet subsampling, and the use of uniqueness feedback during training. These features are not acknowledged, and it's unclear whether TRex was properly configured - particularly regarding posture estimation, which appears to have been omitted but isn't discussed. Without knowing the actual parameters used to make comparisons, it's difficult to assess how the method was evaluated.

      We added the information about the key additions of TRex in the section “The new idtracker.ai uses representation learning”, lines 153-157. Posture estimation in TRex was not explicitly used but neither disabled during the benchmark; we clarified this in the last paragraph of “Benchmark of accuracy and tracking time”, lines 492-495.

      (2.2) Lines 162-163: The paper implies that TRex gains speed by avoiding Protocol 3, but in practice, idtracker.ai also typically avoids using Protocol 3 due to its extremely long runtime. This part of the framing feels more like a rhetorical contrast than an informative one.

      We removed this, see new lines 153-157.

      (2.3) Lines 277-280: The contrastive loss function is written using the label l, but since it refers to a pair of images, it would be clearer and more precise to write it as l_{I,J}. This would help readers unfamiliar with contrastive learning understand the formulation more easily.

      We added this change in lines 613-620.

      (2.4) Lines 333-334: The manuscript states that TRex can fail to track certain videos, but this may be inaccurate depending on how the authors classify failures. TRex may return low uniqueness scores if training does not converge well, but this isn't equivalent to tracking failure. Moreover, the metric reported by TRex is uniqueness, not accuracy. Equating the two could mislead readers. If the authors did compare outputs to human-validated data, that should be stated more explicitly.

      We observed TRex crashing without outputting any trajectories on some occasions (Appendix 1—figure 1), and this is what we labeled as “failure”. These failures happened in the most difficult videos of our benchmark, that’s why we treated them the same way as idtracker.ai going to P3. We clarified this in new lines 464-469.

      The accuracy measured in our benchmark is not estimated but human-validated (see section Computation of tracking accuracy in Appendix 1). Both software packages report some quality estimators at the end of a tracking run (“estimated accuracy” for idtracker.ai and “uniqueness” for TRex) but these were not used in the benchmark.

      (2.5) Lines 339-341: The evaluation approach defines a "successful run" and then sums the runtime across all attempts up to that point. If success is defined as simply producing any output, this may not reflect how experienced users actually interact with the software, where parameters are iteratively refined to improve quality.

      Yes, our benchmark was designed to be agnostic to the different levels of experience of the user. It also models users who do not inspect the trajectories and then choose parameters again, so as to leave no room for potential subjectivity.

      (2.6) Lines 344-346: The simulation process involves sampling tracking parameters 10,000 times and selecting the first "successful" run. If parameter tuning is randomized rather than informed by expert knowledge, this could skew the results in favor of tools that require fewer or simpler adjustments. TRex relies on more tunable behavior, such as longer fragments improving training time, which this approach may not capture.

      We precisely used the TRex parameter track_max_speed to elongate fragments for optimal tracking. Rather than randomized parameter tuning, we defined the “valid range” for this parameter so that all values in it would produce a decent fragment structure. We used this procedure to avoid worsening those methods that use more parameters.

      (2.7) Line 354 onward: TRex was evaluated using two varying parameters (threshold and track_max_speed), while idtracker.ai used only one (intensity_threshold). With a fixed number of samples, this asymmetry could bias results against TRex. In addition, users typically set these parameters based on domain knowledge rather than random exploration.

      idtracker.ai and TRex have several parameters. Some of them have a single correct value (e.g. number of animals) or the default value that the system computes is already good (e.g. minimum blob size). For a second type of parameters, the system finds a value that is in general not as good, so users need to modify them. In general, users find that for this second type of parameter there is a valid interval of possible values, from which they need to choose a single value to run the system. idtracker.ai has intensity_threshold as the only parameter of this second type and TRex has two: threshold and track_max_speed. For these parameters, choosing one value or another within the valid interval can give different tracking results. Therefore, when we model a user that wants to run the system once except if it goes to P3 (idtracker.ai) or except if it crashes (TRex), it is these parameters we sample from within the valid interval to get a different value for each run of the system. We clarify this in lines 452-469 of the section “Benchmark of accuracy and tracking time”.

      Note that if we chose to simply run old idtracker.ai (v4 or v5) or TRex a single time, this would benefit the new idtracker.ai (v6). This is because old idtracker.ai can enter the very slow Protocol 3 and TRex can fail to track. So running old idtracker.ai or TRex up to 5 times, until old idtracker.ai does not use Protocol 3 and TRex does not fail, makes them as good as they can be with respect to the new idtracker.ai.
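      The user model described in this reply (sample a parameter from its valid interval, rerun on failure, add up the time of every attempt) can be sketched as follows. This is only an illustration: `run_tracker` is a hypothetical callable standing in for a tracking run, uniform sampling within the interval is an assumption, and returning the last attempt when all five fail is also an assumption, since the reply does not say how that case is handled.

```python
import random


def simulate_user(run_tracker, valid_range, max_attempts=5, rng=None):
    """Simulate one user who reruns the tracker until it succeeds.

    run_tracker(value) is a hypothetical stand-in returning a tuple
    (accuracy, runtime, success). "Failure" would mean entering Protocol 3
    for old idtracker.ai, or crashing for TRex.
    """
    rng = rng or random.Random()
    low, high = valid_range
    total_time = 0.0
    accuracy = None
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        # Assumption: the parameter is drawn uniformly from its valid interval.
        value = rng.uniform(low, high)
        accuracy, runtime, success = run_tracker(value)
        total_time += runtime  # every attempt counts towards tracking time
        if success:
            break
    # Assumption: if no attempt succeeds, the last attempt's result is kept.
    return accuracy, total_time, attempts
```

      Repeating this simulation many times (the reviewer mentions 10,000 samples) would give a distribution of accuracy and total tracking time per video rather than a single value.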

      (2.8) Figure 2-figure supplement 3: The memory usage comparison lacks detail. It's unclear whether RAM or VRAM was measured, whether shared or compressed memory was included, or how memory was sampled. Since both tools dynamically adjust to system resources, the relevance of this comparison is questionable without more technical detail.

      We modified the text in the caption (new Figure 1-figure supplement 2) adding the kind of memory we measured (RAM) and how we measured it. We already have a disclaimer for this plot saying that memory management depends on the machine's available resources. We agree that this is a simple analysis of the usage of computer resources.

      (3) While the authors cite several key papers on contrastive learning, they do not use the introduction or discussion to effectively situate their approach within related fields where similar strategies have been widely adopted. For example, contrastive embedding methods form the backbone of modern facial recognition and other image similarity systems, where the goal is to map images into a latent space that separates identities or classes through clustering. This connection would help emphasize the conceptual strength of the approach and align the work with well-established applications. Similarly, there is a growing literature on animal re-identification (ReID), which often involves learning identity-preserving representations across time or appearance changes. Referencing these bodies of work would help readers connect the proposed method with adjacent areas using similar ideas, and show that the authors are aware of and building on this wider context.

      We have now added a new section in Appendix 3, “Differences with previous work in contrastive/metric learning” (lines 792-841) to include references to previous work and a description of what we do differently.

      (4) Some sections of the Results text (e.g., lines 48-74) read more like extended figure captions than part of the main narrative. They include detailed explanations of figure elements, sorting procedures, and video naming conventions that may be better placed in the actual figure captions or moved to supplementary notes. Streamlining this section in the main text would improve readability and help the central ideas stand out more clearly.

      Thank you for pointing this out. We have rewritten the Results, for example streamlining the old lines 48-74 (new lines 42-48)  by moving the comments about names, files and order of videos to the caption of Figure 1.

      Overall, though, this is a high-quality paper. The improvements to idtracker.ai are well justified and practically significant. Addressing the above comments will strengthen the work, particularly by clarifying the evaluation and comparisons.

      We thank the reviewer for the detailed suggestions. We believe we have taken all of them into consideration to improve the manuscript.

      Reviewer #2 (Public review):

      Summary:

      This work introduces a new version of the state-of-the-art idtracker.ai software for tracking multiple unmarked animals. The authors aimed to solve a critical limitation of their previous software, which relied on the existence of "global fragments" (video segments where all animals are simultaneously visible) to train an identification classifier network, in addition to addressing concerns with runtime speed. To do this, the authors have both re-implemented the backend of their software in PyTorch (in addition to numerous other performance optimizations) as well as moving from a supervised classification framework to a self-supervised, contrastive representation learning approach that no longer requires global fragments to function. By defining positive training pairs as different images from the same fragment and negative pairs as images from any two co-existing fragments, the system cleverly takes advantage of partial (but high-confidence) tracklets to learn a powerful representation of animal identity without direct human supervision. Their formulation of contrastive learning is carefully thought out and comprises a series of empirically validated design choices that are both creative and technically sound. This methodological advance is significant and directly leads to the software's major strengths, including exceptional performance improvements in speed and accuracy and a newfound robustness to occlusion (even in severe cases where no global fragments can be detected). Benchmark comparisons show the new software is, on average, 44 times faster (up to 440 times faster on difficult videos) while also achieving higher accuracy across a range of species and group sizes. 
      This new version of idtracker.ai is shown to consistently outperform the closely related TRex software (Walter & Couzin, 2021), which, together with the engineering innovations and usability enhancements (e.g., outputs convenient for downstream pose estimation), positions this tool as an advancement on the state-of-the-art for multi-animal tracking, especially for collective behavior studies.
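      The pair definition summarized above (positive pairs are different images from the same fragment, negative pairs are images from two co-existing fragments) can be sketched as below. The `Fragment` class and the uniform sampling are hypothetical simplifications; the software's actual data structures and its pair sampling strategy are not described in this review.

```python
import random
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical fragment: images of one animal over a frame interval."""
    start: int   # first frame (inclusive)
    end: int     # last frame (inclusive)
    images: list


def coexist(a, b):
    """Two fragments co-exist if their frame intervals overlap."""
    return a.start <= b.end and b.start <= a.end


def sample_positive_pair(fragments, rng):
    """Two different images from the same fragment (same identity)."""
    candidates = [f for f in fragments if len(f.images) >= 2]
    fragment = rng.choice(candidates)
    first, second = rng.sample(fragment.images, 2)
    return first, second


def sample_negative_pair(fragments, rng):
    """One image from each of two co-existing fragments (different identities)."""
    pairs = [
        (a, b)
        for i, a in enumerate(fragments)
        for b in fragments[i + 1:]
        if coexist(a, b)
    ]
    a, b = rng.choice(pairs)
    return rng.choice(a.images), rng.choice(b.images)
```

      The key point is that co-existence in time guarantees that two fragments belong to different animals, so negative labels come for free from the heuristic tracking and no global fragment is needed.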

      Despite these advances, we note a number of weaknesses and limitations that are not well addressed in the present version of this paper:

      Weaknesses

      (1) The contrastive representation learning formulation. Contrastive representation learning using deep neural networks has long been used for problems in the multi-object tracking domain, popularized through ReID approaches like DML (Yi et al., 2014) and DeepReID (Li et al., 2014). More recently, contrastive learning has become more popular as an approach for scalable self-supervised representation learning for open-ended vision tasks, as exemplified by approaches like SimCLR (Chen et al., 2020), SimSiam (Chen et al., 2020), and MAE (He et al., 2021) and instantiated in foundation models for image embedding like DINOv2 (Oquab et al., 2023). Given their prevalence, it is useful to contrast the formulation of contrastive learning described here relative to these widely adopted approaches (and why this reviewer feels it is appropriate):

      (1.1) No rotations or other image augmentations are performed to generate positive examples. These are not necessary with this approach since the pairs are sampled from heuristically tracked fragments (which produces sufficient training data, though see weaknesses discussed below) and the crops are pre-aligned egocentrically (mitigating the need for rotational invariance).

      (1.2) There is no projection head in the architecture, like in SimCLR. Since classification/clustering is the only task that the system is intended to solve, the more general "nuisance" image features that this architectural detail normally affords are not necessary here.

      (1.3) There is no stop gradient operator like in BYOL (Grill et al., 2020) or SimSiam. Since the heuristic tracking implicitly produces plenty of negative pairs from the fragments, there is no need to prevent representational collapse due to class asymmetry. Some care is still needed, but the authors address this well through a pair sampling strategy (discussed below).

      (1.4) Euclidean distance is used as the distance metric in the loss rather than cosine similarity as in most contrastive learning works. While cosine similarity coupled with L2-normalized unit hypersphere embeddings has proven to be a successful recipe to deal with the curse of dimensionality (with the added benefit of bounded distance limits), the authors address this through a cleverly constructed loss function that essentially allows direct control over the intra- and inter-cluster distance (D_pos and D_neg). This is a clever formulation that aligns well with the use of K-means for the downstream assignment step.

      No concerns here, just clarifications for readers who dig into the review. Referencing the above literature would enhance the presentation of the paper to align with the broader computer vision literature.

      Thank you for this detailed comparison. We have now added a new section in Appendix 3, “Differences with previous work in contrastive/metric learning” (lines 792-841) to include references to previous work and a description of what we do differently, including the points raised by the reviewer.
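      To make the role of D_pos and D_neg in point (1.4) concrete, here is a minimal sketch of a double-margin contrastive loss on Euclidean distances. It is a common textbook form and is assumed here for illustration only; the manuscript's exact expression is not reproduced in this exchange and may differ. The pair label is written `label_ij`, following the l_{I,J} notation suggested by Reviewer #1.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pair_loss(emb_i, emb_j, label_ij, d_pos, d_neg):
    """Double-margin contrastive loss for one image pair (assumed form).

    label_ij = 1 for a positive pair (same fragment), 0 for a negative pair
    (co-existing fragments). Positive pairs are penalized only when farther
    apart than d_pos; negative pairs only when closer than d_neg.
    """
    distance = euclidean(emb_i, emb_j)
    if label_ij == 1:
        return max(0.0, distance - d_pos) ** 2
    return max(0.0, d_neg - distance) ** 2
```

      With a loss of this shape, D_pos bounds the spread of images within a cluster and D_neg sets the minimum separation between clusters, which is the kind of structure K-means can then exploit.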

      (2) Network architecture for image feature extraction backbone. As most of the computations that drive up processing time happen in the network backbone, the authors explored a variety of architectures to assess speed, accuracy, and memory requirements. They land on ResNet18 due to its empirically determined performance. While the experiments that support this choice are solid, the rationale behind the architecture selection is somewhat weak. The authors state that: "We tested 23 networks from 8 different families of state-of-the-art convolutional neural network architectures, selected for their compatibility with consumer-grade GPUs and ability to handle small input images (20 × 20 to 100 × 100 pixels) typical in collective animal behavior videos."

      (2.1) Most modern architectures have variants that are compatible with consumer-grade GPUs. This is true of, for example, HRNet (Wang et al., 2019), ViT (Dosovitskiy et al., 2020), SwinT (Liu et al., 2021), or ConvNeXt (Liu et al., 2022), all of which report single GPU training and fast runtime speeds through lightweight configuration or subsequent variants, e.g., MobileViT (Mehta et al., 2021). The authors may consider revising that statement or providing additional support for that claim (e.g., empirical experiments) given that these have been reported to outperform ResNet18 across tasks.

      Following the recommendation of the reviewer, we tested the architectures SwinT, ConvNeXt and ViT. None of them outperformed ResNet18: all showed slower learning curves, which would result in longer tracking times. These tests are now included in the section “Network architecture” (lines 550-611).

      (2.2) The compatibility of different architectures with small image sizes is configurable. Most convolutional architectures can be readily adapted to work with smaller image sizes, including 20x20 crops. With their default configuration, they lose feature map resolution through repeated pooling and downsampling steps, but this can be readily mitigated by swapping out standard convolutions with dilated convolutions and/or by setting the stride of pooling layers to 1, preserving feature map resolution across blocks. While these are fairly straightforward modifications (and are even compatible with using pretrained weights), an even more trivial approach is to pad and/or resize the crops to the default image size, which is likely to improve accuracy at a possibly minimal memory and runtime cost. These techniques may even improve the performance with the architectures that the authors did test out.

      The only two tested architectures that require a minimum image size are AlexNet and DenseNet. DenseNet proved to underperform ResNet18 in the videos where the images are sufficiently large. We have tested AlexNet with padded images to see that it also performs worse than ResNet18 (see Appendix 3—figure 1).

      We also tested the initialization of ResNet18 with pre-trained weights from ImageNet (in Appendix 3—figure 2) and it proved to bring no benefit to the training speed (added in lines 591-592).
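
      The padding approach mentioned in point (2.2) can be sketched as follows. This is an illustrative example only, not the code used in the tests above: the function name, the zero fill value and the centered placement are assumptions.

```python
def pad_to_size(crop, target_h, target_w, fill=0):
    """Center a 2D crop (a list of equal-length rows) on a larger canvas.

    Returns a new target_h x target_w list of rows; pixels outside the
    original crop are set to `fill`.
    """
    h = len(crop)
    w = len(crop[0]) if h else 0
    if h > target_h or w > target_w:
        raise ValueError("crop is larger than the target size")
    top = (target_h - h) // 2
    left = (target_w - w) // 2
    canvas = [[fill] * target_w for _ in range(target_h)]
    for r, row in enumerate(crop):
        # Overwrite the central window of the canvas with the crop row.
        canvas[top + r][left:left + w] = row
    return canvas
```

      Padding keeps the animal at its original pixel scale, whereas resizing would change it; which of the two suits a given network is an empirical question.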

      (2.3) The authors do not report whether the architecture experiments were done with pretrained or randomly initialized weights.

      We adapted the text to make it clear that the networks are always randomly initialized (lines 591-592, lines 608-609 and the captions of Appendix 3—figure 1 and 2).

      (2.4) The authors do not report some details about their ResNet18 design, specifically whether a global pooling layer is used and whether the output fully connected layer has any activation function. Additionally, they do not report the version of ResNet18 employed here, namely, whether the BatchNorm and ReLU are applied after (v1) or before (v2) the conv layers in the residual path.

      We use ResNet18 v1 with neither an activation function nor a bias in its last layer (this has been clarified in lines 606-608). Also, by design, ResNet has a global average pool right before the last fully connected layer, which we did not remove. In response to the reviewer, ResNet18 v2 was tested and its performance is the same as that of v1 (see Appendix 3—figure 1 and lines 590-591).

      (3) Pair sampling strategy. The authors devised a clever approach for sampling positive and negative pairs that is tailored to the nature of the formulation. First, since the positive and negative labels are derived from the co-existence of pretracked fragments, selection has to be done at the level of fragments rather than individual images. This would not be the case if one of the newer approaches for contrastive learning were employed, but it serves as a strength here (assuming that fragment generation/first pass heuristic tracking is achievable and reliable in the dataset). Second, a clever weighted sampling scheme assigns sampling weights to the fragments that are designed to balance "exploration and exploitation". They weigh samples both by fragment length and by the loss associated with that fragment to bias towards different and more difficult examples.

      (3.1) The formulation described here resembles and uses elements of online hard example mining (Shrivastava et al., 2016), hard negative sampling (Robinson et al., 2020), and curriculum learning more broadly. The authors may consider referencing this literature (particularly Robinson et al., 2020) for inspiration and to inform the interpretation of the current empirical results on positive/negative balancing.

      Following this recommendation, we added references to hard negative mining in the new section “Differences with previous work in contrastive/metric learning”, lines 792-841. Regarding curriculum learning, even though in spirit it might have parallels with our sampling method, in the sense that the training of the network is guided, we believe our approach is closer to an exploration-exploitation paradigm.
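
      As a rough illustration of weighting fragments by both length and loss, the sketch below blends the two normalized quantities with a single mixing parameter. The blending rule, the parameter alpha and all function names are hypothetical; the actual weighting scheme is the one described in Appendix 3 of the manuscript and may differ.

```python
import random


def sampling_weights(lengths, losses, alpha=0.5):
    """Blend normalized fragment length and normalized per-fragment loss.

    alpha = 1 samples purely by length; alpha = 0 purely by loss.
    The returned weights sum to 1.
    """
    total_len = sum(lengths)
    total_loss = sum(losses)
    weights = []
    for n, loss in zip(lengths, losses):
        by_length = n / total_len
        # Fall back to a uniform share when every loss is zero.
        by_loss = loss / total_loss if total_loss > 0 else 1.0 / len(losses)
        weights.append(alpha * by_length + (1.0 - alpha) * by_loss)
    return weights


def sample_fragments(lengths, losses, k, alpha=0.5, rng=random):
    """Draw k fragment indices (with replacement) using the blended weights."""
    weights = sampling_weights(lengths, losses, alpha)
    return rng.choices(range(len(lengths)), weights=weights, k=k)
```

      For example, `sample_fragments([10, 30], [3.0, 1.0], k=8)` draws from two fragments whose blended weights are equal, because the short fragment has the larger loss.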

      (4) Speed and accuracy improvements. The authors report considerable improvements in speed and accuracy of the new idTracker (v6) over the original idTracker (v4?) and TRex. It's a bit unclear, however, which of these are attributable to the engineering optimizations (v5?) versus the representation learning formulation.

      (4.1) Why is there an improvement in accuracy in idTracker v5 (L77-81)? This is described as a port to PyTorch and improvements largely related to the memory and data loading efficiency. This is particularly notable given that the progression went from 97.52% (v4; original) to 99.58% (v5; engineering enhancements) to 99.92% (v6; representation learning), i.e., most of the new improvement in accuracy owes to the "optimizations" which are not the central emphasis of the systematic evaluations reported in this paper.

      V5 was a two-year effort designed to improve the time efficiency of v4. The higher accuracy was also a surprise to us; it likely arises because the code from v4 that we replaced contained one or more small bugs. The improvements in v5 are retained in v6 (contrastive learning), and v6 has higher accuracy and shorter tracking times. The additional accuracy and speed of v6 come from contrastive learning.

      (4.2) What about the speed improvements? Relative to the original (v4), the authors report average speed-ups of 13.6x in v5 and 44x in v6. Presumably, the drastic speed-up in v6 comes from a lower Protocol 2 failure rate, but v6 is not evaluated in Figure 2 - figure supplement 2.

      idtracker.ai v5 runs an optimized Protocol 2 and, sometimes, Protocol 3, whereas v6 runs neither of them. P2 is still present in v6 as a fallback protocol for when contrastive learning fails, but in our v6 benchmark it was never needed. The v6 speed-up therefore comes from replacing both P2 and P3 with the contrastive algorithm.

      (5) Robustness to occlusion. A major innovation enabled by the contrastive representation learning approach is the ability to tolerate the absence of a global fragment (contiguous frames where all animals are visible) by requiring only co-existing pairs of fragments owing to the paired sampling formulation. While this removes a major limitation of the previous versions of idtracker.ai, its evaluation could be strengthened. The authors describe an ablation experiment where an arc of the arena is masked out to assess the accuracy under artificially difficult conditions. They find that the v6 works robustly up to significant proportions of occlusions, even when doing so eliminates global fragments.

      (5.1) The experiment setup needs to be more carefully described.

      (5.1.1) What does the masking procedure entail? Are the pixels masked out in the original video or are detections removed after segmentation and first pass tracking is done?

      The mask is defined as a region of interest in the software. This means that it is applied at the segmentation step where the video frame is converted to a foreground-background binary image. The region of interest is applied here, converting to background all pixels not inside of it. We clarified this in the newly added section Occlusion tests, lines 240-244.

      (5.1.2) What happens at the boundary of the mask? (Partial segmentation masks would throw off the centroids, and doing it after original segmentation does not realistically model the conditions of entering an occlusion area.)

      Animals at the boundaries of the mask are partially detected, which can shift the location of their detected centroid. For this reason, when computing the ground-truth accuracy for these videos, we considered only the groundtruth centroids that were at least 15 pixels away from the mask. We clarified this in the newly added section Occlusion tests, lines 248-251.

      (5.1.3) Are fragments still linked for animals that enter and then exit the mask area?

      No artificial fragment linking was added in these videos. Detected fragments are linked the usual way. If an animal moves into the masked region, it disappears and the fragment breaks. We clarified this in the newly added section Occlusion tests, lines 245-247.

      (5.1.4) How is the evaluation done? Is it computed with or without the masked region detections?

      The groundtruth used to validate these videos contains the positions of all animals at all times. But only the positions outside the mask at each frame were considered to compute the tracking accuracy. We clarified this in the newly added section Occlusion tests, lines 248-251.
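
      The evaluation rule described in points (5.1.2) and (5.1.4) can be sketched as below. This is an assumption-laden illustration: a plain fraction-correct score stands in for the accuracy metric actually used, and the record layout and the `distance_to_mask` callable are hypothetical.

```python
def accuracy_outside_mask(records, distance_to_mask, margin=15.0):
    """Fraction of correct identities among ground-truth points kept for scoring.

    records: iterable of (x, y, true_id, predicted_id) tuples.
    distance_to_mask: callable (x, y) -> distance in pixels from the point
        to the masked region, returning 0 for points inside it.
    margin: minimum distance from the mask for a point to be scored.
    """
    kept = [r for r in records if distance_to_mask(r[0], r[1]) >= margin]
    if not kept:
        return float("nan")
    correct = sum(1 for _, _, true_id, pred_id in kept if true_id == pred_id)
    return correct / len(kept)


# Toy mask covering the half-plane x < 0 (hypothetical geometry).
def half_plane_distance(x, y):
    return max(0.0, x)
```

      Points inside the mask and points within the margin are both excluded, so partially segmented animals at the mask boundary do not affect the score.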

      (5.2) The circular masking is perhaps not the most appropriate for the mouse data, which is collected in a rectangular arena.

      We wanted to show the same proof of concept in different videos. For that reason, in every video we used a cover over the arena parametrized by an angle. In the rectangular arena, the circular mask uses an external circle, so it covers the rectangle as parametrized by the same angle.

      (5.3) The number of co-existing fragments, which seems to be the main determinant of performance that the authors derive from this experiment, should be reported for these experiments. In particular, a "number of co-existing fragments" vs accuracy plot would support the use of the 0.25(N-1) heuristic and would be especially informative for users seeking to optimize experimental and cage design. Additionally, the number of co-existing fragments can be artificially reduced in other ways other than a fixed occlusion, including random dropout, which would disambiguate it from potential allocentric positional confounds (particularly relevant in arenas where egocentric pose is correlated with allocentric position).

      We included the requested analysis about the fragment connectivity in Figure 3-figure supplement 1. We agree that there can be additional ways of reducing co-existing fragments, but we think the occlusion tests have the additional value that there are many real experiments similar to this test.

      (6) Robustness to imaging conditions. The authors state that "the new idtracker.ai can work well with lower resolutions, blur and video compression, and with inhomogeneous light (Figure 2 - figure supplement 4)." (L156). Despite this claim, there are no speed or accuracy results reported for the artificially corrupted data, only examples of these image manipulations in the supplementary figure.

      We added this information in the same image, new Figure 1 - figure supplement 3.

      (7) Robustness across longitudinal or multi-session experiments. The authors reference idmatcher.ai as a compatible tool for this use case (matching identities across sessions or long-term monitoring across chunked videos), however, no performance data is presented to support its usage. This is relevant as the innovations described here may interact with this setting. While deep metric learning and contrastive learning for ReID were originally motivated by these types of problems (especially individuals leaving and entering the FOV), it is not clear that the current formulation is ideally suited for this use case. Namely, the design decisions described in point 1 of this review are at times at odds with the idea of learning generalizable representations owing to the feature extractor backbone (less scalable), low-dimensional embedding size (less representational capacity), and Euclidean distance metric without hypersphere embedding (possible sensitivity to drift). It's possible that data to support point 6 can mitigate these concerns through empirical results on variations in illumination, but a stronger experiment would be to artificially split up a longer video into shorter segments and evaluate how generalizable and stable the representations learned in one segment are across contiguous ("longitudinal") or discontiguous ("multi-session") segments.

      We have now added a test to prove the reliability of idmatcher.ai in v6. In this test, 14 videos are taken from the benchmark and each is split into two non-overlapping parts (with a 200-frame gap in between). idmatcher.ai is run between the two parts and matches identities with 100% accuracy in all of them (see section “Validity of idmatcher.ai in the new idtracker.ai”, lines 969-1008).

      We thank the reviewer for the detailed suggestions. We believe we have taken all of them into consideration to improve the ms.

      Reviewer #3 (Public review):

      Summary

      The authors propose a new version of idTracker.ai for animal tracking. Specifically, they apply contrastive learning to embed cropped images of animals into a feature space where clusters correspond to individual animal identities.

      Strengths

      By doing this, the new software alleviates the requirement for so-called global fragments - segments of the video, in which all entities are visible/detected at the same time - which was necessary in the previous version of the method. In general, the new method reduces the tracking time compared to the previous versions, while also increasing the average accuracy of assigning the identity labels.

      Weaknesses

      The general impression of the paper is that, in its current form, it is difficult to disentangle the old from the new method and understand the method in detail. The manuscript would benefit from a major reorganization and rewriting of its parts. There are also certain concerns about the accuracy metric and reducing the computational time.

      We have made the following modifications in the presentation:

      (1) We have added section titles to the main text so it is clearer what tracking system we are referring to. For example, we now have sections “Limitation of the original idtracker.ai”, “Optimizing idtracker.ai without changes in the learning method” and “The new idtracker.ai uses representation learning”.

      (2) We have completely rewritten all the text of the ms until we start with contrastive learning. Old L20-89 is now L20-L66, much shorter and easier to read.

      (3) We have rewritten the first 3 paragraphs in the section “The new idtracker.ai uses representation learning” (lines 68-92).

      (4) We have now expanded Appendix 3 to discuss the details of our approach (lines 539-897). It discusses in detail the steps of the algorithm, the network architecture, the loss function, the sampling strategy, the clustering and identity assignment, and the stopping criteria in training.

      (5) To cite previous work in detail and explain what we do differently, we have now added in Appendix 3 the new section “Differences with previous work in contrastive/metric learning” (lines 792-841).

      Regarding accuracy metrics, we have replaced our accuracy metric with the standard metric IDF1. IDF1 is the standard metric that is applied to systems in which the goal is to maintain consistent identities across time. See also the section in Appendix 1 "Computation of tracking accuracy” (lines 414-436) explaining IDF1 and why this is an appropriate metric for our goal.
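
      For readers unfamiliar with the metric, IDF1 is computed from the identity true positives, false positives and false negatives (IDTP, IDFP, IDFN) obtained after an optimal one-to-one matching between ground-truth and predicted trajectories. The sketch below shows only the final formula and assumes these counts are already available; the matching step is not implemented here, and the function names are illustrative.

```python
def id_precision(idtp, idfp):
    """IDP: fraction of predicted detections carrying the correct identity."""
    return idtp / (idtp + idfp) if (idtp + idfp) else 0.0


def id_recall(idtp, idfn):
    """IDR: fraction of ground-truth detections correctly identified."""
    return idtp / (idtp + idfn) if (idtp + idfn) else 0.0


def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN).

    Equivalent to the harmonic mean of id_precision and id_recall.
    """
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0
```

      Because the counts come from a single global matching of trajectories, an identity swap that persists over many frames lowers IDF1 in proportion to its duration.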

      Using IDF1 we obtain slightly higher accuracies for the idtracker.ai systems. This is the comparison of mean accuracy over our whole benchmark, for the full trajectories, between our previous accuracy score and the new one:

      v4:   97.42% -> 98.24%

      v5:   99.41% -> 99.49%

      v6:   99.74% -> 99.82%

      trex: 97.89% -> 97.89%

      We thank the reviewer for the suggestions about presentation and about the use of more standard metrics.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 1a: A graphical legend inset would make it more readable since there are multiple colors, line styles, and connecting lines to parse out.

      Following this recommendation, we added a graphical legend in the old Figure 1 (new Figure 2).

      (2) L46: "have images" → "has images".

      We applied this correction. Line 35.

      (3) L52: "videos start with a letter for the species (z,**f**,m)", but "d" is used for fly videos.

      We applied this correction in the caption of Figure 1.

      (4) L62: "with Protocol 3 a two-step process" → "with Protocol 3 being a two-step process".

      We rewrote this paragraph without mentioning Protocol 3, lines 37-41.

      (5) L82-89: This is the main statement of the problems that are being addressed here (speed and relaxing the need for global fragments). This could be moved up, emphasized, and made clearer without the long preamble and results on the engineering optimizations in v5. This lack of linearity in the narrative is also evident in the fact that after Figure 1a is cited, inline citations skip to Figure 2 before returning to Figure 1 once the contrastive learning is introduced.

      We have rewritten all the text until the contrastive learning, (old lines 20-89 are now lines 20-66). The text is shorter, more linear and easier to read.

      (6) L114: "pairs until the distance D_{pos}" → "pairs until the distance approximates D_{pos}".

      We rewrote this as “pairs until the distance 𝐷pos (or 𝐷neg) is reached” in line 107.

      (7) L570: Missing a right parenthesis in the equation.

      We no longer have this equation in the ms.

      (8) L705: "In order to identify fragments we, not only need" → "In order to identify fragments, we not only need".

      We applied this correction, Line 775.

      (9) L819: "probably distribution" → "probability distribution".

      We applied this correction, Line 776.

      (10) L833: "produced the best decrease the time required" → "produced the best decrease of the time required".

      We applied this correction, Line 746.

      Reviewer #3 (Recommendations for the authors):

      (1) We recommend rewriting and restructuring the manuscript. The paper includes a detailed explanation of the previous approaches (idTracker and idTracker.ai) and their limitations. In contrast, the description of the proposed method is short and unstructured, which makes it difficult to distinguish between the old and new methods as well as to understand the proposed method in general. Here are a few examples illustrating the problem. 

      (1.1) Only in line 90 do the authors start to describe the work done in this manuscript. The previous 3 pages list limitations of the original method.

      We have now divided the main text into sections, so it is clearer which is the previous method (“Limitation of the original idtracker.ai”, lines 28-51), which is our new optimization of this method (“Optimizing idtracker.ai without changes in the learning method”, lines 52-66) and which is the new contrastive approach that also includes the optimizations (“The new idtracker.ai uses representation learning”, lines 66-164). Also, the new text has now been streamlined until the contrastive section, following your suggestion. In the new writing, the three sections are 25, 15 and 99 lines long. The most detailed section is the one on the new system; the other two are needed as reference, to describe which problem we are solving and the extra new optimizations.

      (1.2) The new method does not have a distinct name, and it is hard to follow which idtracker.ai is a specific part of the text referring to. Not naming the new method makes it difficult to understand.

      We use the name new idtracker.ai (v6) so it becomes the current default version. v5 is now obsolete, as well as v4. And from the point of view of the end user, no new name is needed since v6 is just an evolution of the same software they have been using. Also, we added sections in the main text to clarify the ideas in there and indicate the version of idtracker.ai we are referring to.

      (1.3) There are "Protocol 2" and "Protocol 3" mixed with various versions of the software scattered throughout the text, which makes it hard to follow. There should be some systematic naming of approaches and a listing of results introduced.

      Following this recommendation, we no longer talk about the specific protocols of the old version of idtracker.ai in the main text. We have rewritten the explanation of these versions in a clearer and more straightforward way, lines 29-36.

      (2) To this end, the authors leave some important concepts either underexplained or only referenced indirectly via prior work. For example, the explanation of how the fragments are created (line 15) is only explained by the "video structure" and the algorithm that is responsible for resolving the identities during crossings is not detailed (see lines 46-47, 149-150). Including summaries of these elements would improve the paper's clarity and accessibility.

      We listed the specific sections from our previous publication where the reader can find information about the entire tracking pipeline (lines 539-549). This way, we keep the ms clear and focused on the new identification algorithm while indicating where to find such information.

      (3) Accuracy metrics are not clear. In line 319, the authors define it as based on "proportion of errors in the trajectory". This proportion is not explained. How is the error calculated if a trajectory is lost or there are identity swaps? Multi-object tracking has a range of accuracy metrics that account for such events but none of those are used by the authors. Estimating metrics that are common for MOT literature, for example, IDF1, MOTA, and MOTP, would allow for better method performance understanding and comparison.

      In the new ms, we replaced our accuracy metric with the standard metric IDF1. IDF1 is the standard metric that is applied to systems in which the goal is to maintain consistent identities across time. See also the section in Appendix 1 "Computation of tracking accuracy” explaining why IDF1 and not MOTA or MOTP is the adequate metric for a system that wants to give correct tracking by identification in time. See lines 416-436.

      Using IDF1 we obtain slightly higher accuracies for the idtracker.ai systems. This is the comparison of mean accuracy for our previous accuracy score and the new one for the full trajectories:

      v4:   97.42% -> 98.24%

      v5:   99.41% -> 99.49%

      v6:   99.74% -> 99.82%

      trex: 97.89% -> 97.89%

      (4) Additionally, the authors distinguish between tracking with and without crossings, but do not provide statistics on the frequency of crossings per video. It is also unclear how the crossings are considered for the final output. Including information such as the frame rate of the videos would help to better understand the temporal resolution and the differences between consecutive frames of the videos.

      We added this information in Appendix 1, “Benchmark of accuracy and tracking time”, lines 445-451. The frame rate in our benchmark videos ranges from 25 to 60 fps (average of 37 fps). On average, 2.6% of the blobs are crossings (1.1% for zebrafish, 0.7% for drosophila and 9.4% for mice).

      (5) In the description of the dataset used for evaluation (lines 349-365), the authors describe the random sampling of parameter values for each tracking run. However, it is unclear whether the same values were used across methods. Without this clarification, comparisons between the proposed method, older versions, and TRex might be biased due to lucky parameter combinations. In addition, the ranges from which the values were randomly sampled were also not described.

      Only one parameter is shared between idtracker.ai and TRex: intensity_threshold (in idtracker.ai) and threshold (in TRex). Both are conceptually equivalent but differ in their numerical values since they affect different algorithms. V4, v5, and TRex each required the same process of independent expert visual inspection of the segmentation to select the valid value range. Since versions 5 and 6 use exactly the same segmentation algorithm, they share the same parameter ranges.

      All the ranges of valid values used in our benchmark are public here https://drive.google.com/drive/folders/1tFxdtFUudl02ICS99vYKrZLeF28TiYpZ as stated in the section “Data availability”, lines 227-228.

      (6) Lines 122-123, Figure 1c. "batches" - is an imprecise metric of training time as there is no information about the batch size.

      We clarified the Figure caption, new Figure 2c.

      (7) Line 145 - "we run some steps... For example..." leaves the method description somewhat unclear. It would help if you could provide more details about how the assignments are carried out and which metrics are being used.

      Following this recommendation, we listed the specific sections from our previous publication where the reader can find information about the entire tracking pipeline (lines 539-549). This way, we keep the ms clear and focused on the new identification algorithm while indicating where to find such information.

      (8) Figure 3. How is tracking accuracy assessed with occlusions? Are the individuals correctly recognized when they reappear from the occluded area?

      The groundtruth for this video contains the positions of all animals at all times. Only the groundtruth points inside the region of interest are taken into account when computing the accuracy. When the tracking reaches high accuracy, it means that animals are successfully relabeled every time they enter the non-masked region. Note that this software works all the time by identification of animals, so crossings and occlusion are treated the same way. What is new here is that the occlusions are so large that there are no global fragments. We clarified this in the new section “Occlusion tests” in Methods, lines 239-251.

      (9) Lines 185-187 this part of the sentence is not clear.

      We rewrote this part in a clearer way, lines 180-182.

      (10) The authors also highlight the improved runtime performance. However, they do not provide a detailed breakdown of the time spent on each component of the tracking/training pipeline. A timing breakdown would help to compare the training duration with the other components. For example, the calculation of the Silhouette Score alone can be time-consuming and could be a bottleneck in the training process. Including this information would provide a clearer picture of the overall efficiency of the method.

      We measured that the training of ResNet takes, on average in our benchmark, 47% of the tracking time (we added this information in line 551, section “Network Architecture”). In this training stage the bottleneck is the network forward and backward pass, limited by the GPU performance. All other processes happening during training have been heavily optimized and parallelized when needed, so their contribution to the training time is minimal. Apart from the training, we also measured that 24.4% of the total tracking time is spent reading and segmenting the video files and 11.1% processing the identification images and detecting crossings.
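
      As a quick check of how much of the pipeline these figures cover, the reported shares can be summed. The sketch below assumes the three percentages are shares of the same total tracking time; the dictionary keys and the function name are illustrative labels, not names from the software.

```python
def unaccounted_share(shares_percent):
    """Percentage of total tracking time not covered by the listed components."""
    return 100.0 - sum(shares_percent.values())


reported = {
    "network training": 47.0,
    "reading and segmenting video": 24.4,
    "identification images and crossing detection": 11.1,
}

remaining = unaccounted_share(reported)
print(f"Listed components: {100.0 - remaining:.1f}%, other steps: {remaining:.1f}%")
```

      Under that assumption, the three components account for 82.5% of the tracking time, leaving about 17.5% for the remaining steps of the pipeline.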

      (11) An important part of the computational cost is related to model training. It would be interesting to test whether a model trained on one video of a specific animal type (e.g., zebrafish_5) generalizes to another video of the same type (e.g., zebrafish_7). This would assess the model's generalizability across different videos of the same species and spare a lot of compute. Alternatively, instead of training a model from scratch for each video, the authors could also consider training a base model on a superset of images from different videos and then fine-tuning it with a lower learning rate for each specific video. This could potentially save time and resources while still achieving good performance.

      Already before v6, the user could start training the identification network by copying the final weights from another tracking session. This knowledge transfer feature is still present in v6 and it still decreases the training times significantly. This information has been added in Appendix 4, lines 906-909.

      We have already begun working on the interesting idea of a general base model but it brings some complex challenges. It could be a very useful new feature for future idtracker.ai releases.

      We thank the reviewer for the many suggestions. We have implemented all of them.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #2 (Public review):

      (1) Vglut2 isn't a very selective promoter for the STN. Did the authors verify every injection across brain slices to ensure the para-subthalamic nucleus, thalamus, lateral hypothalamus, and other Vglut2-positive structures were never infected?

      The STN is anatomically well-confined, with its borders and the overlying zona incerta (composed of GABAergic neurons) providing protection against off-target expression in most neighboring forebrain regions. All viral injections were histologically verified and did not extend into thalamic or hypothalamic areas. As described in the Methods, we employed an app we developed (Brain Atlas Analyzer, available on OriginLab) that aligns serial histological sections with the Allen Brain Atlas to precisely assess viral spread and confirm targeting accuracy. The experiments included in the revised manuscript now focus on optogenetic inhibition and irreversible lesion approaches, three complementary methods that consistently targeted the STN and yielded similar behavioral effects.

      (2) The authors say in the methods that the high vs low power laser activation for optogenetic experiments was defined by the behavioral output. This is misleading, and the high vs low power should be objectively stated and the behavioral results divided according to the power used, not according to the behavioral outcome.

      Optogenetic excitation is no longer part of the study.

      (3) In the fiber photometry experiments exposing mice to the range of tones, it is impossible to separate the STN response to the tone from the STN response to the movement evoked by the tone. The authors should expose the mouse to the tones in a condition that prevents movement, such as anesthetized or restrained, to separate out the two components.

      The new mixed-effects modeling approach clearly differentiates sensory (auditory) from motor contributions during tone-evoked STN activation. In prior work (see Hormigo et al, 2023, eLife), we explored experimental methods such as head restraint or anesthesia to reduce movement, but we concluded that these approaches are unsuitable for addressing this question. Mice exhibit substantial residual movement even when head-fixed, and anesthesia profoundly alters neural excitability and behavioral state, introducing major confounds. To fully eliminate movement would require paralysis and artificial ventilation, which would again disrupt physiological network dynamics and raise ethical concerns. Therefore, the current modeling approach—incorporating window-specific covariates for movement—is the most appropriate and rigorous way to dissociate tone-evoked sensory activity from motor activity in behaving animals.

      (4) The claim 'STN activation is ideally suited to drive active avoids' needs more explanation. This claim comes after the fiber photometry experiments during active avoidance tasks, so there has been no causality established yet.

      Text adjusted. 

      (5) The statistical comparisons in Figure 7E need some justification and/or clarification. The 9 neuron types are originally categorized based on their response during avoids, then statistics are run showing that they respond differently during avoids. It is no surprise that they would have significantly different responses, since that is how they were classified in the first place. The authors must explain this further and show that this is not a case of circular reasoning.

      Statistically verifying the clustering is useful to ensure that the selected number of clusters reflects distinct classes. It is also necessary when different measurements are used to classify (movement time series classified the avoids) and to compare neuronal types within each avoid mode/class (now called “mode”). Moreover, the new modeling approach goes beyond the prior statistical limitations related to considering movement and neuronal variables separately.

      (6) The authors show that neurons that have strong responses to orientation show reduced activity during avoidance. What are the implications of this? The author should explain why this is interesting and important.

      The new modeling approach goes beyond the prior analysis limitations. For instance, it shows that most of the prior orienting related activations closely reflect the orienting movement, and only in a few cases (noted and discussed in the results) orienting activations are related to the behavioral contingencies or behavioral outcomes in the task. 

      (7) It is not clear which conditions each mouse experienced in which order. This is critical to the interpretation of Figure 9 and the reduction of passive avoids during STN stimulation. Did these mice have the CS1+STN stimulation pairing or the STN+US pairing prior to this experiment? If they did, the stimulation of the STN could be strongly associated with either punishment or with the CS1 that predicts punishment. If that is the case, stimulating the STN during CS2 could be like presenting CS1+CS2 at the same time and could be confusing.

      Optogenetic excitation is no longer part of the study. 

      (8) The experiments in Figure 10 are used to say that STN stimulation is not aversive, but they only show that STN stimulation cannot be used as punishment in place of a shock. This doesn't mean that it is not aversive; it just means it is not as aversive as a shock. The authors should do a simpler aversion test, such as conditioned or real-time place preference, to claim that STN stimulation is not aversive. This is particularly surprising as previous work (Serra et al., 2023) does show that STN stimulation is aversive.

      Optogenetic excitation is no longer part of the study.

      (9) In the discussion, the idea that the STN encodes 'moving away' from contralateral space is pretty vague and unsupported. It is puzzling that the STN activates more strongly to contraversive turns, but when stimulated, it evokes ipsiversive turns; however, it seems a stretch to speculate that this is related to avoidance. In the last experiments of the paper, the axons from the STN to the GPe and to the midbrain are selectively stimulated. Do these evoke ipsiversive turns similarly?

      Optogenetic excitation is no longer part of the study. 

      (10) In the discussion, the authors claim that the STN is essential for modulating action timing in response to demands, but their data really only show this in one direction. The STN stimulation reliably increases the speed of response in all conditions (except maximum speed conditions such as escapes). It seems to be over-interpreting the data to say this is an inability to modulate the speed of the task, especially as clear learning and speed modulation do occur under STN lesion conditions, as shown in Figure 12B. The mice learn to avoid and increase their latency in AA2 vs AA1, though the overall avoids and latency are different from controls. The more parsimonious conclusion would be that STN stimulation biases movement speed (increasing it) and that this is true in many different conditions.

      Optogenetic excitation is no longer part of the study.

      (11)  In the discussion, the authors claim that the STN projections to the midbrain tegmentum directly affect the active avoidance behavior, while the STN projections to the SNr do not affect it. This seems counter to their results, which show STN projections to either area can alter active avoidance behavior. What is the laser power used in these terminal experiments? If it is high (3mW), the authors may be causing antidromic action potentials in the STN somas, resulting in glutamate release in many brain areas, even when terminals are only stimulated in one area. The authors could use low (0.25mW) laser power in the terminals to reduce the chance of antidromic activation and spatially restrict the optical stimulation.

      Optogenetic excitation is no longer part of the study. 

      (12) Was normality tested for data prior to statistical testing?

      Yes, although we now use mixed models.

      (13) Why are there no error bars on Figure 5B, black circles and orange triangles?

      When error bars are not visible, they are smaller than the trace thickness or bar line—for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Reviewer #3 (Public review):

      (1) I really don't understand or accept this idea that delayed movement is necessarily indicative of cautious movements. Is the distribution of responses multi-modal in a way that might support this idea, or do the authors simply take a normal distribution and assert that the slower responses represent 'caution'? Even if responses are multi-modal and clearly distinguished by 'type', why should readers think that delayed responses imply cautious responding instead of, say, habituation or sensitization to cue/shock; variability in attention, motivation, or stress; or merely uncertainty, which seems plausible given what I understand of the task design, where the same mice are repeatedly tested in changing conditions? This relates to a major claim (i.e., in the work's title).

      In our study, “caution” is defined operationally as the tendency to delay initiation of an avoidance response in demanding situations (e.g., taking more time or care before crossing a busy street). The increase in avoidance latency with task difficulty is highly robust, as we have shown previously through detailed analyses of timing distributions and direct comparisons with appetitive behaviors (e.g., Zhou et al., 2022 JNeurosci). Moreover, we used the tracked movement time series to statistically classify responses into cautious modes, which is likely novel. This definition can dissociate cautious responding from broader constructs listed by a reviewer, such as attention, motivation, or stress, which must be explicitly defined to be rigorously considered in this context, including the likelihood that they covary with caution without being equivalent to it. 

      Cue-evoked orienting responses at CS onset are directly measured, and their habituation and sensitization have been characterized in our prior work (e.g., Zhou et al., 2023 JNeurosci). US-evoked escapes are also measured in the present study and directly compared with avoidance responses. Together, these analyses provide a rigorous and consistent framework for defining and quantifying caution within our behavioral procedures.

      Importantly, mice exhibit cautious responding as defined here across different tasks, making it more informative to classify avoidance responses by behavioral mode rather than by task alone. Accordingly, in the miniscope, single-neuron, and mixed-effects model analyses, we classified active avoids into distinct modes reflecting varying levels of caution. Although these modes covary with task contingencies, their explicit classification improves model predictability and interpretability with respect to cautious responding.

      (2) Related to the last, I'm struggling to understand the rationale for dividing cells into 'types' based on their physiological responses in some experiments (e.g., Figure 7).

      This section has now been expanded into 3 figures (Fig. 7-9) with new modeling approaches that should make the rationale more straightforward.

      By emphasizing the mixed-effects modeling results and integrating these analyses directly into the figures, the revised manuscript now more clearly delineates what is encoded at the population and single-neuron levels. Including movement and baseline covariates allowed us to dissociate motor-related modulation from other neural signals, substantially clarifying the distinction between movement encoding and other task-related variables, which we focus on in the paper. These analyses confirm the strong role of the STN in representing movement while revealing additional signals related to aversive stimulation and cautious responding that persist after accounting for motor effects. These signals arise from distinct neuronal populations that can be differentiated by their movement sensitivity and activation patterns across avoidance modes, reflecting varying levels of caution. At the same time, several effects that initially reflected orienting-related activity at CS-onset (note that our movement tracking captures both head position and orientation as a directional vector) dissipated once movement and baseline covariates were included in the models, emphasizing the utility of the analytical improvements in the revision.

      (3) The description and discussion of orienting head movements were not well supported, but were much discussed in the avoidance datasets. The initial speed peaks to cue seem to be the supporting data upon which these claims rest, but nothing here suggests head movement or orientation responses.

      As described in the methods (and noted above), we track the head and decompose the movement into rotational and translational components. With the new approach, several effects that initially reflected orienting-related activity at CS-onset (note that our movement tracking captures both head position and orientation as a directional vector) dissipated once movement and baseline covariates were included in the models, emphasizing the utility of the analytical improvements in the revision.
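
      To make the distinction between translational and rotational components concrete, the following is a minimal sketch of one way such a decomposition could be computed from tracked head position and heading. It is not the authors' pipeline: the function name, the input layout (per-frame x, y, and heading in radians), and the frame interval are all assumptions made for illustration.

```python
import math

def decompose_head_movement(xs, ys, headings, dt):
    """Return (translational_speed, angular_speed) for each frame-to-frame step.

    xs, ys   : head position per frame (arbitrary length units) -- assumed inputs
    headings : head direction per frame, in radians -- assumed input
    dt       : time between frames, in seconds
    """
    translational, rotational = [], []
    for i in range(1, len(xs)):
        # Translational component: displacement of the head position.
        dist = math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1])
        translational.append(dist / dt)
        # Rotational component: signed heading change, wrapped to [-pi, pi]
        # so that a turn across the +/-pi boundary is not counted as a full spin.
        d = headings[i] - headings[i - 1]
        d = math.atan2(math.sin(d), math.cos(d))
        rotational.append(d / dt)
    return translational, rotational
```

      With a decomposition of this kind, a cue-evoked speed peak can be attributed to displacement, to turning, or to both, which is the distinction the reviewer asked about.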

      (4) Similar to the last, the authors note in several places, including abstract, the importance of STN in response timing, i.e., particularly when there must be careful or precise timing, but I don't think their data or task design provides a strong basis for this claim.

      The avoidance modes and the measured latencies directly support the relation to action timing. The portion of the previous manuscript on optogenetic excitation, which was apparently the main source of this criticism, is no longer part of the present study.

      (5) I think that other reports show that STN calcium activity is recruited by inescapable foot shock as well. What do these authors see? Is shock, independent of movement, contributing to sharp signals during escapes?

      The question, “Is shock, independent of movement, contributing to sharp signals during escapes?” is now directly addressed in the revised analyses. By incorporating movement and baseline covariates into the mixed-effects models, we dissociate STN activity related to aversive stimulation from that associated with motor output. The results show that shock-evoked STN activation persists even after controlling for movement within defined neuronal populations, supporting a specific nociceptive contribution independent of motor dynamics—a dissociation that appears to be new in this field.
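
      The logic of controlling for movement can be illustrated with a deliberately simplified sketch. It uses ordinary least squares with a movement covariate and no random effects, so it is a stand-in for, not a reproduction of, the mixed-effects models described in the manuscript. All trial values are synthetic and hypothetical.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (small problems only)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic trials (hypothetical): shock trials also involve faster movement.
shock = [0, 0, 0, 1, 1, 1]
speed = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
dff = [0.5 + 2.0 * s + 1.5 * v for s, v in zip(shock, speed)]

# Naive contrast: confounds the shock effect with the extra movement.
mean = lambda values: sum(values) / len(values)
naive = (mean([d for d, s in zip(dff, shock) if s == 1])
         - mean([d for d, s in zip(dff, shock) if s == 0]))

# Adjusted estimate: design matrix with intercept, shock, and speed covariate.
X = [[1.0, float(s), v] for s, v in zip(shock, speed)]
intercept, shock_effect, speed_slope = ols(X, dff)
```

      In this toy example the naive contrast overstates the shock effect because shock trials also have higher speeds, whereas the adjusted coefficient recovers the shock contribution that remains after movement is accounted for. A mixed-effects model would additionally include random effects for mouse and session.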

      (6) In particular, and related to the last point, the following work is very relevant and should be cited:  Note that the focus of this other paper is on a subset of VGLUT2+ Tac1 neurons in paraSTN, but using VGLUT2-Cre to target STN will target both STN and paraSTN.

      We appreciate the reviewer’s reference to the recent preprint highlighting the role of the para-subthalamic nucleus in avoidance learning. However, our study focused specifically on performance in well-trained mice rather than on learning processes. Behavioral learning is inherently more variable and can be disrupted by less specific manipulations, whereas our experiments targeted the stable execution of learned avoidance behaviors. Future work will extend these findings to the learning phase and examine potential contributions of subthalamic subdivisions, which our current Vglut2-based manipulations do not dissociate. We will consider this and related work more closely in those studies.

      (7) In multiple other instances, claims that were more tangential to the main claims were made without clearly supporting data or statistics. E.g., claim that STN activation is related to translational more than rotational movement; claim that GCaMP and movement responses to auditory cues were small; claims that 'some animals' responded differently without showing individual data.

      We have adjusted the text accordingly.

      (8) In several figures, the number of subjects used was not described. This is necessary. Also necessary is some assessment of the variability across subjects. The only measure of error shown in many figures relates to trial-to-trial or event variability, which is minimal because, in many cases, it appears that hundreds of trials may have been averaged per animal, but this doesn't provide a strong view of biological variability. When bar/line plots are used to display data, I recommend showing individual animals where feasible.

      All experiments report number of mice and sessions. Wherever feasible, we display individual data points (e.g., Figures 1 and 2) to convey variability directly. However, in cases where figures depict hundreds of paired (repeated-measures) data points, showing all points without connecting them would not be appropriate, while linking them would make the figures visually cluttered and uninterpretable. All plots and traces include measures of variability (SEM), and the raw data will be shared on Dryad. When error bars are not visible, they are smaller than the trace thickness or bar line—for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Also, to minimize visual clutter, only a subset of relevant comparisons is highlighted with asterisks, whereas all relevant statistical results, comparisons, and mouse/session numbers are fully reported in the Results section, with statistical analyses accounting for the clustering of data within subjects and sessions.

      (9) Can the authors consider the extent to which calcium imaging may be better suited to identify increases compared to decreases and how this may affect the results, particularly related to the GRIN data when similar numbers of cells show responses in both directions (e.g., Figure 3)?

      This is an interesting issue related to a widely used technique beyond the scope of our study.

      (10) Raw example traces are not provided.

      We do not think raw traces are useful here. All figures contain average traces to reflect the activity of the estimated population.

      (11) The timeline of the spontaneous movement and avoidance sessions was not clear, nor was the number of events or sessions per animal nor how this was set. It is not clear if there was pre-training or habituation, if many or variable sessions were combined per animal, or what the time gaps between sessions were, or if or how any of these parameters might influence interpretation of the results.

      We have enhanced the description of the sessions, including the number of animals and sessions, which are daily and always equal per animal in each group of experiments. As noted, the sessions are part of the random effects in the model.

      (12) It is not clear if or how the spread of expression outside of the target STN was evaluated, and if or how many mice were excluded due to spread or fiber placements.

      The STN is anatomically well-confined, with its borders and the overlying zona incerta (composed of GABAergic neurons) providing protection against off-target expression in most neighboring forebrain regions. All viral injections were histologically verified and did not extend into thalamic or hypothalamic areas. As described in the Methods, we employed an app we developed (Brain Atlas Analyzer, available on OriginLab) that aligns serial histological sections with the Allen Brain Atlas to precisely assess viral spread and confirm targeting accuracy. The experiments included in the revised manuscript now focus on optogenetic inhibition and irreversible lesion approaches, three complementary methods that consistently targeted the STN and yielded similar behavioral effects.

      Recommendations for the authors:

      Reviewing Editor Comments:

      The primary feedback agreed upon by all the reviewers was that the manuscript requires significant streamlining as it is currently overly long and convoluted.

      We thank the reviewers and editors for their thoughtful and constructive feedback. In response to the primary comment that “the manuscript requires significant streamlining as it is currently overly long and convoluted,” we have substantially revised and refocused the paper. Specifically, we streamlined the included data and enhanced the analyses to emphasize the central findings: the encoding of movement, cautious responding, and punishment in the STN during avoidance behavior. We also focused the causal component of the study by including only the loss-of-function experiments—both optogenetic inhibition and irreversible viral/electrolytic lesions—that establish the critical role of STN circuits in generating active avoidance. Together, these revisions enhance clarity, tighten the narrative focus, and align the manuscript more closely with the reviewers’ recommendations.

      Major revisions include the addition of mixed-effects modeling to dissociate the contributions of movement from other STN-encoded signals related to caution and punishment. This modeling approach allowed us to reveal that these components are statistically separable, demonstrating that movement, cautious responding, and aversive input are encoded by neuronal subsets. To streamline the manuscript and address reviewer concerns, we removed the optogenetic excitation experiments. As revised, the paper presents a more concise and cohesive narrative showing that STN neurons differentially encode movement, caution, and aversive stimuli, and that this circuitry is essential for generating active avoidance behavior.

      Many of the specific points raised by reviewers now fall outside the scope of the revised manuscript. This is primarily because the revised version omits data and analyses related to optogenetic excitation and associated control experiments. By removing these components, the paper now presents a streamlined and internally consistent dataset focused on how the STN encodes movement, cautious responding, and aversive outcomes during avoidance behavior, as well as on loss-of-function experiments demonstrating its necessity for generating active avoidance. Below, we address the points that remain relevant across reviews.

      Following extensive revisions, the current manuscript differs in several important ways from what the assessment describes:

      The description that the study “uses fiber photometry, implantable lenses, and optogenetics” is more accurately represented as using both fiber photometry and single-neuron calcium imaging with miniscopes, combined with optogenetic and irreversible lesion approaches.

      The phrase stating that “active but not passive avoidance depends in part on STN projections to substantia nigra” is better characterized as “STN projections to the midbrain,” since our data show that optogenetic inhibition of STN terminals in both the mesencephalic reticular tegmentum (MRT) and substantia nigra pars reticulata (SNr) produces equivalent effects, and thus these sites are combined in the study.

      Finally, the original concern that evidence for STN involvement in cautious responding or avoidance speed was incomplete no longer applies. The revised focus on encoding, through the inclusion of mixed-effects modeling, now dissociates movement-related, cautious, and aversive components of STN activity. By removing the optogenetic excitation data, we no longer claim that the STN controls caution but rather that it encodes cautious responding, alongside movement and punishment signals. Furthermore, loss-of-function experiments demonstrate that silencing STN output abolishes active avoidance entirely, supporting an essential role for the STN in generating goal-directed avoidance behavior—a behavioral domain that, unlike appetitive responding, is fundamentally defined by caution and the need to balance action timing under threat.

      Reviewer #2 (Recommendations for the authors):

      (1) Show individual data points on bar plots.

      Wherever feasible, we display individual data points (e.g., Figures 1 and 2) to convey variability directly. However, in cases where figures depict hundreds of paired (repeated-measures) data points, showing all points without connecting them would not be appropriate, while linking them would make the figures visually cluttered and uninterpretable. All plots and traces include measures of variability (SEM), and the raw data will be shared on Dryad. When error bars are not visible, they are smaller than the trace thickness or bar line; for example, in Figure 5B, the black circles and orange triangles include error bars, but they are smaller than the symbol size.

      Also, to minimize visual clutter, only a subset of relevant comparisons is highlighted with asterisks, whereas all relevant statistical results, comparisons, and mouse/session numbers are fully reported in the Results section, with statistical analyses accounting for the clustering of data within subjects and sessions.

      (2) The active avoidance experiments are confusing when they are introduced in the results section. More explanation of what paradigms were used and what each CS means at the time these are introduced would add clarity. For example, AA1, AA2, etc, are explained only with references to other papers, but a brief description of each protocol and a schematic figure would really help.

      The avoidance protocols (AA1–4) are now described briefly but clearly in the Results section (second paragraph of “STN neurons activate during goal-directed avoidance contingencies”) and in greater detail in the Methods section. As stated, these tasks were conducted sequentially, and mice underwent the same number of sessions per procedure, which are indicated. All relevant procedural information has been included in these sections. Mice underwent daily sessions and learnt these tasks within 1-2 sessions, progressing sequentially across tasks with an equal number of sessions per task (7 per task), and the resulting data were combined and clustered by mouse/session in the statistical models.

      (3) How do the Class 1, 2, 3 avoids relate to Class 1, 2, 3 neural types established in Figure 3? It seems like they are not related, and if that is the case, they should be named something different from each other to avoid confusion.

      (4) Similarly, having 3 different cell types (a,b,c) in the active avoidance seems unrelated to the original classification of cell types (1,2,3), and these are different for each class of avoid. This is very confusing, and it is unclear how any of these types relate to each other. Presumably, the same mouse has all three classes of avoids, so there are recordings from each cell during each type of avoid.

      The terms class, mode, and type are now clearly distinguished throughout the manuscript. Modes refer to distinct patterns of avoidance behavior that differ in the level of cautious responding (Mode 3 is most cautious). Within each mode, types denote subgroups of neurons identified based on their ΔF/F activity profiles. In contrast, classes categorize neurons according to their relationship to movement, determined by cross-correlation analyses between ΔF/F and head speed (Class 1-4; Fig. 7 is a new analysis) or head turns (Class A-C, renamed from 1-3). This updated terminology clarifies the analytic structure, highlighting distinct neuronal populations within each analysis. For example, during avoidance behaviors, these classifications distinguish neurons encoding movement-, caution-, and outcome-related signals. Comparisons are conducted within each analytical set, within classes (A-C or 1-4 separately), within avoidance modes, or within mode-specific neuronal types.
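
      As an illustration of how a cross-correlation between ΔF/F and head speed can sort neurons by their relationship to movement, here is a minimal sketch. The lag window, the 0.5 threshold, and the three labels are hypothetical; the criteria that define Classes 1-4 in the manuscript are not stated in this response and are not reproduced here.

```python
import math

def xcorr_at_lag(a, b, lag):
    """Pearson correlation between a[t] and b[t + lag] over overlapping samples."""
    if lag >= 0:
        x, y = a[:len(a) - lag], b[lag:]
    else:
        x, y = a[-lag:], b[:len(b) + lag]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat trace carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def peak_xcorr(speed, dff, max_lag):
    """Return (lag, r) with the largest |r|; positive lag means dF/F follows speed."""
    candidates = ((lag, xcorr_at_lag(speed, dff, lag))
                  for lag in range(-max_lag, max_lag + 1))
    return max(candidates, key=lambda pair: abs(pair[1]))

def classify(speed, dff, max_lag=5, threshold=0.5):
    """Label a neuron by the sign and size of its peak correlation (hypothetical rule)."""
    _, r = peak_xcorr(speed, dff, max_lag)
    if r >= threshold:
        return "positively movement-related"
    if r <= -threshold:
        return "negatively movement-related"
    return "not movement-related"
```

      In practice the threshold would be set against a shuffled or time-shifted null distribution rather than fixed in advance.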

      …So the authors could compare one cell during each avoid and determine whether it relates to movement or sound, or something else. It is interesting that types a,b, and c have the exact same proportions in each class of avoid, and makes it important to investigate if these are the exact same cells or not.

      That previous table with the a,b,c % in the three figure panels was a placeholder, which was not updated in the included figure. It has now been correctly updated. They do not have the same proportions as shown in Fig. 9, although they are similar.

      Also, these mice could be recorded during the open field, so the original neural classification (class 1, 2,3) could be applied to these same cells, and then the authors can see whether each cell type defined in the open field has a different response to the different avoid types. As it stands, the paper simply finds that during movement and during avoidance behaviors, different cells in the STN do different things.

      We included a new analysis in Fig. 7 that classifies neurons based on the cross-correlation with movement. The inclusion of the models now clearly assigns variance to movement versus the other factors, and this analysis leads to the classification based on avoid modes. 

      (5) The use of the same colors to mean two different things in Figure 9 is confusing. AA1 vs AA2 shouldn't be the same colors as light-naïve vs light signaling CS.

      Optogenetic excitation is no longer part of the study.

      (6) The exact timeline of the optogenetics experiments should be presented as a schematic for understanding. It is not clear which conditions each mouse experienced in which order. This is critical to the interpretation of Figure 9 and the reduction of passive avoids during STN stimulation. Did these mice have the CS1+STN stimulation pairing or the STN+US pairing prior to this experiment? If they did, the stimulation of the STN could be strongly associated with either punishment or with the CS1 that predicts punishment. If that is the case, stimulating the STN during CS2 could be like presenting CS1+CS2 at the same time and could be confusing. The authors should make it clear whether the mice were naïve during this passive avoid experiment or whether they had experienced STN stimulation paired with anything prior to this experiment.

      Optogenetic excitation is no longer part of the study.

      (20) Similarly, the duration of the STN stimulation should be made clear on the plots that show behavior over time (e.g., Figure 9E).

      Optogenetic excitation is no longer part of the study.

      (21) There is just so much data and so many conditions for each experiment here. The paper is dense and difficult to read. It would really benefit readability if the authors put only the key experiments and key figure panels in the main text and moved much of the repetitive figure panels to supplemental figures. The addition of schematic drawings for behavioral experiment timing and for the different AA1, AA2, and AA3 conditions would also really improve clarity.

      By focusing the study, we believe it has substantially improved clarity and readability. 

      Reviewer #3 (Recommendations for the authors):

      (1) Minor error in results 'Cre-AAV in the STN of Vglut2-Cre'

      Fixed.

      (2) In some Figure 2 panels, the peaks appear to be cut off, and blue traces are obscured by red.

      In Fig. 2, the peaks of movement (speed) traces are intentionally truncated to emphasize the rising phase of the turn, which would otherwise be obscured if the full y-axis range were displayed (peaks and other measures are statistically compared). This adjustment enhances clarity without omitting essential detail and is now noted in the legend.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Artiushin et al. establish a comprehensive 3D atlas of the brain of the orb-web building spider Uloborus diversus. First, they use immunohistochemical detection of synapsin to mark and reconstruct the neuropils of the brain of six specimens and they generate a standard brain by averaging these brains. Onto this standard 3D brain, they plot immunohistochemical stainings of major transmitters to detect cholinergic, serotonergic, octopaminergic/tyraminergic and GABAergic neurons, respectively. Further, they add information on the expression of a number of neuropeptides (Proctolin, AllatostatinA, CCAP, and FMRFamide). Based on this data and 3D reconstructions, they extensively describe the morphology of the entire synganglion, the discernible neuropils, and their neurotransmitter/neuromodulator content.

      Strengths:

      While 3D reconstruction of spider brains and the detection of some neuroactive substances have been published before, this seems to be the most comprehensive analysis so far, both in terms of the number of substances tested and the ambition to analyze the entire synganglion. Interestingly, besides the previously described neuropils, they detect a novel brain structure, which they call the tonsillar neuropil.

      Immunohistochemistry, imaging, and 3D reconstruction are convincingly done, and the data are extensively visualized in figures, schemes, and very useful films, which allow the reader to work with the data. Due to its comprehensiveness, this dataset will be a valuable reference for researchers working on spider brains or on the evolution of arthropod brains.

      Weaknesses:

      As expected for such a descriptive groundwork, new insights or hypotheses are limited, apart from the first description of the tonsillar neuropil. A more comprehensive labeling in the panels of the mentioned structures would help to follow the descriptions. The reconstruction of the main tracts of the brain would be a very valuable complementary piece of data.

      Reviewer #2 (Public review):

      Summary

      Artiushin et al. created the first three-dimensional atlas of a synganglion in the hackled orb-weaver spider, which is becoming a popular model for web-building behavior. Immunohistochemical analysis with an impressive array of antisera reveals subcompartments of neuroanatomical structures described in other spider species as well as three previously undescribed arachnid structures, the protocerebral bridge, hagstone, and paired tonsillar neuropils. The authors describe the spider's neuroanatomy in detail and discuss similarities and differences from other spider species. The final section of the discussion examines the homology between onychophoran and chelicerate arcuate bodies and mandibulate central bodies.

      Strengths

      The authors set out to create a detailed 3D atlas and accomplished this goal.

      Exceptional tissue clearing and imaging of the nervous system reveal the three-dimensional relationships between neuropils and some connectivity that would not be apparent in sectioned brains.

      A detailed anatomical description makes it easy to reference structures described between the text and figures.

      The authors used a large palette of antisera which may be investigated in future studies for function in the spider nervous system and may be compared across species.

      Weaknesses

      It would be useful for non-specialists if the authors would introduce each neuropil with some orientation about its function or what kind of input/output it receives, if this is known for other species. This applies especially to structures that are not described in other arthropods, like the opisthosomal neuropil. Do the neuroanatomical findings in this paper have implications for understanding how web-building behaviors are mediated by the brain?

      Likewise, where possible, it would be helpful to have some discussion of the implications of certain neurotransmitters/neuropeptides being enriched in different areas. For example, GABA would signal areas of inhibitory connections, such as inhibitory input to mushroom bodies, as described in other arthropods. In the discussion section on relationships between spider and insect midline neuropils, are there similarities in expression patterns between those described here and in insects?

      Reviewer #3 (Public review):

      Summary:

      This is an impressive paper that offers a much-needed 3D standardized brain atlas for the hackled-orb weaving spider Uloborus diversus, an emerging organism of study in neuroethology. The authors used a detailed immunohistological whole-mount staining method that allowed them to localize a wide range of common neurotransmitters and neuropeptides and map them on a common brain atlas. Through this approach, they discovered groups of cells that may form parts of neuropils that had not previously been described, such as the 'tonsillar neuropil', which might be part of a larger insect-like central complex. Further, this work provides unique insights into the previously underappreciated complexity of higher-order neuropils in spiders, particularly the arcuate body, and hints at a potentially important role for the mushroom bodies in vibratory processing for web-building spiders.

      Strengths:

      To understand brain function, data from many experiments on brain structure must be compiled to serve as a reference and foundation for future work. As demonstrated by the overwhelming success in genetically tractable laboratory animals, 3D standardized brain atlases are invaluable tools - especially as increasing amounts of data are obtained at the gross morphological, synaptic, and genetic levels, and as functional data from electrophysiology and imaging are integrated. Among 'non-model' organisms, such approaches have included global silver staining and confocal microscopy, MRI, and, more recently, micro-computed tomography (X-ray) scans used to image multiple brains and average them into a composite reference. In this study, the authors used synapsin immunoreactivity to generate an averaged spider brain as a scaffold for mapping immunoreactivity to other neuromodulators. Using this framework, they describe many previously known spider brain structures and also identify some previously undescribed regions. They argue that the arcuate body - a midline neuropil thought to have diverged evolutionarily from the insect central complex - shows structural similarities that may support its role in path integration and navigation.

      Having diverged from insects such as the fruit fly Drosophila melanogaster over 400 million years ago, spiders are an important group for study - particularly due to their elegant web-building behavior, which is thought to have contributed to their remarkable evolutionary success. How such exquisitely complex behavior is supported by a relatively small brain remains unclear. A rich tradition of spider neuroanatomy emerged in the previous century through the work of comparative zoologists, who used reduced silver and Golgi stains to reveal remarkable detail about gross neuroanatomy. Yet, these techniques cannot uncover the brain's neurochemical landscape, highlighting the need for more modern approaches - such as those employed in the present study.

      A key insight from this study involves two prominent higher-order neuropils of the protocerebrum: the arcuate body and the mushroom bodies. The authors show that the arcuate body has a more complex structure and lamination than previously recognized, suggesting it is insect central complex-like and may support functions such as path integration and navigation, which are critical during web building. They also report strong synapsin immunoreactivity in the mushroom bodies and speculate that these structures contribute to vibratory processing during sensory feedback, particularly in the context of web building and prey localization. These findings align with prior work that noted the complex architecture of both neuropils in spiders and their resemblance (and in some cases greater complexity) compared to their insect counterparts. Additionally, the authors describe previously unrecognized neuropils, such as the 'tonsillar neuropil,' whose function remains unknown but may belong to a larger central complex. The diverse patterns of neuromodulator immunoreactivity further suggest that plasticity plays a substantial role in central circuits.

      Weaknesses:

      My major concern, however, is that some of the authors' neuroanatomical descriptions rely too heavily on inference rather than what is currently resolvable from their immunohistochemistry stains alone.

      We would like to thank the reviewers for their time and effort in carefully reading our manuscript and providing helpful feedback, and particularly for their appreciation and realistic understanding of the scope of this study and its context within the existing spider neuroanatomical literature.

      Regarding the limitations and potential additions to this study, we believe these to be well-reasoned and are in agreement. We plan to address some of these shortcomings in future publications.

      As multiple reviewers remarked, a mapping of the major tracts of the brain would be a welcome addition to understanding the neuroanatomy of U. diversus. This is something which we are actively working on and hope to provide in a forthcoming publication. Given the length of this paper as is, we considered that a treatment of the tracts would be better served by an additional paper. Likewise, mapping of the immunoreactive somata of the currently investigated targets is a component which we would like to describe as part of a separate paper, keeping the focus of the current one on neuropils, in order to leverage our aligned volumes to describe co-expression patterns, which is not as useful for the more widely dispersed somata. Furthermore, while we often see somata through immunostaining, the presence and intensity of the signal are variable among immunoreactive populations. We are finding that these populations are more consistently and comprehensively revealed through fluorescent in situ hybridization.

      We appreciate the desire of the reviewers for further information regarding the connectivity and function of the described neuropils, and where possible we have added additional statements and references. That being said, where this context remains sparse, it is largely a reflection of the lack of information in the literature. This is particularly the case for functional roles for spider neuropils, especially higher order ones of the protocerebrum, which are essentially unexamined. As summarized in the quite recent update to Foelix’s Spider Neuroanatomy, a functional understanding for protocerebral neuropil is really only available for the visual pathway. Consequently, it is also difficult to speak of the implications of the presence or absence of particular signaling elements in these neuropils if no further information about the circuitry or behavioral correlates is available. Finally, multiple reviewers suggested that it might be worthwhile to explore a comparison of the arcuate body layer innervation to that of the central bodies of insects, of which there is a richer literature. This is an idea which we were also initially attracted to, and have now added some lines to the discussion section. Our position on this is a cautious one, as a series of more recent comparative studies spanning many insect species using the same antibody reveals a considerable amount of variation in central body layering even within this clade, which has given us pause in interpreting how substantive similarities and differences to the far more distant spiders would be. Still, this is an interesting avenue which merits an eventual comprehensive analysis, one which would certainly benefit from having additional examples from more spider species, in order to not overstate conclusions based on the currently limited neuroanatomical representation.

      Given our framing for the impetus to advance neuroanatomical knowledge in orb-web builders, the question of whether the present findings inform the circuitry controlling web-building is one that naturally follows. While we are unable with this dataset alone to define which brain areas mediate web-building - something which would likely be beyond any anatomical dataset lacking complementary functional data – the process of assembling the atlas has revealed structures and defined innervation patterns in previously ambiguous sectors of the spider brain, particularly in the protocerebrum. A simplistic proposal is that such regions, which are more conspicuous by our techniques and in this model species, would be good candidates for further inquiries into web-building circuitry, as their absence or oversight in past work could be attributable to the different behavioral styles of those model species. Regardless, the fact that such a hypothesis cannot be readily refuted by the existing neuroanatomical literature underscores the need for more finely refined models of the spider brain, to which we hope that we have positively contributed. We are gratified by the reviewers’ enthusiasm for the strengths of this study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Brenneis 2022 has done a very nice and comprehensive study focused on the visual system - this might be worth including.

      Thank you, we have included this reference on Line 34.

      (2) L 29: When talking about "connectivity maps", the emerging connectomes based on EM data could be mentioned.

      Additional references have been added, thank you. Line 35.

      (3) L 99: Please mention that you are going to describe the brain from ventral to dorsal.

      Thank you, we have added a comment to Line 99.

      (4) L 13: is found at the posterior.

      Thank you, revised.

      (5) L 168: How did you pick those two proctolin+ somata, given that there is a lot of additional punctate signal?

      Although not visible in this image, if you scroll through the stack there is a neurite which extends from these neurons directly to this area of pronounced immunoreactivity.

      (6) Figure 1: Please add the names of the neuropils you go through afterwards.

      We have added labels for neuropils which are recognizable externally.

      (7) Figure 1 and Figure 5: Please mark the esophagus.

      Label has now been added to Figure 1. In Figure 5, the esophagus should not really be visible because these planes are just ventral to its closure.

      (8) Figure 5A: I did not see any CCAP signal where the arrow points to; same for 5B (ChAT).

      In hindsight, the CCAP point is probably too minor to be worth mentioning, so we have removed it.

      The ChAT signal pattern in 5B has been reinforced by adding a dashed circle to show its location as well.

      (9) L 249: Could the circular spot also be a tract (many tracts lack synapsin - at least in insects)?

      Yes, thank you for pointing this out – the sentence is revised (L274). We are currently further analyzing anti-tubulin volumes and it seems that indeed there are tracts which occupy these synapsin-negative spaces, although interestingly they do not tend to account for the entire space.

      (10) L 302: Help me see the "conspicuous" thing.

      Brace added to Fig. 8B, note in caption.

      (11) L 315: Please first introduce the number of the eyes and how these relate to the 1° and 2° pathways. Are these separate pathways from separate eyes or two relay stations of one visual pathway?

      We have expanded the introduction to this section (L336). Yes, these are considered as two separate visual pathways, with a typical segregation of which eyes contribute to which pathway – although there is evidence for species-specific differences in these contributions. In the context of this atlas, we are not currently able to follow which eyes are innervating which pathway.

      (12) L 343: It seems that the tonsillar neuropil could be midline spanning (at least this is how I interpret the signal across the midline). Would it make sense to re-formulate from a paired structure to midline-spanning? Would that make it another option for being a central complex homolog?

      In the spectrum from totally midline spanning and unpaired (e.g., arcuate body (at least in adults)) to almost fully distinct and paired (e.g., mushroom bodies (although even here there is a midline spanning ‘bridge’)), we view the tonsillar to be more paired due to the oval components, although it does have a midline spanning section, particularly unambiguous just posterior to the oval sections.

      Regarding central complex homology, if the suggestion is that the tonsillar with its midline spanning component could represent the entire central complex, then this is a possibility, but it would neglect the highly innervated and layered arcuate body, which we think represents a stronger contender – at least as a component of the central complex. For this reason, we would still be partial to the possibility that the tonsillar is a part of the central complex, but not the entire complex.

      (13) L 407: ...and dorsal (..) lobe...

      Added the word ‘lobe’ to this sentence (L429).

      (14) L 620ff: Maybe mention the role of MBs in learning and memory.

      A reference has been added at L661.

      (15) L 644: In the context of arcuate body homology with the central body, I was missing a discussion of the neurotransmitters expressed in the respective parts in insects. Would that provide additional arguments?

      This is an interesting comparison to explore, and is one that we initially considered making as well. There are certainly commonalities that one could point to, particularly in trying to build the case of whether particular lobes of the arcuate body are similar to the fan-shaped or ellipsoid bodies in insects. Nevertheless, something which has given us pause is studying the more recent comparative works between insect species (Timm et al., 2021, J Comp Neuro, Homberg et al., 2023, J Comp Neuro), which also reveal a fair degree of heterogeneity in expression patterns between species – and this is despite the fact that the neuropils are unambiguously homologous. When comparing to a much more evolutionarily distant organism such as the spider, it becomes less clear which extant species should serve as the best point of comparison, and therefore we fear making specious arguments by focusing on similarities when there are also many differences. We have added some of these comments to the discussion (L699-725).

      Throughout the text, I frequently had difficulties in finding the panels right away in the structures mentioned in the text. It would help to number the panels (e.g., 6Ai, Aii, Aiii, etc.) and refer to those in the text. Further, all structures mentioned in the text should be labelled with arrows/arrowheads unless they are unequivocally identified in the panel.

      Thank you for the suggestion. We have adopted the additional numbering scheme for panels, and added additional markers where suggested.

      Reviewer #2 (Recommendations for the authors):

      (1) L 18: "neurotransmitter" should be pluralized.

      Thank you, revised (L18).

      (2) L 55: Missing the word "the" before "U. diversus".

      Thank you, revised (L57).

      (3) L 179: Change synaptic dense to "synapse-dense".

      Thank you, revised (L189).

      (4) L 570: "present in" would be clearer than "presented on in".

      Our intention here was to say that Loesel et al did not show slices from the subesophageal mass for CCAP, so it was ambiguous as to whether it had immunoreactivity there but they simply did not present it, or if it indeed doesn’t show signal in the subesophageal. But agreed, this is awkward phrasing which has been revised (L606-608), thank you.

      (5) L 641: It would be worth noting that the upper and lower central bodies are referred to as the fan-shaped and ellipsoid bodies in many insects.

      Thank you, this has been added in L694.

      (6) L 642: Although cited here regarding insect central body layers, Strausfeld et al. 2006 mainly describe the onychophoran brain and the evolutionary relationship between the onychophoran and chelicerate arcuate bodies. The phylogenetic relationships described here would strengthen the discussion in the section titled "A spider central complex?"

      The phylogenetic relationship of onychophorans and chelicerates remains controversial and therefore we find it tricky to use this point to advance the argument in that discussion section, as one could make opposing arguments. The homology of the arcuate body (between chelicerates, onychophorans, and mandibulates) has likewise been argued over, with this Strausfeld et al paper offering one perspective, while others are more permissive (good summary at end of Doeffinger et al., 2010). Our thought was simply to draw attention to grossly similar protocerebral neuropils in examples from distantly related arthropods, without taking a stance, as our data doesn’t really deeply advance one view over the other.

      (7) L 701: Noduli have been described in stomatopods (Thoen et al., Front. Behav. Neurosci., 2017).

      This is an important addition, thank you – it has been incorporated and cited (L766).

      (8) Antisera against DC0 (PKA-C alpha) may distinguish globuli cells from other soma surrounding the mushroom bodies, but this may be accomplished in future studies.

      Agreed, this is something we have been interested in, but have not yet acquired the antibody.

      Reviewer #3 (Recommendations for the authors):

      Overall, this paper is both timely and important. However, it may face some resistance from classically trained arthropod neuroanatomists due to the authors' reliance on immunohistochemistry alone. A method to visualize fiber tracts and neuropil morphology would have been a valuable and grounding complement to the dataset and can be added in future publications. Tract-tracing methods (e.g., dextran injections) would strengthen certain claims about connectivity - particularly those concerning the mushroom bodies. For delineating putative cell populations across regions, fluorescence in situ hybridization for key transcripts would offer convincing evidence, especially in the context of the arcuate body, the tonsillar neuropil, and proposed homologies to the insect central complex.

      That said, the dataset remains rich and valuable. Outlined below are a number of issues the authors may wish to address. Most are relatively minor, but a few require further clarification.

      (1) Abstract

      (a) L 12-14: The authors should frame their work as a novel contribution to our understanding of the spider brain, rather than solely as a tool or stepping stone for future studies. The opening sentences currently undersell the significance of the study.

      Thank you for your encouragement! We have revised the abstract.

      (b) Rather than touting "first of its kind" in the abstract, state what was learned from this.

      Thank you, we have revised the abstract.

      (c) The abstract does not mention the major results of the study. It should state which brain regions were found. It should list all of the peptides and transmitters that were tested so that they can be discoverable in searches.

      Thank you, revised.

      (2) Introduction

      (a) L 38: There's a more updated reference for Long (2016): Long, S. M. (2021). Variations on a theme: Morphological variation in the secondary eye visual pathway across the order of Araneae. Journal of Comparative Neurology, 529(2), 259-280.

      Thank you, this has been updated (L41 and elsewhere).

      (b) L 47: While whole-mount imaging offers some benefits, a downside is the need for complete brain dissection from the cuticle, which in spiders likely damages superficial structures (such as the secondary eye pathways).

      True – we have added this caveat to the section (L48-51).

      (c) L 49-52: If making this claim, more explicit comparisons with non-web building C. salei in terms of neuropil presence, volume, or density later in the paper would be useful.

      We do not have the data on hand to make measured comparisons of C. salei structures, and the neuropils identified in this study are not clearly identifiable in the slices provided in the literature, so would likely require new sample preparations. We’ve removed the reference to proportionality and softened this sentence slightly – we are not trying to make a strong claim, but simply state that this is a possibility.

      (3) Results

      (a) The authors should state how they accounted for autofluorescence.

      While we did not explicitly test for autofluorescence, the long process of establishing a working whole-mount immuno protocol and testing antibodies produced many examples of treated brains which did not show any substantial signal. We have added a note to the methods section (L866).

      (b) L 69: There is some controversy in delineating the subesophageal and supraesophageal mass as the two major divisions despite its ubiquity in the literature. It might be safer to delineate the protocerebrum, deutocerebrum, and fused postoral ganglia (including the pedipalp ganglion) instead.

      Thank you for this insight, we have modified the section, section headings and Figure 1 to account for this delineation as well. We have chosen to include both ways of describing the synganglion, in order to maintain a parallel with the past literature, and to be further accessible to non-specialist readers. L73-77

      (c) L 90: It might be useful to include a justification for the use of these particular neuropeptides.

      Thank you, revised. L97-99.

      (d) L 106 - 108: It is stated that the innervation pattern of the leg neuropils is generally consistent, but from Figure 2, it seems that there are differences. The density of 5HT, Proctolin, ChAT, and FMRFamide seems to be higher in the posterior legs. AstA seems to have a broader distribution in L1 and is absent in L4.

      We would still stand by the generalization that the innervation pattern is fairly similar for each leg. The L1 neuropils tend to be bigger than the posterior legs, which might explain the difference in density. Another important aspect to keep in mind is that not all of the leg neuropils appear at the exact same imaging plane as we move from ventral to dorsal. If you scroll through the synapsin stack (ventral to dorsal), you will see that L2 and L3 appear first, followed shortly by L1, and then L4, and at the dorsal end of the subesophageal they disappear in the opposite order. The observations listed here are true for the single z-plane in Figure 2, but the fact that they don’t appear at the same time seems to mainly account for these differences. For example, if you scroll further ventrally in the AstA volume, you will see a very similar innervation appear in L4 as well, even though it is absent in the Fig. 2 plane. We plan to have these individual volumes available from a repository so that they can be individually examined to better see the signal at all levels. At the moment, the entire repository can be accessed here: https://doi.org/10.35077/ace-moo-far.

      (e) Figure 1 and elsewhere: The axes for the posterior and lateral views show Lateral and Medial. It would be more accurate to label them Left and Right, because the current labeling does not define the medial-to-lateral axis. The medial direction is correct for only one hemiganglion, and it's the opposite for the contralateral side.

      Thank you, revised.

      (f) In Figures that show particular sections, it might be helpful to include a plane in the standard brain to illustrate where that section is.

      Yes, we agree and it was our original intention. It is something we can attempt to do, but there is not much room in the corners of many of the synapsin panels, making it harder to make the 3D representation big enough to be clear.

      (g) Figure 2, 3: Presenting the z-section stack separately in B and C is awkward because it makes it seem that they are unrelated. I think it would be better to display the z160-190 directly above its corresponding z230-260 for each of the exemplars in B and C. Since there's no left-right asymmetry, a hemibrain could be shown for all examples as was done for TH in D. It's not clear why TH was presented differently.

      Thank you for this suggestion. We rearranged the figure as described, but ultimately still found the original layout to be preferable, in part because the labelling becomes too cramped. We hope that the potential confusion of the continuity of the B and C sections will be mitigated by focusing on the z plane labels and overall shape – which should suggest that the planes are not far from each other. We trust that the form of the leg neuropils is recognizable in both B and C synapsin images, and so readers will make the connection.

      Regarding TH, this panel is apart from the rest because we were unable to register the TH volume to the standard brain because the variant of the protocol which produced good anti-TH staining conflicted with synapsin, and we could not simultaneously have adequate penetration of the synapsin signal. We did not want to align the TH panel with the others to avoid potential confusion that this was a view from the same z-plane of a registered volume, as the others are. We have added a note to the figure caption.

      (h) The locations of the labels should be consistent. The antisera are below the images in Figure 2, above in Figure 3, and to the bottom left in Figure 5. The slices are shown above in Figure 2 and below in Figure 3.

      Thank you, this has been revised for better consistency.

      (i) It is surprising to me that there is no mention of the neuronal somata visible in Figure 2 and Figure 3. A typical mapping of the brain would map the locations of the neurons, not just the neuropils.

      Our first arrangement of this paper described each immunostain individually from ventral to dorsal, including locations of the immunoreactive somata which could be observed. To aid the flow of the paper and leverage the aligned volumes to emphasize co-expression in the functional divisions of the brain, we re-formulated to this current layout which is organized around neuropils. Somata locations are tricky to incorporate in this format of the paper which focuses on key z-planes or tight max projections, because the relevant immunoreactive somata are more dispersed throughout the synganglion, not always overlapping in neighboring z-planes. Further, since only a minority of the antisera we used can reveal traceable projections from the supplying somata in the whole-mount preparation, we would be quite limited in the degree to which we could integrate the specific somata mapping with expression patterns in the neuropil. Finally, compared to immuno, which can be variable in staining intensity between somata for the same target, we find that FISH reveals these locations more clearly and comprehensively – so while we agree that this mapping would also be useful for the atlas, we would like to better provide this information in a future publication using whole-mount FISH.

      (j) L 139: There is a reference to a "brace" in Figure 3B, which does not seem to exist. There's one in Figure 3C.

      There is a smaller brace near the bottom of the TDC2 panel in Fig. 3B.

      (k) L 151 should be "3D".

      Thank you, revised (L160).

      (l) Figure 4C: It is not mentioned in the legend that the bottom inset is Proctolin without synapsin.

      Thank you, revised (L1213).

      (m) L 199: Are the authors sure this subdivision is solely on the anterior-posterior axis? Could it also be dorsal ventral? (i.e., could this be an artifact of the protocerebrum and deutocerebrum?)

      Yes, this division can be appreciated to extend somewhat in the dorsal-ventral axis and it is possible that this is the protocerebrum emerging after the deutocerebrum, although this area is largely dorsal to the obvious part of the deutocerebrum. In the horizontal planes there appears to be a boundary line which we use for this subdivision in order to assist in better describing features within this generally ventral part of the protocerebrum – referred to as “stalk” because it is thinner before the protocerebrum expands in size, dorsally. Our intention was more organizational, and as stated in the text, this area is likely heterogeneous and we are not suggesting that it has a unified function, so being a visual artifact would not be excluded.

      (n) L 249: Could it also indicate large tracts projecting elsewhere?

      Yes, definitely, we have evidence that part of the space is occupied by tracts. Revised, thank you (L262).

      (o) L 281: Several investigators, including Long (2021), noted very large and robust mushroom bodies of Nephila.

      Thank you – the point is well taken that there are examples of orb-web builders that do have appreciable mushroom bodies. We have added a note in this section (L295), giving the examples of Deinopis spinosa and Argiope trifasciata (Figure 4.20 and 4.22 in Long, 2016).

      It looks like these species make the point better than Nephila, as Long lists the mushroom body percentage of total protocerebral volume for D. spinosa as 4.18%, for A. trifasciata as 2.38%, but doesn’t give a percentage for Nephila clavipes (Figure 4.24) and only labels the mushroom bodies structures as “possible” in the figure.

      In Long (2021), Nephilidae is described as follows: “In Nephilidae, I found what could be greatly reduced medullae at the caudal end of the laminae, as well as a structure that has many physical hallmarks of reduced mushroom bodies”

      (p) L 324: If the authors were able to stain for histamine or supplement this work with a different dissection technique for the dorsal structures, the visual pathways might have been apparent, which seems like a very important set of neuropils to include in a complete brain atlas.

      Yes, for this reason histamine has been an interesting target which we have attempted to visualize, but unfortunately have not yet been able to successfully stain for in U. diversus. An additional complication is that the antibodies we have seen call for glutaraldehyde fixation, which may make them incompatible with our approach to producing robust synapsin staining throughout the brain. 

      We agree that the lack of the complete visual pathway is a substantial weakness of our preparation, and should be amended in future work, but this will likely require developing a modified approach in order to preserve these delicate structures in U. diversus.

      (q) L 331: Is this bulbous shape neuropil, or just the remains of neuropil that were not fully torn away during dissection?

      This certainly is a severed part of the primary pathway, although it seems more likely that the bulbous shape is indicative of a neuropil form, rather than just being a happenstance shape that occurred during the breakage. We have examples where the same bulbous shape appears on both sides, and in different brains. It is possible that this may be the principal eye lamina – although we did not see co-staining with expected markers in examples where it did appear, so cannot be sure.

      (r) L 354: Is tyraminergic co-staining with the protocerebral bridge enough evidence to speculate that inputs are being supplied?

      We agree that this is not compelling, and have removed the statement.

      (s) L 372: This whole structure appears to be a previously described structure in spiders, the 'protocerebral commissure'.

      We are reasonably sure that what we are calling the PCB is a distinct structure from the protocerebral commissure (PCC). In Babu and Barth’s (1984) horizontal slice (Fig. 11b), you can see the protocerebral commissure immediately adjacent to the mushroom body bridge. It is found similarly located in other species, as can be seen in the supplementary 3D files provided by Steinhoff et al. (2024).

      While not visible with synapsin in U. diversus, we likewise can make out a commissure in this area in close proximity to the mushroom body bridge using tubulin staining. What we are calling the protocerebral bridge is a structure which is much more dorsal to the protocerebral commissure, not appearing in the same planes as the MB bridge.

      (t) L 377: Do you have an intuition why the tonsillar neuropil and the protocerebral bridge would show limited immunoreactivity, while the arcuate body's is quite extensive?

      This is an interesting question. Given the degree of interconnection and the fact that multiple classes of neurons in insects innervate both the central body and the PCB or noduli, perhaps it would be expected that expression in the tonsillar neuropil and protocerebral bridge should be commensurate with the innervation by that particular neurotransmitter-expressing population in the arcuate body. Apart from the fact that the arcuate body is simply bigger, perhaps this points to a greater role of the arcuate body in integration, whereas the tonsillar neuropil and PCB may engage in more particular processing, or be limited to certain sensory modalities.

      Interestingly, it seems that this pattern of more limited immunoreactivity in the PCB and noduli compared with the central bodies (fan-shaped/ellipsoid) also appears in insects (Kahsai et al., 2010, J Comp Neuro, Timm et al., 2021, J Comp Neuro, Homberg et al., 2023, J Comp Neuro) – particularly, with almost every target having at least some layering in the fan-shaped body (Kahsai et al., 2010, J Comp Neuro).  For example, serotoninergic innervation is fairly consistently seen in the upper and lower central bodies across insects, but its presence in the PCB or noduli is more variable – appearing in one or the other in a species-dependent manner (Homberg et al., 2023, J Comp Neuro).

      (4) Discussion

      (a) L 556: But if confocal images from slices are aligned, is the 3D shape not preserved?

      Yes, fair enough – the point we wanted to make was that there is still a limitation in z resolution depending on the thickness of the slices used, which could obscure structures, but perhaps this is too minor of a comment.

      (b) L 597: This is a very interesting result. I agree it's likely to do with the processing of mechanosensory information relevant to web activities, and the mushroom body seems like the perfect candidate for this.

      (c) L 638: Worth noting that neuropil volume vs density of synapses might play a role in this, as the literature is currently a bit ambiguous with regards to the former.

      Thank you, noted (L689).

      (d) L 651: The latter seems far more plausible.

      Agreed, though the presence of mushroom bodies appears to be variable in spiders, so we didn’t want to take a strong stance, here.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Public Reviews: 

      Reviewer #2 (Public review): 

      Summary: 

      This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of mouse visual cortex. 

      Strengths: 

      This is a great start for a project addressing visual reconstruction. It is based on physiological data obtained at a single-cell resolution, the stimulus movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. There appear to be no major technical flaws in the study, and some potential confounds were addressed upon revision. The study is an enjoyable read. 

      Weaknesses: 

      The study is technically competent and benchmark-focused, but without significant conceptual or theoretical advances. The inclusion of neuronal data broadens the study's appeal, but the work does not explore potential principles of neural coding, which limits its relevance for neuroscience and may create some disappointment to some neuroscientists. The authors are transparent that their goal was methodological rather than explanatory, but this raises the question of why neuronal data were necessary at all, as more significant reconstruction improvements might be achievable using noise-less artificial video encoders alone (network-to-network decoding approaches have been done well by teams such as Han, Poggio, and Cheung, 2023, ICML). Yet, even within the methodological domain, the study does not articulate clear principles or heuristics that could guide future progress. The finding that more neurons improve reconstruction aligns with well-established results in the literature that show that higher neuronal numbers improve decoding in general (for example, Hung, Kreiman, Poggio, and DiCarlo, 2005) and thus may not constitute a novel insight. 

      We thank the reviewer for this second round of comments and hope we were able to address the remaining points below. 

      Indeed, using surrogate noiseless data is interesting and useful when developing such methods, or to demonstrate that they work in principle. But in order to evaluate if they really work in practice, we need to use real neuronal data. While we did not try movie reconstruction from layers within artificial neural networks as surrogate data, in Supplementary Figure 3C we provide the performance of our method using simulated/predicted neuronal responses from the dynamic neural encoding model alongside real neuronal responses.

      Specific issues: 

      (1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I was left with the question: okay, does this mean that we should all switch to DNEM for our investigations of mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301...single trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best theoretical score, given noise and other limitations? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own, if it clarified how its findings depended on this model.

      The revision helpfully added context to the Methods about the range of scores achieved by other models, but this information remains absent from the Abstract and other important sections. For instance, the Abstract states, "We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses," yet this point estimate (presented without confidence intervals or comparisons to controls) lacks meaning for readers who are not told how it compares to prior work or what level of performance would be considered strong. Without such context, the manuscript undercuts potentially meaningful achievements. 

      We appreciate that the additional information about the performance of the SOTA DNEM to predict neural responses could be made more visible in the paper and will therefore move it from the methods to the results section instead: 

      Line 348 “This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.” will be moved to the results.

      With regard to the lack of context for the performance of our reconstruction in the abstract, we may have overcorrected in the previous revision round and have tried to find a compromise which gives more context to the pixel-level correlation value: 

      Abstract: “We achieve a pixel-level correlation of 0.57 (95% CI [0.54, 0.60]) between ground-truth movies and single-trial reconstructions. Previous reconstructions based on awake mouse V1 neuronal responses to static images achieved a pixel-level correlation of 0.238 over a similar retinotopic area.”
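      For readers unfamiliar with how an interval such as the one quoted above can be obtained, the following is a minimal sketch of a percentile bootstrap over per-movie correlations. The response does not state how the reported interval was computed, so the method, the function name `bootstrap_ci`, and the example values are all assumptions for illustration and are not data from the study.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-movie pixel correlations, for illustration only.
example_correlations = [0.52, 0.61, 0.48, 0.66, 0.57, 0.59, 0.50, 0.63, 0.55, 0.60]
mean_corr = statistics.fmean(example_correlations)
ci_low, ci_high = bootstrap_ci(example_correlations)
print(f"mean = {mean_corr:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

      A percentile bootstrap is only one option; a t-based interval or a hierarchical bootstrap across mice and movies may be more appropriate depending on how the data are nested.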

      (2) Along those lines, the authors conclude that "the number of neurons in the dataset and the use of model ensembling are critical for high-quality reconstructions." If true, these principles should generalize across network architectures. I wondered whether the same dependencies would hold for other network types, as this could reveal more general insights. The authors replied that such extensions are expected (since prior work has shown similar effects for static images) but argued that testing this explicitly would require "substantial additional work," be "impractical," and likely not produce "surprising results." While practical difficulty alone is not a sufficient reason to leave an idea untested, I agree that the idea that "more neurons would help" would be unsurprising. The question then becomes: given that this is a conclusion already in the field, what new principle or understanding has been gained in this study? 

      As mentioned in our previous round of revisions, we chose not to pursue the comparison of reconstructions using different model architectures in this manuscript because we did not think it would add significant insights to the paper given the amount of work it would require, and we are glad the reviewer agrees. 

      While the fact that more neurons result in better reconstructions is unsurprising, how quickly performance drops off will depend on the robustness of the method, and on the dimensionality of the decoding/reconstruction task (decoding grating orientation likely requires fewer neurons than gray scale image reconstruction, which in turn likely requires fewer neurons than full color movie reconstruction). How dependent input optimization based image/movie reconstruction is on population size has not been shown, so we felt it was useful for readers to know how well movie reconstruction works with our method when recording from smaller numbers of neurons. 

      (3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1000 neurons and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that 7000 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields are too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? Originally, this question was meant to prompt deeper analysis of the neural data, but the authors did not engage with it, suggesting a limited understanding of the neuronal aspects of the dataset. 

      We apologize that we did not engage with this comment enough in the previous round. We assumed that the question arose because there was a misunderstanding about Figure 5: 1000 neurons, not 1 neuron, are sufficient to reconstruct the movies to a pixel-level correlation of 0.344. Of course, the fact that increasing the number of neurons from 1000 to 8000 only increased the reconstruction performance from 0.344 to 0.569 (a 65% increase in correlation) is still worth discussing. To illustrate this drop in performance qualitatively, we show 3 example frames from movie reconstructions using 1000-8000 neurons in Author response image 1.

      Author response image 1.

      3 example frames from reconstructions using different numbers of neurons. 
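      As a quick check of the figures quoted in this response, the arithmetic behind the reviewer's "~0.2" difference and the authors' "65% increase" can be written out. The function name `relative_increase` is just an illustrative helper.

```python
def relative_increase(before, after):
    """Return the fractional increase from `before` to `after`."""
    return (after - before) / before


# Pixel-level correlations quoted in the response for 1000 and ~8000 neurons.
corr_1000, corr_8000 = 0.344, 0.569

absolute_gain = corr_8000 - corr_1000
relative_gain = relative_increase(corr_1000, corr_8000)

print(f"absolute gain: {absolute_gain:.3f}")  # absolute gain: 0.225
print(f"relative gain: {relative_gain:.1%}")  # relative gain: 65.4%
```

      The two numbers describe the same change: an absolute gain of about 0.225 in correlation corresponds to a relative gain of about 65% over the 1000-neuron baseline.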

      As the reviewer points out, the diminishing returns of additional neurons to reconstruction performance is at least partly because there is redundancy in how a population of neurons represents visual stimuli. In supplementary figure S2, we inferred the on-off receptive fields of the neurons and show that visual space is oversampled in terms of the receptive field positions in panel C. However, the exact slope/shape of the performance vs population size curve we show in Figure 5 will also depend on the maximum performance of our reconstruction method, which is limited in spatial resolution (Figure 4 & Supplementary Figure S5). It is possible that future reconstruction approaches will require fewer neurons than ours, so we interpret this curve rather as a description of the reconstruction method itself than a feature of the underlying neuronal code. For that reason, we chose caution and refrained from making any claims about neuronal coding principles based on this plot. 

      (4) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this originally further raised questions: what is the theoretical capability for reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? In the revision, this concern was addressed nicely in Supplementary Figure 3C. Also, one appreciates that as a follow-up, the team produced error maps (New Figure 6) that highlight where in the frames the reconstructions are likely to fail. But the maps were not analyzed further, and I am not sure if there was a systematic trend in the errors.

      We are happy to hear that we were able to answer the reviewer’s question of what the maximum theoretical performance of our reconstruction process is in Supplementary Figure 3C. Regarding systematic trends in the error maps, we also did not observe any clear systematic trends. If anything, we noticed that some moving edges were shifted, but we do not think we can quantify this effect with this particular dataset.

      (5) I was encouraged by Figure 4, which shows how the reconstructions succeeded or failed across different spatial frequencies. The authors note that "the reconstruction process failed at high spatial frequencies," yet it also appears to struggle with low spatial frequencies, as the reconstructed images did not produce smooth surfaces (e.g., see the top rows of Figures 4A and 4B). In regions where one would expect a single continuous gradient, the reconstructions instead display specular, high-frequency noise. This issue is difficult to overlook and might deserve further discussion. 

      Thank you for pointing this out; this is indeed true. The reconstructions do have high-frequency noise. We mention this briefly in line 102: “Finally, we applied a 3D Gaussian filter with sigma 0.5 pixels to remove the remaining static noise (Figure S3) and applied the evaluation mask.” In revisiting this sentence, we think it is more appropriate to replace “remove” with “reduce”. This noise is more visible in the Gaussian noise stimuli (Figure 4) because we did not apply the 3D Gaussian filter to these reconstructions, in case it interfered with the estimates of the reconstruction resolution limits.
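      To make the filtering step concrete, here is a minimal NumPy-only sketch of a separable 3D Gaussian filter with sigma 0.5 applied to a (frames, height, width) array. It is a stand-in, not the authors' code: the function names, the truncation radius, and the zero-padded boundaries are assumptions, and in practice a library routine such as SciPy's `scipy.ndimage.gaussian_filter` would typically be used.

```python
import numpy as np


def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalized 1D Gaussian kernel, truncated at `truncate` standard deviations."""
    radius = max(1, int(truncate * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def gaussian_filter_3d(video, sigma=0.5):
    """Smooth a (frames, height, width) array along all three axes in turn."""
    kernel = gaussian_kernel_1d(sigma)
    out = video.astype(float)
    for axis in range(3):
        # A Gaussian is separable, so three 1D passes equal one 3D pass.
        # mode="same" keeps the length and implies zero padding at the edges.
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, out
        )
    return out


rng = np.random.default_rng(0)
noisy = rng.normal(size=(8, 16, 16))  # hypothetical clip of pure pixel noise
smoothed = gaussian_filter_3d(noisy, sigma=0.5)
print(noisy.std(), smoothed.std())  # the smoothed clip has lower variance
```

      With sigma 0.5 the kernel puts most of its weight on the central sample, which is consistent with the authors' wording that the filter reduces rather than removes the noise.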

      Given that the Gaussian noise and drifting grating stimuli reconstructions were from predicted activity (“noise-free”), this high-frequency noise is not biological in origin and must therefore come from errors in our reconstruction process. This kind of high-frequency noise has previously been observed in feature visualization (optimizing input to maximize the activity of a specific node within a neural network to visualize what that node encodes; Olah, et al., "Feature Visualization", https://distill.pub/2017/feature-visualization/, 2017). It is caused by a kind of overfitting, whereby a solution to the optimization is found that is not “realistic”. Ways of combating this kind of noise include gradient smoothing, image smoothing, and image transformations during optimization, but these methods can restrict the resolution of the features that are recovered. Since we were more interested in determining the maximum resolution of stimuli that can be reconstructed in Figure 4 and Supplementary Figures 5-6, we chose not to apply these methods.

      Reviewer #3 (Public review): 

      Summary: 

      This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration. 

      Strengths: 

      The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and number of recorded neurons will be useful to those planning future experiments. 

      Weaknesses: 

      The main contribution is methodological, and the methodology combines pre-existing components without any new original component. 

      We thank the reviewer for their balanced assessment of our manuscript.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      This paper presents a method for reconstructing videos from mouse visual cortex neuronal activity using a state-of-the-art dynamic neural encoding model. The authors achieve high-quality reconstructions of 10-second movies at 30 Hz from two-photon calcium imaging data, reporting a 2-fold increase in pixel-by-pixel correlation compared to previous methods. They identify key factors for successful reconstruction including the number of recorded neurons and model ensembling techniques. 

      Strengths: 

      (1) A comprehensive technical approach combining state-of-the-art neural encoding models with gradient-based optimization for video reconstruction. 

      (2) Thorough evaluation of reconstruction quality across different spatial and temporal frequencies using both natural videos and synthetic stimuli. 

      (3) Detailed analysis of factors affecting reconstruction quality, including population size and model ensembling effects. 

      (4) Clear methodology presentation with well-documented algorithms and reproducible code. 

      (5) Potential applications for investigating visual processing phenomena like predictive coding and perceptual learning. 

      We thank the reviewer for taking the time to provide this valuable feedback. We would like to add that in our eyes one additional main contribution is the step of going from reconstruction of static images to dynamic videos. We trust that in the revised manuscript, we have now made the point more explicit that static image reconstruction relies on temporally averaged responses, which negates the necessity of having to account for temporal dynamics altogether. 

      Weaknesses: 

      The main metric of success (pixel correlation) may not be the most meaningful measure of reconstruction quality: 

      High correlation may not capture perceptually relevant features.

      Different stimuli producing similar neural responses could have low pixel correlations.

      The paper doesn't fully justify why high pixel correlation is a valuable goal.

      This is a very relevant point. In retrospect, perhaps we did not justify this enough. Sensory reconstruction typically aims to reconstruct sensory input based on brain activity as faithfully as possible. A brain-to-image decoder might therefore be trained to produce images as close to the original input as possible. The loss function to train the decoder would therefore be image similarity on the pixel level. In that case, evaluating reconstruction performance based on pixel correlation is somewhat circular. 

      However, when reconstructing videos, we optimize the input video in terms of its perceptual similarity to the original video and only then evaluate pixel-level similarity. The perceptual similarity metric we optimize for is the estimate of how the neurons in mouse V1 respond to that video. We then evaluate the similarity of this perceptually optimized video to the original input video with pixel-level correlation. In other words, we optimize for perceptual similarity and then evaluate pixel similarity. If our method optimized pixel-level similarity, then we would agree that perceptual similarity is a more relevant evaluation metric. We do not think it was clear in our original submission that our optimization loss function is a perceptual loss function, and have now made this clearer in Figure 1C-D and have clarified this in the results section, line 70:

      “In effect, we optimized the input video to be perceptually similar with respect to the recorded neurons.”

      And in line 110: 

      “Because our optimization of the movies was based on a perceptual loss function, we were interested in how closely these movies matched the originals on the pixel level.”
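      The distinction between what is optimized and what is evaluated can be sketched with a toy example. Here a random linear map stands in for the encoding model, the loss is a squared error between predicted and recorded responses, and pixel correlation is computed only afterwards. All of these are simplifying assumptions for illustration: the actual study backpropagates through a deep dynamic encoding model, and its exact loss function is not given in this response.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_neurons = 64, 200

# Hypothetical linear "encoding model": pixels -> neural responses.
W = rng.normal(size=(n_neurons, n_pixels))
true_stimulus = rng.normal(size=n_pixels)
recorded = W @ true_stimulus + 0.1 * rng.normal(size=n_neurons)  # noisy responses


def response_loss(stimulus):
    """Squared error between predicted and recorded responses (never sees pixels)."""
    residual = W @ stimulus - recorded
    return 0.5 * float(residual @ residual)


# Gradient descent on the stimulus itself, starting from a blank input.
stimulus = np.zeros(n_pixels)
step = 1.0 / np.linalg.norm(W, 2) ** 2  # 1 / largest eigenvalue of W^T W
initial_loss = response_loss(stimulus)
for _ in range(500):
    gradient = W.T @ (W @ stimulus - recorded)
    stimulus -= step * gradient
final_loss = response_loss(stimulus)

# Evaluation step: pixel-level correlation with the ground truth.
pixel_corr = float(np.corrcoef(stimulus, true_stimulus)[0, 1])
print(initial_loss, final_loss, pixel_corr)
```

      The ground-truth pixels enter only in the last step, so in this sketch the pixel correlation is an independent check on the optimization rather than a restatement of its objective.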

      We chose to use pixel correlation to measure pixel-level similarity for several reasons. 1) It has been used in the past to evaluate reconstruction performance (Yoshida et al., 2020), 2) It is contrast and luminance insensitive, 3) correlation is a common metric so most readers will have an intuitive understanding of how it relates to the data. 
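      The second reason can be demonstrated directly: Pearson correlation is unchanged when a frame is rescaled by a positive factor (contrast) or shifted by a constant (luminance), whereas mean squared error is not. The frame size and the scaling values below are arbitrary choices for illustration.

```python
import numpy as np


def pixel_correlation(a, b):
    """Pearson correlation between two frames, computed over all pixels."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


rng = np.random.default_rng(1)
frame = rng.random((36, 64))   # arbitrary example frame with values in [0, 1)
altered = 0.5 * frame + 0.2    # half the contrast, raised luminance

corr = pixel_correlation(frame, altered)
mse = float(np.mean((frame - altered) ** 2))
print(corr, mse)  # correlation stays at 1 (up to rounding) while MSE is nonzero
```

      The same property means that pixel correlation cannot detect a reconstruction with the wrong overall contrast or brightness, which is presumably why the authors treat contrast and luminance matching as a separate step.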

      To further highlight why pixel similarity might be interesting to visualize, we have included additional analysis in Figure 6 illustrating pixel-level differences between reconstructions from experimentally recorded activity and predicted activity. 

      We expect that the type of perceptual similarity the reviewer is alluding to is pretrained neural network image embedding similarity (Zhang et al., 2018: https://doi.org/10.48550/arXiv.1801.03924). While these metrics seem to match human perceptual similarity, it is unclear if they reflect mouse vision. We did try to compare the embedding similarity from pretrained networks such as VGG16, but got results suggesting the reconstructed frames were no more similar to the ground truth than random frames, which is obviously not true. This might be because the ground truth videos were too different in resolution from the training data of these networks and because these metrics are typically very sensitive to decreases in resolution. 

      The best alternative approach to evaluate mouse perceptual similarity would be to show the reconstructed videos to the same animals while recording the same neurons and to compare these neural activation patterns to those evoked by the original ground truth videos. This has been done for static images in the past: Cobos et al., bioRxiv 2022, found that static image reconstructions generated using gradient descent evoked more similar trial-averaged (40 trials) responses to those evoked by ground truth images compared to other reconstruction methods. Unfortunately, we are currently not able to perform these in vivo experiments, which is why we used publicly available data for the current paper. We plan to use this method in the future. But this method is also not flawless as it assumes that the average response to an image is the best reflection of how that image is represented, which may not be the case for an individual trial.

      As far as we are aware, there is currently no method that, given a particular activity pattern in response to an image/video, can produce an image/video that induces a neural activity pattern that is closer to the original neural response than simply showing the same image/video again. Hypothetically, such a stimulus exists because of various visual processing phenomena we mention in our discussion (e.g., predictive coding and selective attention), which suggest that the image that is represented by a population of neurons likely differs from the original sensory input. In other words, what the brain represents is an interpretation of reality not a pure reflection. Experimentally verifying this is difficult, as these variations might be present on a single trial level. The first step towards establishing a method that captures the visual representation of a population of neurons is sensory reconstruction, where the aim is to get as close as possible to the original sensory input. We think pixel-level correlation is a stringent and interpretable metric for this purpose, particularly when optimizing for perceptual similarity rather than image similarity directly.

      Comparison to previous work (Yoshida et al.) has methodological concerns: Direct comparison of correlation values across different datasets may be misleading; Large differences in the number of recorded neurons (10x more in the current study); Different stimulus types (dynamic vs static) make comparison difficult; No implementation of previous methods on the current dataset or vice versa. 

      Yes, we absolutely agree that direct comparison to previous static image reconstruction methods is problematic. We primarily do so because we think it is standard practice to give related baselines. We agree that direct comparison of the performance of video reconstruction methods to image reconstruction methods is not really possible. It does not make sense to train and apply a dynamic model on a static image data set where neural activity is time-averaged, as the temporal kernels could not be learned. Conversely, for a static model, which expects a single image as input and predicts time-averaged responses, it does not make sense to feed it a series of temporally correlated movie frames and to simply concatenate the resulting activity predictions. The static model would need to be substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have now added these caveats in line 119:

      “However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

      We have also toned down the language emphasising the comparison to previous image reconstruction performance in the abstract, results, and conclusion.

      Abstract: We removed “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” and replaced it with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

      Discussion: we removed “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” and replaced with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

      We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring). 

      Limited exploration of how the reconstruction method could provide insights into neural coding principles beyond demonstrating technical capability. 

      The aim of this paper was not to reveal principles of neural coding. Instead, we aimed to achieve the best possible performance of video reconstructions and to quantify the limitations. But to highlight its potential we have added two examples of how sensory reconstruction has been applied in human vision research in line 321: 

      “Although fMRI-based reconstruction techniques are starting to be used to investigate visual phenomena in humans (such as illusions [Cheng et al., 2023] and mental imagery [Shen et al., 2019; Koide-Majima et al., 2024; Kalantari et al., 2025]), visual processing phenomena are likely difficult to investigate using existing fMRI-based reconstruction approaches, due to the low spatial and temporal resolution of the data.”

      We have also added a demonstration of how this method could be used to investigate which parts of a reconstruction from a single-trial response differ from the model's prediction (Figure 6). We do this by calculating pixel-level differences between reconstructions from the recorded neural activity and reconstructions from the expected neural activity (the activity predicted by the neural encoding model). Although difficult to interpret, this pixel-by-pixel error map could represent trial-by-trial deviations of the neural code from pure sensory representation. But at this point we cannot know whether these errors are nothing more than errors in the reconstruction process. Deriving meaningful interpretations of these maps would require a substantial amount of additional work and in vivo experiments and so is outside the scope of this paper, but we include this additional analysis now a) to highlight why pixel-level similarity might be interesting to quantify and visualize and b) to demonstrate how video reconstruction could be used to provide insights into neural coding, namely as a tool to identify how sensory representations differ from a pure reflection of the visual input.
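
      As an illustration of the error map described above, the following is a minimal sketch, not the authors' implementation. It assumes both reconstructions are available as arrays of shape (frames, height, width); the function name, the toy frame size, and the random data are hypothetical.

```python
import numpy as np

def pixel_error_map(recon_recorded, recon_predicted):
    """Signed per-pixel difference between two reconstructions.

    Both inputs are arrays of shape (frames, height, width).
    """
    recon_recorded = np.asarray(recon_recorded, dtype=float)
    recon_predicted = np.asarray(recon_predicted, dtype=float)
    if recon_recorded.shape != recon_predicted.shape:
        raise ValueError("reconstructions must have the same shape")
    return recon_recorded - recon_predicted

# Toy data standing in for real reconstructions (hypothetical values).
rng = np.random.default_rng(0)
recon_predicted = rng.random((4, 36, 64))
recon_recorded = recon_predicted.copy()
recon_recorded[:, 10:15, 20:30] += 0.5  # a localized deviation

error_map = pixel_error_map(recon_recorded, recon_predicted)
# Summary over time: mean absolute deviation per pixel, shape (height, width).
mean_abs_error = np.abs(error_map).mean(axis=0)
```

      In this toy case the summary map is non-zero only in the patch where the two reconstructions were made to differ, which is the kind of spatial localization such a map is meant to expose.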

      The claim that "stimulus reconstruction promises a more generalizable approach" (line 180) is not well supported with concrete examples or evidence. 

      What we mean by generalizable is the ability to apply reconstruction to novel stimuli, which is not possible for stimulus classification. We now explain this better in the paragraph in line 211: 

      “Stimulus identification, i.e. identifying the most likely stimulus from a constrained set, has been a popular approach for quantifying whether a population of neurons encodes the identity of a particular stimulus [Földiák, 1993, Kay et al., 2008]. This approach has, for instance, been used to decode frame identity within a movie [Deitch et al., 2021, Xia et al., 2021, Schneider et al., 2023, Chen et al., 2024]. Some of these approaches have also been used to reorder the frames of the ground truth movie [Schneider et al., 2023] based on the decoded frame identity. Importantly, stimulus identification methods are distinct from stimulus reconstruction where the aim is to recreate what the sensory content of a neuronal code is in a way that generalizes to new sensory stimuli [Rakhimberdina et al., 2021]. This is inherently a more demanding task because the range of possible solutions is much larger. Although stimulus identification is a valuable tool for understanding the information content of a population code, stimulus reconstruction could provide a more generalizable approach, because it can be applied to novel stimuli.”

      None of the stimuli we reconstructed were in the training set of the model, i.e., they were all novel. We have also toned down the claim: we have replaced “promises” with “could provide”.

      The paper would benefit from addressing how the method handles cases where different stimuli produce similar neural responses, particularly for high-speed moving stimuli where phase differences might be lost in calcium imaging temporal resolution. 

      Thank you for this suggestion; we think this is a great question. Calcium dynamics are slow and some of the high temporal frequency information could indeed be lost, particularly phase information. In other words, when the stimulus has high temporal frequency information, it is harder to decode spatial information because of the slow calcium dynamics. Ideally, we would look at this effect using the drifting grating stimuli; however, this is problematic because we rely on predicted activity from the SOTA DNEM, and due to the dilation of the first convolution, the periodic grating stimulus causes aliasing. At 15 Hz, when the temporal frequency of the stimulus is half the movie frame rate, the model is actually being given two static images, and so the predicted activity is the interleaved activity evoked by two static images. We therefore do not think using the grating stimuli is a good idea. Instead, we have used the Gaussian noise stimuli, which are not periodic and are therefore less of a problem.
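
      The sampling argument above can be checked numerically. The sketch below is an illustration only: it assumes a 30 Hz frame rate (twice the 15 Hz quoted above) and a hypothetical 1D sinusoidal grating, and it does not model the DNEM or its dilated convolution.

```python
import numpy as np

frame_rate_hz = 30.0               # assumed: twice the 15 Hz quoted above
temporal_freq_hz = 15.0
spatial_freq_cycles_per_px = 0.1   # hypothetical value
n_frames, width = 8, 64

x = np.arange(width)
t = np.arange(n_frames) / frame_rate_hz
# Each row is one frame of a 1D drifting grating.
frames = np.sin(2 * np.pi * (spatial_freq_cycles_per_px * x[None, :]
                             - temporal_freq_hz * t[:, None]))

# The phase advances by pi per frame, so only two distinct images occur.
even_frames_identical = np.allclose(frames[0::2], frames[0])
odd_frames_identical = np.allclose(frames[1::2], frames[1])
odd_is_inverted_even = np.allclose(frames[1], -frames[0])
```

      Under these assumptions the sampled movie alternates between one image and its contrast-inverted copy, so the direction of drift is no longer recoverable from the frames.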

      We have now also reconstructed phase-inverted Gaussian noise stimuli and plotted the video correlation between the reconstructions from activity evoked by phase-inverted stimuli. On the one hand, we find that even for the fastest changing stimuli, the correlation between the reconstructions from phase-inverted stimuli is negative, meaning phase information is not lost at high temporal frequencies. On the other hand, for the highest spatial frequency stimuli, the correlation is positive. So, the predicted neural activity (and therefore the reconstructions) are phase-insensitive when the spatial frequency is higher than the reconstruction resolution limit we identified (spatial length constant of 1 pixel, or 3.38 degrees). Beyond this limit, the DNEM predicts activity in response to phase-inverted stimuli, which, when used for reconstruction, results in movies which are more similar to each other than the stimulus that actually evokes them.

      However, not all information is lost at these high spatial frequencies. If we plot the Shannon entropy in the spatial domain or the motion energy in the temporal domain, we find that even when the reconstructions fail to capture the stimulus at a pixel-specific level (spatial length constant of 1 pixel, or 3.38 degrees), they do capture the general spatial and temporal qualities of the videos. 

      We have added these additional analyses to Figure 4 and Supplementary Figure 5.

      Reviewer #2 (Public review): 

      This is an interesting study exploring methods for reconstructing visual stimuli from neural activity in the mouse visual cortex. Specifically, it uses a competition dataset (published in the Dynamic Sensorium benchmark study) and a recent winning model architecture (DNEM, dynamic neural encoding model) to recover visual information stored in ensembles of the mouse visual cortex. 

      This is a great project - the physiological data were measured at a single-cell resolution, the movies were reasonably naturalistic and representative of the real world, the study did not ignore important correlates such as eye position and pupil diameter, and of course, the reconstruction quality exceeded anything achieved by previous studies. Overall, it is great that teams are working towards exploring image reconstruction. Arguably, reconstruction may serve as an endgame method for examining the information content within neuronal ensembles - an alternative to training interminable numbers of supervised classifiers, as has been done in other studies. Put differently, if a reconstruction recovers a lot of visual features (maybe most of them), then it tells us a lot about what the visual brain is trying to do: to keep as much information as possible about the natural world in which its internal motor circuits may act consequently. 

      While we enjoyed reading the manuscript, we admit that the overall advance was in the range of those that one finds in a great machine learning conference proceedings paper. More specifically, we found no major technical flaws in the study, only a few potential major confounds (which should be addressable with new analyses), and the manuscript did not make claims that were not supported by its findings, yet the specific conceptual advance and significance seemed modest. Below, we will go through some of the claims, and ask about their potential significance. 

      We thank the reviewer for the positive feedback on our paper.

      (1) The study showed that it could achieve high-quality video reconstructions from mouse visual cortex activity using a neural encoding model (DNEM), recovering 10-second video sequences and approaching a two-fold improvement in pixel-by-pixel correlation over previous attempts. As a reader, I am left with the question: okay, does this mean that we should all switch to DNEM for our investigations of the mouse visual cortex? What makes this encoding model special? It is introduced as "a winning model of the Sensorium 2023 competition which achieved a score of 0.301... single-trial correlation between predicted and ground truth neuronal activity," but as someone who does not follow this competition (most eLife readers are not likely to do so, either), I do not know how to gauge my response. Is this impressive? What is the best achievable score, in theory, given data noise? Is the model inspired by the mouse brain in terms of mechanisms or architecture, or was it optimized to win the competition by overfitting it to the nuances of the data set? Of course, I know that as a reader, I am invited to read the references, but the study would stand better on its own if it clarified how its findings depended on this model.

      This is a very good point. We do not think that everyone should switch to using this particular DNEM to investigate the mouse visual cortex, but we think DNEMs and stimulus reconstruction in general have a lot of potential. We think static neural encoding models have already been demonstrated to be an extremely valuable tool to investigate visual coding (Walker et al., 2019; Yoshida et al., 2021; Willeke et al., bioRxiv 2023). DNEMs are less common, largely because they are very large and are technically more demanding to train and use. That makes static encoding models more practical for some applications, but they do not have temporal kernels and are therefore only used for static stimuli. They cannot, for instance, encode direction tuning, only orientation tuning. But both static and dynamic encoding models have advantages over stimulus classification methods, which we outline in our discussion. Here we provide the first demonstration that previous achievements in static image reconstruction are transferable to movies.

      It has been shown in the past for static neural encoding models that choosing a better-performing model produces reconstructed static images that are closer to the original image (Pierzchlewicz et al., 2023). The factors in choosing this particular DNEM were its capacity to predict neural activity (benchmarked against other models), the fact that it was open source, and the availability of the data it was designed for.

      To give more context to the model used in the paper, we have included the following, line 348:

      “This model achieved an average single-trial correlation between predicted and ground truth neural activity of 0.291 during the competition, this was later improved to 0.301. The competition benchmark models achieved 0.106, 0.164 and 0.197 single-trial correlation, while the third and second place models achieved 0.243 and 0.265. Across the models, a variety of architectural components were used, including 2D and 3D convolutional layers, recurrent layers, and transformers, to name just a few.” 

      Concerning biologically inspired model design: the winning model contained 3 fully connected layers comprising the “Cortex” just before the final readout of neural activity, but we would consider this level of biological inspiration as minor. We do not think that the exact architecture of the model is particularly important, as the crucial aspect of such neural encoders is their ability to predict neural activity irrespective of how they achieve it. There has been a move towards creating foundation models of the brain (Wang et al., 2025) and the priority so far has been on predictive performance over mechanistic interpretability or similarity to biological structures and processes.

      Finally, we would like to note that we do not know what the maximum theoretical score for single-trial responses might be, and don't think there is a good way of estimating it in this context. 

      (2) Along those lines, two major conclusions were that "critical for high-quality reconstructions are the number of neurons in the dataset and the use of model ensembling." If true, then these principles should be applicable to networks with different architectures. How well can they do with other network types? 

      This is a good question. Our method critically relies on the accurate prediction of neural activity in response to new videos. It is therefore expected that a model that better predicts neural responses to stimuli will also be better at reconstructing those stimuli given population activity. This was previously shown for static images (Pierzchlewicz et al., 2023). It is also expected that whenever the neural activity is accurately predicted, the corresponding reconstructed frames will also be more similar to the ground truth frames. We have now demonstrated this relationship between prediction accuracy and reconstruction accuracy in supplementary figure 4.

      Although it would be interesting to compare the movie reconstruction performance of many different models with different architectures and activity prediction performances, this would involve quite substantial additional work because movie reconstruction is very resource- and time-intensive. Finding optimal hyperparameters to make such a comparison fair and informative would therefore be impractical and likely not yield surprising results. 

      We also think it is unlikely that ensembling would not improve reconstruction performance in other models because ensembling across model predictions is a common way of improving single-model performance in machine learning. Likewise, we think it is unlikely that the relationship between neural population size and reconstruction performance would differ substantially when using different models, because using more neurons means that a larger population of noisy neurons is “voting” on what the stimulus is. However, we would expect that if the model were worse at predicting neural activity, then more neurons are needed for an equivalent reconstruction performance. In general, we would recommend choosing the best possible DNEM available, in terms of neural activity prediction performance, when reconstructing movies using input optimization through gradient descent. 

      (3) One major claim was that the quality of the reconstructions depended on the number of neurons in the dataset. There were approximately 8000 neurons recorded per mouse. The correlation difference between the reconstruction achieved by 1 neuron and 8000 neurons was ~0.2. Is that a lot or a little? One might hypothesize that ~7,999 additional neurons could contribute more information, but perhaps, those neurons were redundant if their receptive fields were too close together or if they had the same orientation or spatiotemporal tuning. How correlated were these neurons in response to a given movie? Why did so many neurons offer such a limited increase in correlation? 

      In the population ablation experiments, we compared the performance using ~1000, ~2000, ~4000, and ~8000 neurons, and found an attenuation of 39.5% in video correlation when dropping 87.5% of the neurons (~1000 neurons remaining). We did not try reconstruction using just 1 neuron.

      (4) On a related note, the authors address the confound of RF location and extent. The study resorted to the use of a mask on the image during reconstruction, applied during training and evaluation (Line 87). The mask depends on pixels that contribute to the accurate prediction of neuronal activity. The problem for me is that it reads as if the RF/mask estimate was obtained during the very same process of reconstruction optimization, which could be considered a form of double-dipping (see the "Dead salmon" article, https://doi.org/10.1016/S1053-8119(09)71202-9). This could inflate the reconstruction estimate. My concern would be ameliorated if the mask was obtained using a held-out set of movies or image presentations; further, the mask should shift with eye position, if it indeed corresponded to the "collective receptive field of the neural population." Ideally, the team would also provide the characteristics of these putative RFs, such as their weight and spatial distribution, and whether they matched the biological receptive fields of the neurons (if measured independently). 

      We can reassure the reviewer that there is no double-dipping. We would like to clarify that the mask was trained only on videos from the training set of the DNEM and not the videos which were reconstructed. We have added the sentence, line 91: 

      “None of the reconstructed movies were used in the optimization of this transparency mask.”

      Making the mask dependent on eye position would be difficult to implement with the current DNEM, where eye position is fed to the model as an additional channel. When using a model where the image is first transformed into retinotopic coordinates in an eye position-dependent manner (such as in Wang et al., 2025) the mask could be applied in retinotopic coordinates and therefore be dependent on eye position. 

      Effectively, the alpha mask defines the relative level of influence each pixel contributes to neural activity prediction. We agree it is useful to compare the shape of the alpha mask with the location of traditional on-off receptive fields (RFs) to clarify what the alpha mask represents and characterise the neural population available for our reconstructions. We therefore presented the DNEM with on-off patches to map the receptive fields of single neurons in an in silico experiment (the experimentally derived RFs are not available). As expected, there is a rough overlap between the alpha mask (Supplementary Figure 2D), the average population receptive field (Supplementary Figure 2B), and the location of receptive field peaks (Supplementary Figure 2C). In principle, all three could be used during training or evaluation for masking, but we think that defining a mask based on the general influence of images on neural activity, rather than just on on-off patch responses, is a more elegant solution.

      One idea of how to go a step further would be to first set the alpha mask threshold during training based on the % loss of neural activity prediction performance that threshold induces (in our case alpha=0.5 corresponds to ~3% loss in correlation between predicted vs recorded neural responses, see Supplementary Figure 3D), and second base the evaluation mask on a pixel correlation threshold (see example pixel correlation map in Supplementary Figure 2E) instead to avoid evaluating areas of the image with low image reconstruction confidence. 

      We referred to this figure in the result section, line 83:

      “The transparency masks are aligned with but not identical to the On-Off receptive field distribution maps using sparse-noise (Figure S2).” 

      We have also done additional analysis on the effect of masking during training and evaluation with different thresholds in Supplementary Figure 3.
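
      To make the role of an evaluation mask concrete, the sketch below thresholds a continuous alpha mask and computes the correlation only over the retained pixels. This is an assumption about how such a mask might be applied, not a description of the authors' pipeline; the function name, the Gaussian-shaped mask, and the toy data are hypothetical, and only the 0.5 threshold is taken from the text above.

```python
import numpy as np

def masked_video_correlation(recon, ground_truth, alpha_mask, threshold=0.5):
    """Pearson correlation restricted to pixels with alpha >= threshold."""
    keep = np.asarray(alpha_mask) >= threshold      # (height, width) boolean
    if not keep.any():
        raise ValueError("threshold excludes every pixel")
    a = np.asarray(recon, dtype=float)[:, keep].ravel()
    b = np.asarray(ground_truth, dtype=float)[:, keep].ravel()
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
height, width = 36, 64
yy, xx = np.mgrid[0:height, 0:width]
# Hypothetical smooth alpha mask peaking at the frame centre.
alpha_mask = np.exp(-(((yy - 18) / 10.0) ** 2 + ((xx - 32) / 16.0) ** 2))

ground_truth = rng.random((5, height, width))
recon = rng.random((5, height, width))        # uninformative outside the mask
inside = alpha_mask >= 0.5
recon[:, inside] = ground_truth[:, inside]    # perfect inside the mask

corr_masked = masked_video_correlation(recon, ground_truth, alpha_mask)
corr_full = np.corrcoef(recon.ravel(), ground_truth.ravel())[0, 1]
```

      The toy example shows why the choice of mask matters for any cross-study comparison: the same reconstruction scores very differently depending on whether pixels outside the population's coverage are included.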

      (5) We appreciated the experiments testing the capacity of the reconstruction process, by using synthetic stimuli created under a Gaussian process in a noise-free way. But this further raised questions: what is the theoretical capability for the reconstruction of this processing pipeline, as a whole? Is 0.563 the best that one could achieve given the noisiness and/or neuron count of the Sensorium project? What if the team applied the pipeline to reconstruct the activity of a given artificial neural network's layer (e.g., some ResNet convolutional layer), using hidden units as proxies for neuronal calcium activity? 

      That’s a very interesting point. It is very hard to know what the theoretical best reconstruction performance of the model would be. Reconstruction performance could be decreased due to neural variability, experimental noise, the temporal kernel of the calcium indicator and the imaging frame rate, information compression along the visual hierarchy, visual processing phenomena (such as predictive coding and selective attention), failure of the model to predict neural activity correctly, or failure of the reconstruction process to find the best possible image which explains the neural activity. We don't think we can disentangle the contribution of all these sources, but we can provide a theoretical maximum assuming that the model and the reconstruction process are optimal. To that end, we performed additional simulations and reconstructed the natural videos using the predicted activity of the neurons in response to the natural videos as the target (similar to the synthetic stimuli) and got a correlation of 0.766. So, the single trial performance of 0.569 is ~75% of this theoretical maximum. This difference can be interpreted as a combination of the losses due to neuronal variability, measurement noise, and actual deviations in the images represented by the brain compared to reality. 
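
      The ratio quoted above can be verified directly from the two correlations reported in this response; the variable names are illustrative only.

```python
theoretical_max_correlation = 0.766   # reconstruction from predicted activity
single_trial_correlation = 0.569      # reconstruction from recorded activity

fraction_of_max = single_trial_correlation / theoretical_max_correlation
remaining_gap = theoretical_max_correlation - single_trial_correlation
print(f"{fraction_of_max:.1%} of the theoretical maximum; gap = {remaining_gap:.3f}")
```

      The ratio comes out at roughly 74%, consistent with the ~75% stated above.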

      We thank the reviewer for this suggestion, as it gave us the idea of looking at error maps (Figure 6), where the pixel-level deviation of the reconstructions from recorded vs predicted activity is overlaid on the ground truth movie.

      (6) As the authors mentioned, this reconstruction method provided a more accurate way to investigate how neurons process visual information. However, this method consisted of two parts: one was the state-of-the-art (SOTA) dynamic neural encoding model (DNEM), which predicts neuronal activity from the input video, and the other part reconstructed the video to produce a response similar to the predicted neuronal activity. Therefore, the reconstructed video was related to neuronal activity through an intermediate model (i.e., SOTA DNEM). If one observes a failure in reconstructing certain visual features of the video (for example, high-spatial frequency details), the reader does not know whether this failure was due to a lack of information in the neural code itself or a failure of the neuronal model to capture this information from the neural code (assuming a perfect reconstruction process). Could the authors address this by outlining the limitations of the SOTA DNEM encoding model and disentangling failures in the reconstruction from failures in the encoding model? 

      To test if a better neural prediction by the DNEM would result in better reconstructions, we ran additional simulations and now show that neural activity prediction performance correlates with reconstruction performance (Supplementary Figure 4B). This is consistent with Pierzchlewicz et al. (2023), who showed that static image reconstructions using better encoding models lead to better reconstruction performance. As also mentioned in the answer to the previous comment, untangling the relative contributions of reconstruction losses is hard, but we think that improvements to the DNEM performance are key. Suggestions for improving the DNEM we used would be to translate the input image into retinotopic coordinates and shift this image relative to eye position before passing it to the first convolutional layer (as is done in Wang et al., 2025), to use movies which are not spatially downsampled as heavily, to not use a dilation of 2 in the temporal convolution of the first layer, and to train on a larger dataset.

      (7) The authors mentioned that a key factor in achieving high-quality reconstructions was model ensembling. However, this averaging acts as a form of smoothing, which reduces the reconstruction's acuity and may limit the high-frequency content of the videos (as mentioned in the manuscript). This averaging constrains the tool's capacity to assess how visual neurons process the low-frequency content of visual input. Perhaps the authors could elaborate on potential approaches to address this limitation, given the critical importance of high-frequency visual features for our visual perception.

      This is exactly what we also thought. To answer this point more specifically, we ran additional simulations where we also reconstruct the movies using gradient ensembling instead of reconstruction ensembling. Here, the gradients of the loss with respect to each pixel of the movie are calculated for each of the model instances and are averaged at every iteration of the reconstruction optimization. In essence, this means that one reconstruction solution is found, and the averaging across reconstructions, which could degrade high-frequency content, is skipped. The reconstructions from both methods look very similar, and the video correlation is, if anything, slightly worse (Supplemental Figure 3A&C). This indicates that our original ensembling approach did not limit reconstruction performance, but that both approaches can be used, depending on what is more convenient given hardware restrictions.
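
      The difference between the two ensembling schemes can be sketched on a toy problem. The example below is an illustration under strong assumptions: each "model instance" is a hypothetical linear encoder, the loss is mean squared error rather than the loss used in the paper, and plain gradient descent stands in for the actual optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_neurons, n_pixels = 3, 50, 20

# Hypothetical stand-ins for trained model instances: linear encoders W @ x.
true_stimulus = rng.normal(size=n_pixels)
base_weights = rng.normal(size=(n_neurons, n_pixels))
models = [base_weights + 0.1 * rng.normal(size=base_weights.shape)
          for _ in range(n_models)]
target_activity = base_weights @ true_stimulus

def loss_gradient(weights, stimulus, target):
    """Gradient of the mean squared error between predicted and target activity."""
    residual = weights @ stimulus - target
    return 2.0 * weights.T @ residual / target.size

def optimize(gradient_fn, n_steps=2000, learning_rate=0.05):
    """Plain gradient descent on the stimulus, starting from zeros."""
    stimulus = np.zeros(n_pixels)
    for _ in range(n_steps):
        stimulus -= learning_rate * gradient_fn(stimulus)
    return stimulus

# Reconstruction ensembling: optimize once per model, then average the results.
per_model = [optimize(lambda s, w=w: loss_gradient(w, s, target_activity))
             for w in models]
recon_ensembled = np.mean(per_model, axis=0)

# Gradient ensembling: average the gradients across models at every step.
grad_ensembled = optimize(
    lambda s: np.mean([loss_gradient(w, s, target_activity) for w in models],
                      axis=0)
)

corr_recon = np.corrcoef(recon_ensembled, true_stimulus)[0, 1]
corr_grad = np.corrcoef(grad_ensembled, true_stimulus)[0, 1]
```

      Gradient ensembling runs a single optimization that queries every model at each step, whereas reconstruction ensembling runs one optimization per model and averages only at the end, which is the trade-off with hardware constraints mentioned above.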

      Reviewer #3 (Public review): 

      Summary: 

      This paper presents a method for reconstructing input videos shown to a mouse from the simultaneously recorded visual cortex activity (two-photon calcium imaging data). The publicly available experimental dataset is taken from a recent brain-encoding challenge, and the (publicly available) neural network model that serves to reconstruct the videos is the winning model from that challenge (by distinct authors). The present study applies gradient-based input optimization by backpropagating the brain-encoding error through this selected model (a method that has been proposed in the past, with other datasets). The main contribution of the paper is, therefore, the choice of applying this existing method to this specific dataset with this specific neural network model. The quantitative results appear to go beyond previous attempts at video input reconstruction (although measured with distinct datasets). The conclusions have potential practical interest for the field of brain decoding, and theoretical interest for possible future uses in functional brain exploration. 

      Strengths: 

      The authors use a validated optimization method on a recent large-scale dataset, with a state-of-the-art brain encoding model. The use of an ensemble of 7 distinct model instances (trained on distinct subsets of the dataset, with distinct random initializations) significantly improves the reconstructions. The exploration of the relation between reconstruction quality and the number of recorded neurons will be useful to those planning future experiments. 

      Weaknesses: 

      The main contribution is methodological, and the methodology combines pre-existing components without any new original components. 

      We thank the reviewer for taking the time to review our paper and for their overall positive assessment. We would like to emphasise that combining pre-existing machine learning techniques to achieve top results in a new modality does require iteration and innovation. While gradient-based input optimization by backpropagating the brain-encoding error through a neural encoding model has been used in 2D static image optimization to generate maximally exciting images and reconstruct static images, we are the first to have applied it to movies, which required accounting for the time domain. Previous methods used time-averaged responses and were limited to the reconstruction of static images presented with fixed image intervals.

      The movie reconstructions include a learned "transparency mask" to concentrate on the most informative area of the frame; it is not clear how this choice impacts the comparison with prior experiments. Did they all employ this same strategy? If not, shouldn't the quantitative results also be reported without masking, for a fair comparison? 

      Yes, absolutely. All reconstruction approaches limit the field of view in some way, whether this is due to the size of the screen, the size of the image on the screen, or cropping of the presented/reconstructed images during analysis due to the retinotopic coverage of the recorded neurons. Note that we reconstruct a larger field of view than Yoshida et al. In Yoshida et al., the reconstructed field of view was 43 by 43 retinal degrees. We show the size of an example evaluation mask in comparison.

      To address the reviewer’s concern more specifically, we performed additional simulations and now also show the performance using a variety of different training and evaluation masks, including different alpha thresholds for training and evaluation masks as well as the effective retinotopic coverage at different alpha thresholds. Despite these comparisons, we would also like to highlight that the comparison to the benchmark is problematic itself. This is because image and movie reconstruction are not directly comparable. It does not make sense to train and apply a dynamic model on a static image dataset where neural activity is time averaged. Conversely, it does not make sense to train or apply a static model that expects time-averaged neural responses on continuous neural activity unless it is substantially augmented to incorporate temporal dynamics, which in turn would make it a new method. This puts us in the awkward position of being expected to compare our video reconstruction performance to previous image reconstruction methods without a fair way of doing so. We have therefore de-emphasised the phrasing comparing our method to previous publications in the abstract, results, and discussion. 

      Abstract: We replaced “We achieve a ~2-fold increase in pixel-by-pixel correlation compared to previous state-of-the-art reconstructions of static images from mouse V1, while also capturing temporal dynamics.” with “We achieve a pixel-level correlation of 0.57 between the ground truth movie and the reconstructions from single-trial neural responses.”

      Results: “This represents a ~2x higher pixel-level correlation over previous single-trial static image reconstructions from V1 in awake mice (image correlation 0.238 +/- 0.054 s.e.m for awake mice) [Yoshida et al., 2020] over a similar retinotopic area (~43° x 43°) while also capturing temporal dynamics. However, we would like to stress that directly comparing static image reconstruction methods with movie reconstruction approaches is fundamentally problematic, as they rely on different data types both during training and evaluation (temporally averaged vs continuous neural activity, images flashed at fixed intervals vs continuous movies).”

      Discussion: We replaced “In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ~2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods.” with “In conclusion, we reconstruct videos presented to mice based on single-trial activity of neurons in the mouse visual cortex.”

      We have also removed the performance table and have instead added supplementary figure 3 with in-depth comparison across different versions of our reconstruction method (variations of masking, ensembling, contrast & luminance matching, and Gaussian blurring). 

      We believe that we have given enough information in our paper now so that readers can make an informed decision whether our movie reconstruction method is appropriate for the questions they are interested in.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors): 

      (1) "Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth." This was not clear: was it done by the investigating team? I imagine that one of the most easily captured visual features is luminance and contrast, why wouldn't the optimization titrate these well? 

      The contrast and luminance matching of the reconstructions to the ground truth videos was done by us, but this was only done to help readers assess the quality of the reconstructions by eye. Our performance metrics (frame and video correlation) are contrast and luminance insensitive. To clarify this, we have also added examples of non-adjusted frames in Supplementary Figure 3A, and added a sentence in the results, line 103: 

      “When presenting videos in this paper we normalize the mean and standard deviation of the reconstructions to the average and standard deviation of the corresponding ground truth movie before applying the evaluation masks, but this is not done for quantification except in Supplementary Figure 3D.”
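
      The matching step quoted above can be sketched as follows. This is a minimal illustration, assuming videos are stored as NumPy arrays of shape (frames, height, width); the names `match_luminance_contrast` and `pearson` are hypothetical and not taken from the authors' code, and the evaluation masks are left out. It also illustrates why a correlation-based metric is insensitive to this adjustment: Pearson correlation is unchanged by a positive rescaling plus an offset.

```python
import numpy as np

def match_luminance_contrast(recon, truth, eps=1e-8):
    """Rescale recon so its mean (luminance) and standard deviation (contrast),
    computed over the whole video, match those of truth."""
    recon = np.asarray(recon, dtype=float)
    truth = np.asarray(truth, dtype=float)
    standardized = (recon - recon.mean()) / (recon.std() + eps)
    return standardized * truth.std() + truth.mean()

def pearson(a, b):
    """Pearson correlation between two arrays, computed over all pixels."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data only: a random "ground truth" video and a dim, low-contrast copy with noise.
rng = np.random.default_rng(0)
truth = rng.random((30, 36, 64))
recon = 0.2 * truth + 0.05 * rng.standard_normal(truth.shape) + 0.4

matched = match_luminance_contrast(recon, truth)
# The correlation with the ground truth is the same before and after matching.
same_score = np.isclose(pearson(recon, truth), pearson(matched, truth))
```

      Because the adjustment is an affine transform with positive gain, it changes how the frames look but not the correlation score, which is consistent with using it for display only.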

      We were also initially surprised that contrast and luminance are not captured well by our reconstruction method, but this makes sense as V1 is largely luminance invariant (O’Shea et al., 2025 https://doi.org/10.1016/j.celrep.2024.115217 ) and contrast only has a gain effect on V1 activity (Tring et al., 2024 https://journals.physiology.org/doi/full/10.1152/jn.00336.2024). Decoding absolute contrast is likely unreliable because it is probably not the only factor modulating the overall gain of the neural population.

      To address the reviewer’s comment more fully, we ran additional experiments. More specifically, to test why contrast and luminance are not recovered in the reconstructions, we checked how the predicted activity between the reconstruction and the contrast/luminance corrected reconstructions differs. Contrast and luminance adjustment had little impact on predicted response similarity on average. This makes the reconstruction optimization loss function insensitive to overall contrast and luminance so it cannot be decoded. There is a small effect on activity correlation, however, so we cannot completely rule out that contrast and luminance could be reconstructed with a different loss function. 

      (2) The authors attempted to investigate the variability in reconstruction quality across different movies and 10-second snippets of a movie by correlating various visual features, such as video motion energy, contrast, luminance, and behavioral factors like running speed, pupil diameter, and eye movement, with reconstruction success. However, it would also be beneficial if the authors correlated the response loss (Poisson loss between neural responses) with reconstruction quality (video correlation) for individual videos, as these metrics are expected to be correlated if the reconstruction captures neural variance. 

      We thank the reviewer for this suggestion. We have now included this analysis and find that if the neural activity was better predicted by the DNEM then the reconstruction of the video was also more similar to the ground truth video. We further found that this effect is shift-dependent (in time), meaning the prediction of activity based on proximal video frames is more influential on reconstruction performance. 
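
      The analysis described above could be sketched roughly as below. This is an assumption-laden illustration rather than the authors' implementation: `poisson_loss` and `loss_quality_correlation` are hypothetical helper names, the Poisson loss omits the term that does not depend on the prediction, and all numbers are toy values chosen only to exercise the functions. The time-shift dependence is not covered.

```python
import numpy as np

def poisson_loss(pred, obs, eps=1e-8):
    """Mean Poisson negative log-likelihood of obs given predicted rates pred,
    dropping the log(obs!) term, which does not depend on pred."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    return float(np.mean(pred - obs * np.log(pred + eps)))

def loss_quality_correlation(losses, video_corrs):
    """Pearson correlation across videos between response loss and video correlation."""
    return float(np.corrcoef(losses, video_corrs)[0, 1])

# Toy, made-up values purely for illustration; these are not results from the paper.
obs = np.array([1.0, 2.0, 3.0, 4.0])       # observed responses for one toy video
scales = [1.0, 1.2, 1.5, 2.0, 3.0]         # progressively worse predictions
losses = [poisson_loss(s * obs, obs) for s in scales]
toy_video_corrs = [0.9, 0.8, 0.7, 0.6, 0.5]  # hypothetical reconstruction quality

r = loss_quality_correlation(losses, toy_video_corrs)
```

      In this toy setup a worse activity prediction (higher loss) is paired with a lower video correlation, so `r` is negative, which is the direction of the relationship reported above.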

      Reviewer #3 (Recommendations for the authors): 

      (1) I was confused about the choice of applying a transparency mask thresholded with alpha>0.5 during training and alpha>1 during evaluation. Why treat the two situations differently? Also, shouldn't we expect alpha to be in the [0,1] range, in which case, what is the meaning of alpha>1? (And finally, as already described in "Weaknesses", how does this choice impact the comparison with prior experiments? Did they also employ a similar masking strategy?) 

      We found that applying a mask during training increased performance regardless of the size of the evaluation mask. Using a less stringent mask during training than during evaluation increases performance slightly, but also allows inspection of the reconstruction in areas where the model will be less confident without sacrificing performance, if this is desired. The thresholds of 0.5 and 1 were chosen through trial and error, but the exact values do not hold intrinsic meaning. The alpha mask values can go above 1 during their optimization. We could have clipped alpha during the training procedure (algorithm 1), but we decided this was not worth redoing at this stage, as the alphas used for testing were not above 1. All reconstruction approaches in previous publications limit the field of view in some form, whether this is due to the size of the screen, the size of the image on the screen, or the cropping of the presented/reconstructed images during analysis. 

      To address the reviewer’s comment in detail, we have added extensive additional analysis to evaluate the coverage of the reconstruction achieved in this paper and how different masking strategies affect performance, as well as how the mask relates to more traditional receptive field mapping.  

      (2) I would not use the word "imagery" in the first sentence of the abstract, because this might be interpreted by some readers as reconstruction of mental imagery, a very distinct question. 

      We changed imagery to images in the abstract.

      (3) Line 145-146: "<1 frame, or <30Hz" should be "<1 frame, or >30Hz". 

      We have corrected the error.

      (4) Algorithm 1, Line 5, a subscript variable 'g' should be changed to 'h'

      We have corrected the error.

      Additional Changes

      (1) Minor grammatical errors

      (2) Addition of citations: We were previously not aware of a bioRxiv preprint from 2022 (Cobos et al., 2022), which used gradient descent-based input optimization to reconstruct static images but without the addition of a diffusion model. Instead, we had cited for this method Pierzchlewicz et al., 2023 bioRxiv/NeurIPS. In Cobos et al., 2022, they compare static image reconstruction similarity to ground truth images and the similarity of the in vivo evoked activity across multiple reconstruction methods. Performance values are only given for reconstructions from trial-averaged responses across ~40 trials (in the absence of original data or code we are also not able to retrospectively calculate single-trial performance). The authors find that optimizing for evoked activity rather than image similarity produces image reconstructions that evoke more similar in vivo responses compared to reconstructions optimized for image similarity itself. We have now added and discussed the citation in the main text. 

      (3) Workaround for error in the open-source code from https://github.com/lRomul/sensorium for video hashing function in the SOTA DNEM: By checking the most correlated first frame for each reconstructed movie, we discovered there was a bug in the open-source code and 9/50 movies we originally used for reconstruction were not properly excluded from the training data between DNEM instances. The reason for this error was that some of the movies differ by only a few pixels, and the video hashing function used to split training and test set folds in the original DNEM code classified these movies as different and split them across folds. We have replaced these 9 movies and provide a figure below showing the next closest first frame for every movie clip we reconstruct. This does not affect our claims. Excluding these 9 movie clips did not affect the reconstruction performance (video correlation went from 0.563 to 0.568), so there was no overestimation of performance due to test set contamination. However, they should still be removed, so some of the values in the paper have changed slightly. The only statistical test that was affected was the correlation between video correlation and mean motion energy (Supplementary Figure 4A), which went from p = 0.043 to 0.071. 

      Author response image 2.

      Exclusion of movie clips with duplicates in the DNEM training data. A) Example frame of a reconstructed movie (ground truth) and the most correlated first frame from the training data. B) All movie clips and their corresponding most correlated clip from the training data. Red boxes indicate excluded duplicates. 
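
      The duplicate check described above (finding the most correlated training first frame for each reconstructed clip) can be sketched as follows. This is a minimal illustration under stated assumptions: frames are NumPy arrays, the function name `most_correlated_training_frame` and the 0.99 threshold are hypothetical, and the data are random toy frames rather than the actual movies.

```python
import numpy as np

def most_correlated_training_frame(test_first_frames, train_first_frames):
    """For each test clip, return the index of the training first frame with the
    highest Pearson correlation, together with that correlation."""
    def center_and_normalize(frames):
        # Flatten each frame, subtract its mean, and scale to unit length.
        # Assumes no frame is perfectly constant (which would divide by zero).
        flat = np.asarray(frames, dtype=float).reshape(len(frames), -1)
        flat = flat - flat.mean(axis=1, keepdims=True)
        return flat / np.linalg.norm(flat, axis=1, keepdims=True)

    corr = center_and_normalize(test_first_frames) @ center_and_normalize(train_first_frames).T
    best = corr.argmax(axis=1)
    return best, corr[np.arange(len(best)), best]

# Toy data: 20 training frames and 3 test frames, one of which is a near-duplicate.
rng = np.random.default_rng(0)
train = rng.random((20, 36, 64))
test = rng.random((3, 36, 64))
test[1] = train[7]
test[1, 0, :3] += 0.01  # differs by a few pixels, so an exact content hash would not match

best, r = most_correlated_training_frame(test, train)
flagged = r > 0.99  # hypothetical threshold for "near-duplicate"
```

      A correlation-based comparison of this kind flags clips that differ by only a few pixels, which an exact hash of the video content would classify as different.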

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      1. General Statements

      We thank the reviewers for their overall support, thorough review, and thoughtful comments. The points raised were all warranted and we feel that addressing them has improved the quality of our manuscript. Below we respond to each of the points raised.

      2. Point-by-point description of the revisions

      Reviewer #1

      Minor comments:

      Are the lgl-1; pac-1 M-Z- double mutants dead? Only the phenotype of pac-1(M-Z-); lgl-1 (M+Z-) is shown. In figures and text throughout, it should be clear whether mutants are referring to zygotic loss or both maternal and zygotic loss, as this distinction could have major implications on the interpretation of experiments.

      Almost all experiments we performed used a combination of RNAi of lgl-1 in a homozygous pac-1 null mutant background, or the other way around. RNAi should eliminate maternal product, but we hesitate to use the terminology M/Z since it has previously been used for protein degradation strategies.

      We have updated the text and figure 1 to address the potential of maternal product masking earlier phenotypes, and performed additional RNAi experiments to demonstrate that the phenotypes obtained by RNAi for either pac-1 or lgl-1 in a homozygous mutant background for the other are the same as for the genetic double mutant. The results are shown as additional images and quantifications in figure 1B,C. We also updated the legend to figure 1 to make it clear that double genetic mutants are obtained from heterozygous lgl-1/+ parents.

      Regarding the phenotype of lgl-1; pac-1 M-Z- double mutants: assuming the reviewer refers to M-Z- double genetic mutants, we cannot make such embryos as the pac-1(M-Z-); lgl-1(M+Z-) animals are already lethal.

      In Figure 1C, it would be more appropriate to show a fully elongated WT embryo to contrast with arrested elongation in mutant embryos.

      We agree with the reviewer and have replaced the 2-fold WT embryo with a 3-fold embryo.

      Is the lateral spread of DLG-1 in double mutant embryos a result of failure to polarize DLG-1, or failure to maintain polarity? This should be straightforward to address in higher time resolution movies.

      We have analyzed additional embryos at early stages of development. In lgl-1; pac-1 embryos we never see the appearance of complete junctions: defects are apparent already at dorsal intercalation. We interpret these results as a failure to properly polarize DLG-1. We have added additional images to Figure S2 and added this sentence to the text: Imaging of embryos from early stages of development onward showed that normal continuous junctional DLG-1 bands are never established in pac-1(RNAi); lgl-1(mib201) embryos (Fig. S2B).

      The lack of enhancement of hmp-1(fe4) by lgl-1(RNAi) is quite interesting, given that pac-1 does enhance hmp-1(fe4). To rule out the possibility that this result stems from incomplete lgl-1 RNAi, this experiment should be repeated using the lgl-1 null mutant.

      We have done this experiment by recreating the fe4 S823F mutation in the lgl-1(null) mutant background as well as in the wild-type CGC1 background using CRISPR/Cas9. The phenotype of both was similar, but differs from that of the original PE97 strain. In the original strain, there is ~50% embryonic lethality but worms that complete embryogenesis grow up to be fertile adults. In our new "fe4" strains, nearly all animals are severely malformed with little to no elongation taking place. We are able to maintain both strains (with and without lgl-1) homozygous but with difficulty as only ~5% of animals grow up and give progeny. Apparently, there are genetic differences between PE97 and our CGC1 background that cause phenotypic differences despite having the same amino acid change in HMP-1.

      Nevertheless, using our original embryonic viability criterion of 'hatching', loss of lgl-1 does not enhance the S823F mutation. We have included the following text in the manuscript:

      To rule out that the lack of enhancement by lgl-1(RNAi) is due to incomplete inactivation of lgl-1, we also re-created the hmp-1(fe4) mutation (S823F) by CRISPR in lgl-1(mib201) mutant animals and wild-type controls. The phenotype of the S823F mutant we created is more severe than that of the original PE97 hmp-1(fe4) strain, with only ~5% of animals becoming fertile adults (Fig. S2F). This likely represents the presence of compensatory changes that have accumulated over time in PE97. Nevertheless, consistent with our RNAi results, the presence of lgl-1(mib201) did not further exacerbate the phenotype of HMP-1(S823F) (Fig. S2E, F). Taken together, the lack of enhancement of hmp-1(S823F) mutants by loss of lgl-1 argues against a primary role for lgl-1 in regulating cell junctions.

      • Related to point 4, do pac-1 or lgl-1 null mutants enhance partial knockdown of junction protein DLG-1, or is this effect (of pac-1) specific to HMP-1/AJs?

      We have attempted to address this point using feeding RNAi against dlg-1. However, we were not able to obtain partial depletion of DLG-1. On RNAi feeding plates, control, pac-1, and lgl-1 animals did not show significant embryonic lethality. We checked RNAi effectiveness with a DLG-1::mCherry strain and found RNAi by feeding to be very ineffective. Since we could not deplete DLG-1 to a level that results in partial embryonic lethality, we were not able to address this question properly.

      Does lgl-1 loss affect PAC-1 protein localization and vice versa?

      It does not. We have added the following text and a figure panel: Loss-of-function mutants that strongly enhance a phenotype are often interpreted as acting in parallel pathways. We therefore examined whether loss of lgl-1 or pac-1 alters the localization of endogenously GFP-tagged LGL-1 or PAC-1. In neither null background did we detect changes in the subcellular localization of the other protein, consistent with LGL-1 and PAC-1 functioning in parallel pathways (Fig. S1D).

      Reviewer #2

      Very little of the imaging data are analyzed quantitatively, and in many cases it is not clear how many embryos were analyzed. While the images that are presented show clear defects, readers cannot determine how reproducible, strong or significant the phenotypes are.

      We completely agree with the reviewer that interpretation of our data requires this information and apologize for the omission in the first manuscript version. The phenotypes are highly penetrant and consistent (timing of arrest, % lethality, junctional defects), and we have now added quantifications throughout the manuscript.

      In particular, the data below should be quantified and, where possible, analyzed statistically:

      • The frequency of the various junctional phenotypes shown in 2C

      We have now quantified the junctional phenotypes. The junctional defects are highly penetrant: >90% of lgl-1; pac-1 embryos have junctional defects (new Fig. 2B). We used airy-scan confocal imaging to analyze the distribution of the different phenotypes (unaffected, spread laterally, and ring-like pattern). The results are shown in Fig. 2G.

      • The expansion of DLG-1::mCherry in pac-1 lgl-1 embryos should be quantified (related to Figure 2B). For example, the percentage of membrane (marked by PH::GFP) occupied by DLG-1 could be quantified.

      We have performed this quantification, shown in Fig. 2D.

      • Similarly, the expansion of the aPKC domain should be quantified (Figure 3A).

      An objective quantification of aPKC signal is difficult due to the relatively weak expression of aPKC::GFP and the lack of a clear demarcating boundary. This is part of the reason we measured tortuosity as a more quantifiable indicator of apical domain expansion. We have now added a qualitative observation table as Figure 3B. In addition, we have expanded the quantification of cell geometry by measuring lateral and basal surfaces. Lateral surfaces were decreased. We added the following text:

      To better understand the reason for the change in geometry, we also measured the lengths of the lateral and basal surfaces (Fig. 3F). We found that the absolute lengths of the apical surfaces were not significantly different between pac-1(RNAi); lgl-1(mib201) and control animals. Instead, the lengths of the lateral domain were reduced (Fig. 3F). Hence, the more dome-shaped appearance of epidermal cells in pac-1; lgl-1 double mutant animals is due to a decrease in lateral domain size, which is consistent with the observed lateral spreading of aPKC.

      • How many embryos were analyzed for each marker shown in Figure 2A, and what proportion showed the described phenotypes? This could be given in the text or in a panel.

      We have added these numbers to panel 2B, and indicated the percentage in the text.

      • The frequency of the various junctional phenotypes shown in 4F.

      To address this, we have changed figure 4F to show three types of phenotype (strong, mild, no phenotype) and added how frequently we observed each to the panels. In rescue experiments, 18/24 embryos showed no junctional defects, while 6/24 showed a mild defect (compared to 100% severe in non-rescued embryos). To make room for this and other quantifications in Figure 4, we moved the demonstration that PAC-1 is depleted by RNAi to supplemental figure S4.

      Because the genetic perturbations used are global (either deletions or RNAi), it is not established whether PAC-1/LGL-1 act in epidermal epithelial cells per se (versus an earlier requirement that manifests in epidermal epithelial cells). While I agree that this is the most likely scenario, other mechanisms are possible.

      Our experiments indeed use global depletion/deletion of lgl-1 and pac-1. We therefore cannot exclude that other tissues contribute to the epithelial phenotypes. We assume that other tissues would be affected as well, and in fact have observed abnormal-looking pharynx tissue (see our response to reviewer 3 below for examples). As the epidermis is one of the first tissues to develop, it is likely the first in which phenotypes become apparent.

      In particular, the overall GFP::aPKC levels appear notably higher in pac-1 lgl-1 embryos in Figure 3A. aPKC levels should be quantified to determine if this is true of pac-1 lgl-1 embryos. If so, couldn't that explain (or at least contribute to) the observed phenotypes?

      Overall higher levels could indeed contribute to the phenotype. However, we have now quantified total aPKC levels in control and pac-1; lgl-1 embryos and found no difference between them. We have added the following text to the manuscript: To determine if increased expression of aPKC might explain the broadened apical localization, we measured total intensity levels of aPKC::GFP. However, we detected no differences in fluorescence levels between control and pac-1(RNAi); lgl-1(mib201) animals (Fig. S3B, C).

      Minor

      Figure 4: For completeness, please include the embryonic viability of pac-1 lgl-1 +/- embryos treated with EV and cdc-42(RNAi), as was done for pac-1 lgl-1 pkc-3(ts) in Figure 4E. Presumably the increased proportion of viable embryos with the lgl-1 deletion allele is reflected in an overall increase in embryonic viability.

      The embryonic viability indeed increases, but not as much as one might think because 15% of embryos die from the cdc-42 RNAi itself. The most important rescue argument is that we can obtain adult pac-1; lgl-1 animals with cdc-42 RNAi.

      We have now included the overall rescue and the following text: Overall, cdc-42 RNAi caused a mild increase in embryonic viability (Fig. 4A). However, total embryonic viability may underestimate rescue of pac-1; lgl-1 embryonic lethality, because it also includes the ~15% lethality caused by cdc-42 inactivation itself, even among animals wild type for lgl-1.

      The orientation of the inset images in Figures 2C, 3A and 3D is confusing. An illustration showing how these images are oriented relative to each other would be helpful.

      We have added a figure showing how the junctions are oriented in the figures (Fig. 2E). We have also added supplemental videos S3 and S4 that should illustrate the phenotype more clearly as well.

      For completeness, it would be good to test whether lgl-1(delta) is also synthetically lethal with picc-1(RNAi) (Zilberman 2017).

      We like this idea and had already looked into this. Lgl-1 and picc-1 are not synthetic lethal (see graph in word file submitted). However, PICC-1 is not the only junctional localization signal for PAC-1, as demonstrated by the Nance lab. We find the data interesting but feel that it deserves a more thorough structure/function investigation of PAC-1 than we can provide here. Therefore we would prefer not to include this data.

      Reviewer #3

      We thank the reviewer for their support of our manuscript.

      A few small areas to improve this manuscript:

      p. 6 line 139: "remain" should be "remaining"

      We have fixed this typo.

      Could the authors mention what is the phenotype of the 10% of pac-1 animals that die?

      Yes. They die with pleiotropic phenotypes not resembling those of our pac-1; lgl-1 double mutant embryos. We have added examples of these to Figure S1.

      Based on the Supplemental figures, it made me curious to ask: Did the authors notice changes in dorsal epidermal fusions? Cadherin normally disappears in the dorsal hyp7 cells at this time. Did the timing of the fusions change at all?

      We haven't analyzed this in detail but our time-lapse videos show that dorsal fusions still take place and do not seem to be particularly delayed (overall development is slightly delayed but the delay in fusion is consistent with overall delay).

      Again, curiosity driven by the Supplemental figures: did the authors notice defects in apical regions of internal organs, like the pharynx or intestine? The CDC-42 biosensor is asymmetrical in the developing intestine. See: DOI: 10.1242/bio.056911

      We did not pay much attention to the intestine as PAC-1 is barely detectable in this tissue. The pharynx is formed, which we can easily detect in arrested embryos as we use GFP or BFP expressed under the myo-2 promoter to mark the deletion of pac-1. While we did not look closely, we do observe defects in pharynx development.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review)

      Major:

      (1) In line 76, the authors make a very powerful statement: 'σRNN simulation achieves higher similarity with unseen recorded trials before perturbation, but lower than the bioRNN on perturbed trials.' I couldn't find a figure showing this. This might be buried somewhere and, in my opinion, deserves some spotlight - maybe a figure or even inclusion in the abstract.

      We agree with the reviewer that these results are important. The failure of σRNN on perturbed data could be inferred from the former Figures 1E, 2C-E, and 3D. Following the reviewers' comments, we have tried to make this the most prominent message of Figure 1, in particular with the addition of the new panel E. We also moved Table 1 from the Supplementary to the main text to highlight this quantitatively. 

      (2) It's mentioned in the introduction (line 84) and elsewhere (e.g., line 259) that spiking has some advantage, but I don't see any figure supporting this claim. In fact, spiking seems not to matter (Figure 2C, E). Please clarify how spiking improves performance, and if it does not, acknowledge that. Relatedly, in line 246, the authors state that 'spiking is a better metric but not significant' when discussing simulations. Either remove this statement and assume spiking is not relevant, or increase the number of simulations.

      We could not find the exact quote from the reviewer, and we believe that he intended to quote “spiking is better on all metrics, but without significant margins”. Indeed, spiking did not improve the fit significantly on perturbed trials; this is particularly true in comparison with the benefits of Dale’s law and local inhibition. As suggested by the reviewer, we rephrased the sentence from this quote and more generally the corresponding paragraphs in the intro (lines 83-87) and in the results (lines 245-271). Our corrections in the results sections are also intended to address the minor point (4) raised by the same reviewer.

      (3) The authors prefer the metric of predicting hits over MSE, especially when looking at real data (Figure 3). I would bring the supplementary results into the main figures, as both metrics are very nicely complementary. Relatedly, why not add Pearson correlation or R2, and not just focus on MSE Loss?

      In Figure 3 for the in-vivo data, we do not have simultaneous electrophysiological recordings and optogenetic stimulation in this dataset. The two are performed on different recording sessions. Therefore, we can only compare the effect of optogenetics on the behavior, and we cannot compute Pearson correlation or R2 of the perturbed network activity. To avoid ambiguity, we wrote “For the sessions of the in vivo dataset with optogenetic perturbation that we considered, only the behavior of an animal is recorded” on line 294. 

      (4) I really like the 'forward-looking' experiment in closed loop! But I felt that the relevance of micro perturbations is very unclear in the intro and results. This could be better motivated: why should an experimentalist care about this forward-looking experiment? Why exactly do we care about micro perturbation (e.g., in contrast to non-micro perturbation)? Relatedly, I would try to explain this in the intro without resorting to technical jargon like 'gradients'.

      As suggested, we updated the last paragraph of the introduction (lines 88 - 95) to give better motivation for why algorithmically targeted acute spatio-temporal perturbations can be important to dissect the function of neural circuits. We also added citations to recent studies with targeted in vivo optogenetic stimulation. As far as we know the existing previous work targeted network stimulation mostly using linear models, while we used non-linear RNNs and their gradients.

      Minor:

      (1) In the intro, the authors refer to 'the field' twice. Personally, I find this term odd. I would opt for something like 'in neuroscience'.

      We implemented the suggested change: l.27 and l.30

      (2) Line 45: When referring to previous work using data-constrained RNN models, Valente et al. is missing (though it is well cited later when discussing regularization through low-rank constraints)

      We added the citation: l.45

      (3) Line 11: Method should be methods (missing an 's').

      We fixed the typo.

      (4) In line 250, starting with 'So far', is a strange choice of presentation order. After interpreting the results for other biological ingredients, the authors introduce a new one. I would first introduce all ingredients and then interpret. It's telling that the authors jump back to 2B after discussing 2C.

      We restructured the last two paragraphs of section 2.1, and we hope that the presentation order is now more logical.

      (5) The black dots in Figure 3E are not explained, or at least I couldn't find an explanation.

      We added an explanation in the caption of Figure 3E.

      Reviewer #2 (Public review):

      (1) Some aspects of the methods are unclear. For comparisons between recurrent networks trained from randomly initialized weights, I would expect that many initializations were made for each model variant to be compared, and that the performance characteristics are constructed by aggregating over networks trained from multiple random initializations. I could not tell from the methods whether this was done or how many models were aggregated.

      The expectation of the reviewer is correct: we trained multiple models with different random seeds (affecting both the weight initialization and the noise of our model) for each variant and aggregated the results. We have now clarified this in Methods 4.6, lines 658-662.

      (2) It is possible that including perturbation trials in the training sets would improve model performance across conditions, including held-out (untrained) perturbations (for instance, to units that had not been perturbed during training). It could be noted that if perturbations are available, their use may alleviate some of the design decisions that are evaluated here.

      In general, we agree with the reviewer that including perturbation trials in the training set would likely improve model performance across conditions. One practical limitation that partially explains why we did not do so with our dataset is the small number of perturbed trials for each targeted cortical area: there are too few trials with light perturbations to robustly train and test our models.

      More profoundly, to test hard generalizations to perturbations (aka perturbation testing), it will always be necessary that the perturbations are not trivially represented in the training data. Including perturbation trials during training would compromise our main finding: some biological model constraints improve the generalization to perturbation. To test this claim, it was necessary to keep the perturbations out of the training data.

      We agree that including all available data of perturbed and non-perturbed recordings would be useful to build the best generalist predictive system. It could help, for instance, for closed-loop circuit control as we studied in Figure 5. Yet, there too, it will be important for the scientific validation process to always keep some causal perturbations of interest out of the training set. This is necessary to fairly measure the real generalization capability of any model. Importantly, this is why we think out-of-distribution “perturbation testing” is likely to have a recurring impact in the years to come, even beyond the case of optogenetic inactivation studied in detail in our paper.

      Recommendation for the authors:

      Reviewer #1 (Recommendation for the authors):

      The code is not very easy to follow. I know this is a lot to ask, but maybe make clear where the code is to train the different models, which I think is a great contribution of this work? I predict that many readers will want to use the code and so this will improve the impact of this work.

      We updated the code to make it easier to train a model from scratch.

      Reviewer #2 (Recommendation for the authors):

      The figures are really tough to read. Some of that small font should be sized up, and it's tough to tell in the posted paper what's happening in Figure 2B.

      We updated Figures 1 and 2 significantly, in part to increase their readability. We also implemented the "Superficialities" suggestions.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This Review Article explores the intricate relationship between humans and Mycobacterium tuberculosis (Mtb), providing an additional perspective on TB disease. Specifically, this review focuses on the utilization of systems-level approaches to study TB, while highlighting challenges in the frameworks used to identify the relevant immunologic signals that may explain the clinical spectrum of disease. The work could be further enhanced by better defining key terms that anchor the review, such as "unified mechanism" and "immunological route." This review will be of interest to immunologists as well as those interested in evolution and host-pathogen interactions.

      We thank the editors for reviewing our article and for the primarily positive comments. We accept that better definition and terminology will improve the clarity of the message, and so have changed the wording as suggested above in the revised manuscript.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This is an interesting and useful review highlighting the complex pathways through which pulmonary colonisation or infection with Mycobacterium tuberculosis (Mtb) may progress to develop symptomatic disease and transmit the pathogen. I found the section on immune correlates associated with individuals who have clearly been exposed to and reacted to Mtb but did not develop latent infections particularly valuable. However, several aspects would benefit from clarification.

      Strengths:

      The main strengths lie in the arguments presented for a multiplicity of immune pathways to TB disease.

      Weaknesses:

      The main weaknesses lie in clarity, particularly in the precise meanings of the three figures.

      We accept this point, and have completely changed figure 2 and expanded the legends for figures 1 and 3 to maximise clarity.

      I accept that there is a 'goldilocks zone' that underpins the majority of TB cases we see and predominantly reflects different patterns of immune response, but the analogies used need to be more clearly thought through.

      We are glad the reviewer agrees with the fundamental argument of different patterns of immunity, and have revised the manuscript throughout where we feel the analogies could be clarified.

      Reviewer #2 (Public review):

      Summary:

      This is a thought-provoking perspective by Reichmann et al, outlining supportive evidence that Mycobacterium tuberculosis co-evolved with its host Homo sapiens to both increase susceptibility to infection and reduce rates of fatal disease through decreased virulence. TB is an ancient disease where two modes of virulence are likely to have evolved through different stages of human evolution: one before the Neolithic Demographic Transition, where humans lived in sparse hunter-gatherer communities, which likely selected for prolonged Mtb infection with reduced virulence to allow for transmission across sparse populations. Conversely, following the agricultural and industrial revolutions, Mtb virulence is likely to have evolved to attack a higher number of susceptible individuals. These different disease modalities highlight the central idea that there are different immunological routes to TB disease, which converge on a disease phenotype characterized by high bacterial load and destruction of the extracellular matrix. The writing is very clear and provides a lot of supportive evidence from population studies and the recent clinical trials of novel TB vaccines, like M72 and H56. However, there are areas to support the thesis that have been described only in broad strokes, including the impact of host and Mtb genetic heterogeneity on this selection, and the alternative model that there are likely different TB diseases (as opposed to different routes to the same disease), as described by several groups advancing the concept of heterogeneous TB endotypes. I expand on specific points below.

      Strengths:

      The idea that Mtb evolved to both increase transmission (and possible commensalism with humans) with low rates of reactivation is intriguing. The heterogeneous TB phenotypes in the collaborative cross model (PMID: 35112666) support this idea, where some genetic backgrounds can tolerate a high bacterial load with minimal pathology, while others show signs of pathogenesis with low bacterial loads. This supports the idea that the underlying host state, driven by a number of factors like genetics and nutrition, is likely to explain whether someone will co-exist with Mtb without pathology, or progress to disease. I particularly enjoyed the discussion of the protective advantages provided by Mtb infection, which may have rewired the human immune system to provide protection against heterologous pathogens; this is supported by recent studies showing that Mtb infection provides moderate protection against SARS-CoV-2 (PMID: 35325013, and 37720210), and may have applied to other viruses that are likely to have played a more significant role in the past in the natural selection of Homo sapiens.

      We thank the reviewer for their positive comments, and also for pointing out work that we had previously overlooked. We now discuss and cite the work above as suggested.

      Modeling from Marcel Behr and colleagues (PMID: 31649096) indeed suggests that there are at least two TB clinical phenotypes that likely mirror the two distinct phases of Mtb co-evolution with humans. Most of the TB disease progression occurs rapidly (within 1-2 years of exposure), and the rest are slow cases of reactivation over time. I enjoyed the discussion of the difference between the types of immune hits needed to progress to disease in the two scenarios, where you may need severe immune hits for rapid progression, a phenotype that likely evolved after the Neolithic transition to larger human populations. On the other hand, a series of milder immune events leading to reactivation after a long period of asymptomatic infection likely mirrors slow progression in the hunter-gatherer communities, to allow for prolonged transmission in scarce populations. Perhaps a clearer analysis of these models would be helpful for the reader.

      We agree that we did not present these concepts in as much detail as we should have, and so we now discuss this further on lines 81 – 83 and 184 – 187.

      Weaknesses:

      The discussion of genetic heterogeneity is limited and only discusses evidence from MSMD studies. Genetics is an important angle to consider in the co-evolution of Mtb and humans. There is a large body of literature on both host and Mtb genetic associations with TB disease. The very fact that host variants in one population do not necessarily cross-validate across populations is evidence in support of population-specific adaptations. Specific Mtb lineages are likely to have co-evolved with distinct human populations. A key reference is missing (PMID: 23995134), which shows that different lineages co-evolved with human migrations. Also, meta-analyses of human GWAS studies to define variants associated with TB are very relevant to the topic of co-evolution (e.g., PMID: 38224499). eQTL studies can also highlight genetic variants associated with regulating key immune genes involved in the response to TB. The authors do mention that Mtb itself is relatively clonal with ~2K SNPs marking Mtb variation, much of which has likely evolved under the selection pressure of modern antibiotics. However, some of this limited universe of variants can still explain co-adaptations between distinct Mtb lineages and different human populations, as shown recently in the co-evolution of lineage 2 with a variant common in Peruvians (PMID: 39613754).

      We thank the reviewer for these comments and agree we failed to cite and discuss the work from Sebastian Gagneux’s group on co-migration, which we now discuss. We include a new paragraph discussing co-evolution as suggested on lines 145 – 155 and 218 – 220, citing the work proposed, which we agree enhances the arguments about co-evolution.

      Although the examples of anti-TNF and anti-PD1 treatments are relevant as drivers of TB in limited clinical contexts, the bigger picture is that they highlight major distinct disease endotypes. These restricted examples show that TB can be driven by immune deficiency (as in the case of anti-TNF, HIV, and malnutrition) or hyperactivation (as in the case of anti-PD1 treatment), but there are still certainly many other routes leading to immune suppression or hyperactivation. Considering the idea of hyper-activation as a TB driver, the apparent higher rate of recurrence in the H56 trial referenced in the review is likely due to immune hyperactivation, especially in the context of residual bacteria in the lung. These different TB manifestations (immune suppression vs immune hyperactivation) mirror TB endotypes described by DiNardo et al (PMID: 35169026) from analysis of extensive transcriptomic data, which indicate that it's not merely different routes leading to the same final endpoint of clinical disease, but rather multiple different disease endpoints. A similar scenario is shown in the transcriptomic signatures underlying disease progression in BCG-vaccinated infants, where two distinct clusters mirrored the hyperactivation and immune suppression phenotypes (PMID: 27183822). A discussion of how to think about translating the extensive information from system biology into treatment stratification approaches, or adjunct host-directed therapies, would be helpful.

      We agree with the points made and that the two publications above further enhance the paper. We have added discussion of the different disease endpoints on lines 65 – 67, the evidence regarding immune hyperactivation versus suppression in the vaccination study on lines 162 – 164, and expanded on the translational implications on lines 349 – 352.

      Reviewer #3 (Public review):

      Summary:

      This perspective article by Reichmann et al. highlights the importance of moving beyond the search for a single, unified immune mechanism to explain host-Mtb interactions. Drawing from studies in immune profiling, host and bacterial genetics, the authors emphasize inconsistencies in the literature and argue for broader, more integrative models. Overall, the article is thought-provoking and well-articulated, raising a concept that is worth further exploration in the TB field.

      Strengths:

      Timely and relevant in the context of the rapidly expanding multi-omics datasets that provide unprecedented insights into host-Mtb interactions.

      Weaknesses (Minor):

      Clarity on the notion of a "unified mechanism". It remains unclear whether prior studies explicitly proposed a single unifying immunological model. While inconsistencies in findings exist, they do not necessarily demonstrate that earlier work was uniformly "single-minded". Moreover, heterogeneity in TB has been recognized previously (PMIDs: 19855401, 28736436), which the authors could acknowledge.

      We accept this point and have toned down the language, acknowledging that we are expanding on an argument that others have made, whilst focusing on the implications for the systems immunology era, and cite the previous work as suggested.

      Evolutionary timeline and industrial-era framing. The evolutionary model is outdated. Ancient DNA studies place the Mtb's most recent common ancestor at ~6,000 years BP (PMIDs: 25141181; 25848958). The Industrial Revolution is cited as a driver of TB expansion, but this remains speculative without bacterial-genomics evidence and should be framed as a hypothesis. Additionally, the claim that Mtb genomes have been conserved only since the Industrial Revolution (lines 165-167) is inaccurate; conservation extends back to the MRCA (PMID: 31448322).

      Our understanding is that the evolutionary timeline is not fully resolved, with conflicting evidence proposing different dates. The ancient DNA studies giving a timeline of 6,000 years seem to conflict with the evidence of Mtb infection of humans in the Middle East 10,000 years ago, and with other estimates suggesting 70,000 years. Therefore, we have cited the work above and added a sentence highlighting that different studies propose different timelines. We would propose that the Industrial Revolution created the ideal societal conditions for the expansion of TB, and this would seem widely accepted in the field, but have added a proviso as suggested. We did not intend to claim that Mtb genomes have been conserved only since the Industrial Revolution; the point we were making is that, despite rapid expansion within human populations, the genome has still remained conserved. We have therefore revised our discussion of the conservation of the Mtb genome on lines 72 – 74, 81 – 83 and 185 – 190.

      Trained immunity and TB infection. The treatment of trained immunity is incomplete. While BCG vaccination is known to induce trained immunity (ref 59), revaccination does not provide sustained protection (ref 8), and importantly, Mtb infection itself can also impart trained immunity (PMID: 33125891). Including these nuances would strengthen the discussion.

      We have refined this section. We did cite PMID: 33125891 in the original submission but have changed the wording to emphasise the point on line …

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Abstract

      Line 30: What is an immunological route? Suggest

      “...host-pathogen interaction, with diverse immunological processes leading to TB disease (10%) or stable lifelong association or elimination. We suggest these alternate relationships result from the prolonged co-evolution of the pathogen with humans and may even confer a survival advantage in the 90% of exposures that do not progress to disease.”

      Thank you, we have reworded the abstract along the lines suggested above, but not identically to allow for other reviewer comments.

      Introduction

      Ln 43: It is misleading to suggest that the study of TB was the leading influence in establishing the Koch's postulates framework. Many other infections were involved, and Jacob Henle, one of Koch's teachers, is credited with the first clear formulation (see Evans AS. 1976, The Yale Journal of Biology and Medicine, PMID: 782050).

      We have toned down the language, stating that TB “contributed” to the formulation of Koch’s postulates.

      Ln 46: While the review rightly emphasises intracellular infection in macrophages, the importance and abundance of extracellular bacilli should not be ignored, particularly in transmission and in cavities.

      We agree, and have added text on the importance of extracellular bacteria and transmission.

      Ln 56: This is misleading as primary disease prevention is implied, whereas the vaccine was given to individuals presumed to be already infected (TST or IGRA positive). Suggest "...reduces by 50% progression to overt TB disease when given to those with immunological evidence of latent infection."

      Thank you, edit made as suggested.

      Ln 62: Not sure why it is urgent. Suggest "high priority".

      Wording changed as suggested.

      Figure 1 needs clarification. The colour scale appears to signify the strength or vigour of the immune response so that disease is associated with high (orange/red) or low (green/blue) activity. The arrows seem to imply either a sequence or a route map when all we really have is an association with a plausible mechanistic link. They might also be taken to imply a hierarchy that is not appropriate. I'm not sure that the X-rays and arrows add anything, and the rectangle provides the key information on its own. Clarify please.

      We have clarified the figure legend. We feel the X-rays give the clinical context, and so have kept them, and now state in the legend that this is highlighting that there are diverse pathways leading to active disease to try to emphasise the point the figure is illustrating.

      Ln 149-157: I agree that the current dogma is that overt pulmonary disease is required to spread Mtb and fuel disease prevalence. It is vitally important to distinguish the spread of the organism from the occurrence of disease (which does not, of itself, spread). However, both epidemiological (e.g. Ryckman TS, et al. 2022 Proc Natl Acad Sci U S A: 10.1073/pnas.2211045119) and recent mechanistic (Dinkele R, et al. 2024 iScience: 10.1016/j.isci.2024.110731; Patterson B, et al. 2024 Proc Natl Acad Sci U S A: 10.1073/pnas.2314813121; Warner DF, et al. 2025 Nat Rev Microbiol: 10.1038/s41579-025-01201-x) studies indicate the importance of asymptomatic infections, and those associated with sputum positivity have recently been recognised by WHO. I think it will be important to acknowledge the importance of this aspect and consider how immune responses may or may not contribute. I regard the view that Mtb is an obligate pathogen, dependent on overt pTB for transmission, as needing to be reviewed.

      We agree that we did not give sufficient emphasis to the emerging evidence on asymptomatic infections, and that this may play an important part in transmission in high incidence settings. We now include a discussion on this, and citation of the papers above, on lines 168 – 170.

      Ln 159: The terms colonise and colonisation are used, without a clear definition, several times. My view is that both refer to the establishment and replication of an organism on or within a host without associated damage. Where there is associated damage, this is often mediated by immune responses. In this header, I think "establishment in humanity" would be appropriate.

      We agree with this point and have changed the header as suggested, and clarified our meaning when we use the term colonisation, which the reviewer correctly interprets.

      Ln 181-: I strongly support the view that Mtb has contributed to human selection, even to the suggestion that humanity is adapted to maintain a long-term relationship with Mtb.

      Thank you, and we have expanded on this evidence as suggested by other reviewers.

      Ln 189: improved.

      Apologies, typo corrected.

      Figure 2: I was also confused by this. The x-axis does not make sense, as a single property should increase. Moreover, does incidence refer to incidence in individuals with that specific balance of resistance and susceptibility, or contribution to overall global incidence - I suspect the latter (also, prevalence would make more sense). At the same time, the legend implies that those with high resistance to colonisation will be infrequent in the population, suggesting that the Y axis should be labelled "frequency in human population". Finally, I can't see what single label could apply to the X axis. While the implication that the majority of global infections reflect a balance between the resistance and susceptibilities is indicated, a frequency distribution does not seem an appropriate representation.

      The reviewer is correct that the X axis is aiming to represent two variables, which is not logical, and so we have completely changed this figure to a simple one that we hope makes the point clearly and have amended the legend appropriately. We are aiming to highlight the selective pressures of Mtb on the human population over millennia.

      Ln 244: Immunological failure - I agree with the statement but again find the figure (3) unhelpful. Do we start or end in the middle? Is the disease the outside - if so, why are different locations implied? The notion of a maze has some value, but the bacteria should start and finish in the same place by different routes.

      We are attempting to illustrate the concept that escape from host immunological control can occur through different mechanisms. As this comment was just from one reviewer, we have left the figure unchanged but have expanded the legend to try to make the point that this is just a conceptual illustration of multiple routes to disease.

      Ln 262 onward: I broadly agree with the points made about omic technologies, but would wish to see major emphasis on clear phenotyping of cases. There is something of a contradiction in the review between the emphasis on the multiplicity of immunological processes leading ultimately to disease and the recommendation to analyse via omics, which, in their most widely applied format, bundle these complexities into analyses of the humoral and cellular samples available in blood. Admittedly, the authors point out opportunities for 3-dimensional and single-cell analyses, but it is difficult to see where these end without extrapolation ad infinitum.

      We totally agree that clear phenotyping of infection is critical, and expand on this further on lines 307 - 309.

      Reviewer #2 (Recommendations for the authors):

      I suggest expanding on the genetic determinants of Mtb/host co-evolution.

      Thank you, we have now expanded on these sections as suggested.

      Reviewer #3 (Recommendations for the authors):

      We are in an era of exploding large-scale datasets from multi-omics profiling of Mtb and host interactions, offering an unprecedented lens to understand the complexity of the host immune response to Mtb-a pathogen that has infected human populations for thousands of years. The guiding philosophy for how to interpret this tremendous volume of data and what models can be built from it will be critical. In this context, the perspective article by Reichmann et al. raises an interesting concept: to "avoid unified immune mechanisms" when attempting to understand the immunology underpinning host-Mtb interactions. To support their arguments, the authors review studies and provide evidence from immune profiling, host and bacterial genetics, and showcase several inconsistencies. Overall, this perspective article is well articulated, and the concept is worthwhile for further exploration. A few comments for consideration:

      Clarity on the notion of a "unified mechanism". Was there ever a single, clearly proposed unified immunological mechanism? For example, in lines 64-65, the authors criticize that almost all investigations into immune responses to Mtb are based on the premise that a unifying disease mechanism exists. However, after reading the article, it was not clear to me how previous studies attempted to unify the model or what that unifying mechanism was. While inconsistencies in findings certainly exist, they do not necessarily indicate that prior work was guided by a unified framework. I agree that interpreting and exploring data from a broader perspective is valuable, but I am not fully convinced that previous studies were uniformly "single-minded". In fact, the concept of heterogeneity in TB has been previously discussed (e.g., PMIDs: 19855401, 28736436).

      We accept this point, and that we have overstated the argument and not acknowledged previous work sufficiently. We now downplay the language and cite the work as proposed.

      However, we would propose that essentially all published studies imply that single mechanisms underlie development of disease. We are not aware of any manuscript that concludes “Therefore, xxxx pathway is one of several that can lead to TB disease”; instead they state “Therefore, xxxx pathway leads to TB disease”. The implication of this language is that the mechanism described occurs in all patients, whilst in fact it is likely involved in only a subset. We have toned down the language and expand on this concept on lines 268 – 270.

      Evolutionary timeline and industrial-era framing. The evolutionary model needs updating. The manuscript cites a "70,000-year" origin for Mtb, but ancient-DNA studies place the most recent common ancestor at ~6,000 years BP (PMIDs: 25141181; 25848958). The Industrial Revolution is invoked multiple times as a driver of TB expansion, yet the magnitude of its contribution remains debated and, to my knowledge, lacks direct bacterial-genomics evidence for causal attribution; this should be framed as a hypothesis rather than a conclusion. In addition, the statement in lines 165-167 is inaccurate: at the genome level, Mtb has remained highly conserved since its most recent common ancestor-not specifically since the Industrial Revolution (PMID: 31448322).

      We accept these points and have made the suggested amendments, as outlined in the public responses. Our understanding is that the evidence about the most recent common ancestor is controversial; if the divergence of human populations occurred concurrently with that of Mtb, then this must have been significantly earlier than 6,000 years ago, and so there are conflicting arguments in this domain.

      Trained immunity and TB infection. The discussion of trained immunity could be expanded. Reference 59 suggests the induction of innate immune training, but reference 8 reports that revaccination does not confer protection against sustained TB infection, indicating that at least "re"-vaccination may not enhance protection. Furthermore, while BCG is often highlighted as a prototypical inducer of trained immunity, real-world infection occurs through Mtb itself. Importantly, a later study demonstrated that Mtb infection can also impart trained immunity (PMID: 33125891). Integrating these findings would provide a more nuanced view of how both vaccination and infection shape innate immune training in the TB context.

      We thank the reviewer for these suggestions and have edited the relevant section to include these studies.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public reviews:

      Reviewer #1 (Public review):

      In this important study, the authors characterized the transformation of neural representations of olfactory stimuli from the primary sensory cortex to multisensory regions in the medial temporal lobe and investigated how they were affected by non-associative learning. The authors used high-density silicon probe recordings from five different cortical regions while familiar vs. novel odors were presented to a head-restrained mouse. This is a timely study because unlike other sensory systems (e.g., vision), the progressive transformation of olfactory information is still poorly understood. The authors report that both odor identity and experience are encoded by all of these five cortical areas but nonetheless some themes emerge. Single neuron tuning of odor identity is broad in the sensory cortices but becomes narrowly tuned in hippocampal regions. Furthermore, while experience affects neuronal response magnitudes in early sensory cortices, it changes the proportion of active neurons in hippocampal regions. Thus, this study is an important step forward in the ongoing quest to understand how olfactory information is progressively transformed along the olfactory pathway.

      The study is well-executed. The direct comparison of neuronal representations from five different brain regions is impressive. Conclusions are based on single-neuron-level as well as population-level decoding analyses. Among all the reported results, one stands out for being remarkably robust. The authors show that the anterior olfactory nucleus (AON), which receives direct input from the olfactory bulb output neurons, was far superior at decoding odor identity as well as novelty compared to all the other brain regions. This is perhaps surprising because the other primary sensory region - the piriform cortex - has been thought to be the canonical site for representing odor identity. A vast majority of studies have focused on aPCx, but direct comparisons between odor coding in the AON and aPCx are rare. The experimental design of this current study allowed the authors to do so and the AON was found to convincingly outperform aPCx. Although this result goes against the canonical model, it is consistent with a few recent studies including one that predicted this outcome based on anatomical and functional comparisons between the AON-projecting tufted cells vs. the aPCx-projecting mitral cells in the olfactory bulb (Chae, Banerjee et al. 2022). Future experiments are needed to probe the circuit mechanisms that generate this important difference between the two primary olfactory cortices as well as their potential causal roles in odor identification.

      The authors were also interested in how familiarity vs. novelty affects neuronal representation across all these brain regions. One weakness of this study is that neuronal responses were not measured during the process of habituation. Neuronal responses were measured after four days of daily exposure to a few odors (familiar) and then some other novel odors were introduced. This creates a confound because the novel vs. familiar stimuli are different odorants and that itself can lead to drastic differences in evoked neural responses. Although the authors try to rule out this confound by doing a clever decoding and Euclidean distance analysis, an alternate more straightforward strategy would have been to measure neuronal activity for each odorant during the process of habituation.

      Reviewer #2 (Public review):

      This manuscript investigates how olfactory representations are transformed along the cortico-hippocampal pathway in mice during a non-associative learning paradigm involving novel and familiar odors. By recording single-unit activity in several key brain regions (AON, aPCx, LEC, CA1, and SUB), the authors aim to elucidate how stimulus identity and experience are encoded and how these representations change across the pathway.

      The study addresses an important question in sensory neuroscience regarding the interplay between sensory processing and signaling novelty/familiarity. It provides insights into how the brain processes and retains sensory experiences, suggesting that the earlier stations in the olfactory pathway, the AON and aPCx, play a central role in detecting novelty and encoding odor, while areas deeper into the pathway (LEC, CA1 & Sub) are sparser and encode odor identity but not novelty/familiarity. However, there are several concerns related to methodology, data interpretation, and the strength of the conclusions drawn.

      Strengths:

      The authors combine the use of modern tools to obtain high-density recordings from large populations of neurons at different stages of the olfactory system (although mostly one region at a time) with elegant data analyses to study an important and interesting question.

      Weaknesses:

      (1) The first and biggest problem I have with this paper is that it is very confusing, and the results seem to be all over the place. In some parts, it seems like the AON and aPCx are more sensitive to novelty; in others, it seems the other way around. I find their metrics confusing and unconvincing. For example, the example cells in Figure 1C show an AON neuron with a very low spontaneous firing rate and a CA1 neuron with a much higher firing rate, but the opposite is true in Figure 2A. So, what are we to make of Figure 2C, which shows the difference in firing rates between novel vs. familiar odors measured as a difference in spikes/sec? This seems nearly meaningless. The authors could have used a difference in Z-scored responses to normalize different baseline activity levels. (This is just one example of a problem with the methodology.)

      We appreciate the reviewer’s concerns regarding clarity and methodology. It is less clear why all neurons in a given brain area should have similar firing rates. Anatomically defined brain areas typically comprise multiple cell types, which can have diverse baseline firing rates. Since we computed absolute firing rate differences per neuron (i.e., novel vs. familiar odor responses within the same neuron), baseline differences across neurons do not have a major impact.

      The suggestion to use Z-scores instead of absolute firing rate differences is well taken. However, Z-scoring assumes that the underlying data are normally distributed, which is not the case in our dataset. Specifically, when analyzing odor-evoked firing rates on a per-neuron basis, only 4% of neurons exhibit a normal distribution. In cases of skewed distributions, Z-scoring can distort the data by exaggerating small variations, leading to misleading conclusions. While we acknowledge that different analysis methods exist, we believe that our chosen approach best reflects the properties of the dataset and avoids potential misinterpretations introduced by inappropriate normalization techniques.
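
      To make the point about skewed distributions concrete, the following is a minimal sketch, not the analysis code used in the manuscript. All firing rates are invented for illustration, and the function names (abs_diff, z_diff) and the choice to normalize by the baseline standard deviation are assumptions. It compares a sparsely firing neuron with a strongly right-skewed baseline to a high-rate neuron.

```python
import statistics


def abs_diff(novel, familiar):
    """Within-neuron difference in mean firing rate (spikes/s)."""
    return statistics.mean(novel) - statistics.mean(familiar)


def z_diff(novel, familiar, baseline):
    """The same difference, expressed in units of the baseline standard deviation."""
    sd = statistics.pstdev(baseline)
    if sd == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    return abs_diff(novel, familiar) / sd


# Hypothetical per-trial firing rates (spikes/s); not data from the study.
sparse = {
    "baseline": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # right-skewed, mostly silent
    "familiar": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    "novel": [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
}
active = {
    "baseline": [18, 22, 20, 25, 15, 21, 19, 23, 17, 20],
    "familiar": [20, 22, 19, 21, 23, 20, 18, 22, 21, 20],
    "novel": [24, 26, 23, 25, 27, 24, 22, 26, 25, 24],
}

for name, cell in (("sparse", sparse), ("active", active)):
    d = abs_diff(cell["novel"], cell["familiar"])
    z = z_diff(cell["novel"], cell["familiar"], cell["baseline"])
    print(f"{name}: absolute difference = {d:.2f} spikes/s, z-scored difference = {z:.2f}")
```

      In this toy example the absolute difference of the sparse neuron is ten times smaller than that of the active neuron, yet the two z-scored differences come out at a similar magnitude, because the near-silent baseline has a very small standard deviation.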

      (2) There are a lot of high-level data analyses (e.g., decoding, analyzing decoding errors, calculating mutual information, calculating distances in state space, etc.) but very little neural data (except for Figure 2C, and see my comment above about how this is flawed). So, if responses to novel vs. familiar odors are different in the AON and aPCx, how are they different? Why is decoding accuracy better for novel odors in CA1 but better for familiar odors in SUB (Figure 3A)? The authors identify a small subset of neurons that have unusually high weights in the SVM analyses that contribute to decoding novelty, but they don't tell us which neurons these are and how they are responding differently to novel vs. familiar odors.

      We performed additional analyses to address the reviewer’s feedback (Figures 2C-E and lines 118-132) and added more single-neuron data (Figures 1, S3 and S4).

      (3) The authors call AON and aPCx "primary sensory cortices" and LEC, CA1, and Sub "multisensory areas". This is a straw man argument. For example, we now know that PCx encodes multimodal signals (Poo et al. 2021, Federman et al., 2024; Kehl et al., 2024), and LEC receives direct OB inputs, which has traditionally been the criterion for being considered a "primary olfactory cortical area". So, this terminology is outdated and wrong, and although it suits the authors' needs here in drawing distinctions, it is simplistic and not helpful moving forward.

      We appreciate the reviewer’s concern regarding the classification of brain regions as “primary sensory” versus “multisensory.” Of note, the cited studies (Poo et al., 2021; Federman et al., 2024; Kehl et al., 2024) focus on posterior PCx (pPCx), while our recordings were conducted in a very anterior section of anterior PCx. The aPCx and pPCx have distinct patterns of connectivity, both anatomically and functionally. To the best of our knowledge, there is no evidence for multimodal responses in aPCx, whereas there is for LEC, CA1 and SUB. Furthermore, our distinction is not based on a connectivity argument, as the reviewer suggests, but on differences in the α-Poisson ratio (Figure 1E and F).

      To avoid confusion due to definitions of what constitutes a “primary sensory” region, we adopted a more neutral description throughout the manuscript.

      (4) Why not simply report z-scored firing rates for all neurons as a function of trial number? (e.g., Jacobson & Friedrich, 2018). Figure 2C is not sufficient.

      Regarding z-scores, please see our response to point (1). We further added a figure showing the responses of all neurons to novel stimuli, using ROC instead of z-scoring, as described previously (e.g., Cohen et al., Nature 2012). We added this figure to the supplementary material for completeness of the analysis (S2E).

      For example, in the Discussion, they say, "novel stimuli caused larger increases in firing rates than familiar stimuli" (L. 270), but what does this mean?

      This means that, on average, the population of neurons exhibits higher firing rates in response to novel odors than to familiar ones.

      Odors typically increase the firing in some neurons and suppress firing in others. Where does the delta come from? Is this because novel odors more strongly activate neurons that increase their firing or because familiar odors more strongly suppress neurons?

      We thank the reviewer for this valuable feedback and extended the characterization of firing rate properties, including a separate analysis of neurons i) significantly excited by odorants, ii) significantly inhibited by odorants and iii) not responsive to odorants. We added the analysis and corresponding discussion to the main manuscript (Figures 2C-E and lines 118-132).

      (5) Lines 122-124 - If cells in AON and aPCx responded the same way to novel and familiar odors, then we would say that they only encode for odor and not at all for experience. So, I don't understand why the authors say these areas code for a "mixed representation of chemical identity and experience." "On the other hand," if LEC, CA1, and SUB are odor selective and only encode novel odors, then these areas, not AON and aPCx, are the ones jointly encoding chemical identity and experience. Also, I do not understand why, here, they say that AON and PCx respond to both while LEC, CA1, and SUB were selective for novel stimuli, but the authors then go on to argue that novelty is encoded in the AON and PCx, but not in the LEC, CA1, and SUB.

      We appreciate the reviewer’s request for clarification. Throughout the brain areas we studied, odorant identity and experience can be decoded. However, the way information is represented is different between regions. We acknowledge that “mixed” representation is a misleading term and removed it from the manuscript.

      In AON and aPCx, neurons significantly respond to both novel and familiar odors. However, the magnitude of their responses to novel and familiar odors is sufficiently distinct to allow for decoding of odor experience (i.e., whether an odor is novel or familiar). Moreover, novelty engages more neurons in encoding the stimulus (Figure 2D). In neural space, the position of an odor’s representation in AON and aPCx shifts depending on whether it is novel or familiar, meaning that experience modifies the neural representation of odor identity. This suggests that in these regions the two representations are intertwined.

      In contrast, some neurons in LEC, CA1, and SUB exhibit responses to novel odors, but few neurons respond to familiar odors at all. This suggests a more selective encoding of novelty.

      (6) Lines 132-140 - As presented in the text and the figure, this section is poorly written and confusing. Their use of the word "shuffled" is a major source of this confusion, because this typically is the control that produces outcomes at the chance level. More importantly, they did the wrong analysis here. The better and, I think, the only way to do this analysis correctly is to train on some of the odors and test on an untrained odor (i.e., what Bernardi et al., 2021 called "cross-condition generalization performance"; CCGP).

      We appreciate the feedback and thank the reviewer for the recommendation to implement cross-condition generalization performance (CCGP) as used in Bernardi et al., 2020. We acknowledge that the term "shuffled" may have caused confusion, as it typically refers to control analyses producing chance-level outcomes. In our case, by "shuffling" we shuffled the identity of novel and familiar odors to assess how much the decoder relies on odor identity when distinguishing novelty. This test provided insight into how novelty-based structure exists within neural activity beyond random grouping but does not directly assess generalization.

      As suggested, we used CCGP to measure how well novelty-related representations generalize across different odors. Our findings show that in AON and aPCx, novelty-related information is indeed highly generalizable, supporting the idea that these regions encode novelty in a less odor-selective manner (Figure 2K).
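
      The logic of CCGP can be illustrated with a minimal sketch: a novel-vs-familiar decoder is fit on some odors and scored only on odors it never saw. The example below is an illustration under stated assumptions, not the analysis used in the manuscript. It swaps in a nearest-centroid classifier for the SVM, and the function names (`centroid`, `sq_dist`, `ccgp`) and the toy population vectors are hypothetical.

```python
from itertools import product

def centroid(vectors):
    """Mean population vector across trials."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two population vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ccgp(novel, familiar):
    """Cross-condition generalization of a novel-vs-familiar decoder.

    novel, familiar: dict mapping odor name -> list of trial vectors.
    Each class needs at least two odors so that one can be held out.
    For every (held-out novel odor, held-out familiar odor) pair, class
    centroids are computed from the remaining odors and the held-out
    trials are classified by the nearest centroid.
    """
    scores = []
    for n_out, f_out in product(novel, familiar):
        train_n = [v for o, trials in novel.items() if o != n_out for v in trials]
        train_f = [v for o, trials in familiar.items() if o != f_out for v in trials]
        c_n, c_f = centroid(train_n), centroid(train_f)
        correct = sum(sq_dist(v, c_n) < sq_dist(v, c_f) for v in novel[n_out])
        correct += sum(sq_dist(v, c_f) < sq_dist(v, c_n) for v in familiar[f_out])
        total = len(novel[n_out]) + len(familiar[f_out])
        scores.append(correct / total)
    return sum(scores) / len(scores)

# Hypothetical data: neuron 0 carries a novelty signal shared across odors,
# neurons 1 and 2 carry odor identity.
novel = {
    "N1": [[5, 1, 0], [6, 1, 0]],
    "N2": [[5, 0, 1], [6, 0, 1]],
    "N3": [[5, 1, 1], [6, 1, 1]],
}
familiar = {
    "F1": [[1, 2, 0], [2, 2, 0]],
    "F2": [[1, 0, 2], [2, 0, 2]],
    "F3": [[1, 2, 2], [2, 2, 2]],
}
print(ccgp(novel, familiar))
```

      In this toy case the novelty signal is shared across odors, so the decoder generalizes to held-out odors. A code in which novelty is entangled with odor identity would be expected to generalize less well.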

      Reviewer #3 (Public review):

      In this manuscript, the authors investigate how odor-evoked neural activity is modulated by experience within the olfactory-hippocampal network. The authors perform extracellular recordings in the anterior olfactory nucleus (AON), the anterior piriform (aPCx) and lateral entorhinal cortex (LEC), the hippocampus (CA1), and the subiculum (SUB), in naïve mice and in mice repeatedly exposed to the same odorants. They determine the response properties of individual neurons and use population decoding analyses to assess the effect of experience on odor information coding across these regions.

      The authors' findings show that odor identity is represented in all recorded areas, but that the response magnitude and selectivity of neurons are differentially modulated by experience across the olfactory-hippocampal pathway.

      Overall, this work represents a valuable multi-region data set of odor-evoked neural activity. However, limitations in the interpretability of odor experience of the behavioral paradigm, and limitations in experimental design and analysis, restrict the conclusions that can be drawn from this study.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Some suggestions, in no particular order, to further improve the manuscript:

      (1) The example neuronal responses for CA1 and SUB in Figure 1 are not very inspiring. To my eyes, the odor period response is not that different from the baseline period. In general, a thorough characterization of firing rate properties during the odor period between the different brain regions would be informative.

      We thank the reviewer for this valuable feedback. We have replaced the example neurons from CA1 and SUB in Figure 1C. We further extended the characterization of firing rate properties, including a separate analysis of neurons i) significantly excited by odorants, ii) significantly inhibited by odorants and iii) not responsive to odorants. We added the analysis and corresponding discussion to the main manuscript (Figures 2C-E and lines 118-132).

      (2) For the summary in Figure 1, why not show neuronal responses as z-scored firing rates as opposed to auROC?

      We chose to use auROC instead of z-scored firing rates due to the non-normality of the dataset, which can distort results when using z-scores. Specifically, z-scoring can exaggerate small deviations in neurons with low responsiveness, potentially leading to misleading conclusions. auROC provides a more robust measure of response change that is less sensitive to these distortions because it does not assume any specific distribution. This approach has been used previously (e.g. Cohen et al. 2012, Nature).
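
      The contrast between the two measures can be sketched on toy data. The example below is a minimal illustration, assuming per-trial spike counts in a baseline and an odor window. The function names (`auroc`, `z_score`) and the counts are hypothetical and are not taken from the manuscript's analysis code.

```python
from statistics import mean, pstdev

def auroc(baseline, evoked):
    """Area under the ROC curve for evoked vs. baseline spike counts.

    Equals the probability that a randomly drawn evoked count exceeds a
    randomly drawn baseline count, with ties counted as one half.
    The result is rank-based and always lies between 0 and 1.
    """
    wins = 0.0
    for e in evoked:
        for b in baseline:
            if e > b:
                wins += 1.0
            elif e == b:
                wins += 0.5
    return wins / (len(evoked) * len(baseline))

def z_score(baseline, evoked):
    """Mean evoked count in units of the baseline standard deviation."""
    sd = pstdev(baseline)
    if sd == 0:
        # A silent or perfectly regular baseline leaves the z-score undefined.
        raise ValueError("baseline has zero variance; z-score is undefined")
    return (mean(evoked) - mean(baseline)) / sd

# Hypothetical spike counts over 10 trials for a sparsely firing neuron.
baseline = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
evoked = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1]

print(z_score(baseline, evoked))  # scaled by a very small baseline SD
print(auroc(baseline, evoked))    # bounded, uses only the rank order
```

      Because auROC depends only on the ordering of the counts, it remains defined when the baseline is completely silent, whereas the z-score does not.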

      (3) To study novelty, the authors presented odorants that were not used during four days of habituation. But this design makes it hard to dissociate odor identity from novelty. Why not track the response of the same odorants during the habituation process itself?

      We respectfully disagree with the argument that using different stimuli as novel and familiar constitutes a confound in our analysis. In our study, we used multiple different, structurally dissimilar single molecule chemicals which were randomly assigned to novel and familiar categories in each animal. If individual stimuli did cause “drastic differences in evoked neural responses”, these would be evenly distributed between novel and familiar stimuli. It is therefore extremely unlikely that the clear differences we observed between novel and familiar conditions and between brain areas can be attributed to the contribution of individual stimuli, in particular given that our analyses were performed at the population level. In fact, we observed that responses between novel and familiar conditions were qualitatively very similar in the short time window after odor onset (Figure 1G and H).

      Importantly, the goal of this study was to investigate the impact of long-term habituation over more than 4 days, rather than short-term habituation during one behavioral session. However, tracking the activity of large numbers of neurons across multiple days presents a significant technical challenge, due to the difficulty of identifying stable single-unit recordings over extended periods of time with sufficient certainty. Tools that facilitate tracking have recently been developed (e.g. Yuan AX et al., eLife, 2024) and it will be interesting to apply them to our dataset in the future.

      (4) Since novel odors lead to greater sniffing and sniffing strongly influences firing rates in the olfactory system, the authors decided to focus on a 400 ms window with similar sniffing rates for both novel vs. familiar odors. Although I understand the rationale for this choice, I worry that this is too restrictive, and it may not capture the full extent of the phenomenology.

      Could the authors model the effect of sniffing on firing rates of individual neurons from the data, and then check whether the odor response for novel context can be fully explained just by increased sniffing or not?

      It is an interesting suggestion to extend the window of analysis and observe how responses evolve with sniffing (and other behavioral reactions). To address this, we added an additional figure to the supplementary material, showing the mean responses of all neurons to novel stimuli during the entire odor presentation window (Fig. S1B).

      As suggested, we further created a Generalized Linear Model (GLM) for the entire 2 s odor stimulation period, incorporating sniffing and novelty as independent variables. As expected, sniffing had a dominant impact on firing rate in all brain areas. A smaller proportion of neurons was modulated by novelty or by the novelty x breathing interaction, suggesting the entrainment of neural activity by sniffing during the response to novel odors. These results support our decision to focus the analysis on the early 400 ms window in order to dissociate the effects of novelty and behavioral responses. Taken together, our results suggest that odorant responses are modulated by novelty early during odorant processing, whereas at later stages sniffing becomes the predominant factor driving firing (Figure S2C-D).
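
      One way such a model could be written down is sketched below. The response does not state the distribution family or link function, so the Poisson form with a log link is an assumption, as are all symbol names: \lambda_i(t) is the expected firing rate of neuron i on trial t, s(t) the sniff rate, and n(t) an indicator that is 1 for novel and 0 for familiar odors.

```latex
\log \lambda_i(t) = \beta_{0,i} + \beta_{1,i}\, s(t) + \beta_{2,i}\, n(t) + \beta_{3,i}\, s(t)\, n(t)
```

      Under this reading, a neuron would count as modulated by sniffing, by novelty, or by their interaction when \beta_{1,i}, \beta_{2,i}, or \beta_{3,i}, respectively, differs significantly from zero.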

      (5) The authors conclude that aPCx has a subset of neurons dedicated to familiar odors based on the distribution of SVM weights in Figure 3D. To me, this is the weakest conclusion of the paper because although significant, the effect size is paltry; the central tendencies are hardly different for the two conditions in aPCx. Could the authors show the PSTHs of some of these neurons to make this point more convincing?

      We appreciate the reviewer’s concern regarding the effect size. To strengthen our conclusion, we now include PSTHs of representative neurons in the bottom 10% and top 10% of the neuronal population based on the SVM analysis (Figures S3 and S4). We hope this provides more clarity and support for the interpretation that there is a subset of neurons in aPCx that show greater sensitivity to familiar odors, despite the relatively modest central tendency differences.

      In the revised manuscript, we discuss the effect size more explicitly in the text to provide context for its significance (lines 193 - 195).

      Reviewer #2 (Recommendations for the authors):

      (1) The authors only talk about "responsive" neurons. Does this include neurons whose activity increases significantly (activated) and neurons whose activity decreases (suppressed)?

      Yes, the term "responsive" refers to neurons whose activity either increases significantly (excited) or decreases (inhibited) in response to the odor stimuli. We performed additional analyses to characterize responses separately for the different groups (Figure 2C-E and lines 118-132).

      (2) Line 54 - The Schoonover paper doesn't show that cells lose their responses to odors, but rather that the population of cells that respond to odors changes with time. That is, population responses don't become more sparse.

      The fact that “the population of cells that respond to odors changes with time” implies that some neurons lose their responsiveness (e.g. unit 2 in Figure 1 of Schoonover et al., 2021), while others become responsive (e.g. unit 1 in Figure 1 of Schoonover et al., 2021). Frequent responses reduce drift rate (Figure 4 of Schoonover et al., 2021), thus fewer neurons lose or gain responsiveness. We have revised the manuscript to clarify this.

      (3) Line 104 - "Recurrent" is incorrectly used here. I think the authors mean "repeated" or something more like that.

      Thank you for pointing this out. We replaced "recurrent" with "repeated".

      (4) Figure 3D - What is the scale bar here?

      We apologize for the accidental omission. The scale bar has been added to Figure 3D in the revised version of the manuscript.

      (5) Line 377 - They say they lowered their electrodes to "200 um/s per second." This must be incorrect. Is this just a typo, or is it really 200 um/s, because that's really fast?

      Thank you for pointing this out. It was 20 to 60 um/s; the change has been made in the manuscript.

      (6) Line 431: The authors say they used auROC to calculate changes in firing rates (which I think is only shown in Figure 1D). Note that auROC measures the discriminability of two distributions, not the strength or change in the strength of response.

      Indeed, we used auROC to measure the discriminability of firing between the baseline and the stimulus response period. We have corrected the wording in the methods.

      (7) Figure 1B: The anatomical locations of the five areas they recorded from are straightforward, and this figure is not hugely helpful. However, the reader would benefit tremendously by including an experimental schematic. As is, we needed to scour the text and methods sections to understand exactly what they did when.

      We thank the reviewer for this suggestion. We included an experimental schematic in the supplementary material.

      (8) Figure 1F (left): This plot is much less useful without showing a pre-odor window, even if only times after the odor onset were used for calculating alpha.

      We appreciate this concern; however, the goal of Figure 1F is to illustrate the meaning of the alpha value itself. We chose not to include a pre-odor window comparison to avoid confusing the reader.

      (9) Figure 2A: What are the bar plots above the raster plots? Are these firing rates? Are the bars overlaid or stacked? Where is the y-axis scale bar?

      The bar plots above the raster plots represent a histogram of the spike count/trials over time, with a bin width of 50 ms. These bars are overlaid on the raster plot. We will include a y-axis scale bar in the revised figure to clarify the presentation.

      (10) Figure 4G: This makes no sense. First, the Y axis is supposed to measure standard deviation, but the axis label is spikes/s. Second, if responses in the AON are much less reliable than responses in "deeper" areas, why is odor decoding in AON so much better than in the other areas?

      We acknowledge the error in the axis label, and we will correct it to indicate the correct units. AON has a larger response variability but also larger response magnitudes, which can explain the higher decoding accuracy.

      (11) From the model and text, one predicts that the lifetime sparseness increases along the pathway. The authors should use this metric as well/instead of "odor selectivity" because of problems with arbitrary thresholding.

      We acknowledge that lifetime sparseness, often computed using lifetime kurtosis, can be an informative measure of selectivity. However, we believe it has limitations that make it less suitable for our analysis. One key issue is that lifetime sparseness does not account for the stability of responses across multiple presentations of the same stimulus. In contrast, our odor selectivity measure incorporates trial-to-trial variability by considering responses over 10 trials and assessing significance using a Wilcoxon test compared to baseline. While the choice of a p-value threshold (e.g., 0.05) is somewhat arbitrary, it is a widely accepted statistical convention. Additionally, lifetime sparseness does not distinguish between excitatory and inhibitory responses. For example, if a neuron X is strongly inhibited by odor A, strongly excited by odor B, and unresponsive to odors C and D, lifetime sparseness would classify it as highly selective for odor B, without capturing its inhibitory selectivity for odor A. The lifetime sparseness will be higher than if X were simply unresponsive to A.

      Our odor selectivity measure addresses this by considering both excitation and inhibition as potential responses. Thus, while lifetime sparseness could provide a useful complementary perspective in another type of dataset, it does not fully capture the dynamics of odor selectivity here.
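
      The neuron X example can be worked through numerically. The sketch below is an illustration only and swaps in a different definition: it uses the Vinje-Gallant (Treves-Rolls) form of lifetime sparseness rather than the kurtosis-based variant mentioned above, which may behave differently. The function name `lifetime_sparseness`, the baseline rate of 5 spikes/s, and the firing rates are hypothetical.

```python
def lifetime_sparseness(rates):
    """Vinje-Gallant lifetime sparseness of one neuron across stimuli.

    S = (1 - mean(r)^2 / mean(r^2)) / (1 - 1/N), where r are the mean
    firing rates to each of N stimuli. S is 0 for identical responses
    to all stimuli and 1 for a response to a single stimulus only.
    """
    n = len(rates)
    mean_r = sum(rates) / n
    mean_sq = sum(r * r for r in rates) / n
    if mean_sq == 0:
        # Convention chosen for this sketch: a silent neuron is not selective.
        return 0.0
    return (1 - mean_r ** 2 / mean_sq) / (1 - 1 / n)

# Hypothetical mean rates (spikes/s) to odors A, B, C, D; baseline is 5.
inhibited_by_a = [0, 10, 5, 5]     # A suppresses, B excites, C and D do nothing
unresponsive_to_a = [5, 10, 5, 5]  # only B excites

print(lifetime_sparseness(inhibited_by_a))
print(lifetime_sparseness(unresponsive_to_a))
```

      Under this definition the inhibited neuron scores as sparser than the neuron that ignores odor A, yet the single number does not reveal that part of the score comes from suppression rather than excitation.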

      Author response 1.

      Lifetime Kurtosis distribution per region.

      Reviewer #3 (Recommendations for the authors):

      Main points:

      (1) The authors use a non-associative learning paradigm - repeated odor exposure - to test how experience modulates odor responses along the olfactory-hippocampal pathway. While repeated odor exposure clearly modulates odor-evoked neural activity, the relevance of this modulation and its differential effect across different brain areas are difficult to assess in the absence of any behavioral read-outs.

      Our experimental paradigm involves a robust, reliable behavioral readout of non-associative learning. Novel olfactory stimuli evoke a well-characterized orienting reaction, which includes a multitude of physiological reactions, including exploratory sniffing, facial movements and pupil dilation (Modirshanechi et al., Trends Neuroscience 2023). In our study, we focused on exploratory sniffing.

      Compared to associative learning, non-associative learning might have received less attention. However, it is critically important because it forms the foundation for how organisms adapt to their environment through experience without forming associations. This is highlighted by the fact that non-instrumental stimuli can be remembered in large numbers (Standing, 1973) and with remarkable detail (Brady et al., 2008). While non-associative learning can thus create vast, implicit memory of stimuli in the environment, it is unclear how stimulus representations reflect this memory. Our study contributes to answering this question. We describe the impact of experience on olfactory sensory representations and reveal a transformation of representations from olfactory cortical to hippocampal structures. Our findings also indicate that sensory responses to familiar stimuli persist within sensory cortical and hippocampal regions, even after spontaneous orienting behaviors have habituated. Further studies involving experimental manipulation techniques are needed to elucidate the causal mechanisms underlying the formation of stimulus memory during non-associative learning.

      (2) The authors discuss the olfactory-hippocampal pathway as a transition from primary sensory (AON, aPCx) to associative areas (LEC, CA1, SUB). While this is reasonable, given the known circuit connectivity, other interpretations are possible. For example, AON, aPCx, and LEC receive direct inputs from the olfactory bulb ('primary cortex'), while CA1 and SUB do not; AON receives direct top-down inputs from CA1 ('associative cortex'), while aPCx does not. In fact, the data presented in this manuscript does not appear to support a consistent, smooth transformation from sensory to associative, as implied by the authors (e.g. Figure 4A, F, and G).

      Thank you for this insightful comment. Indeed, there are complexities in the circuitry, and the relationships between different areas are not linear. We believe that AON and aPCx are distinctly different from LEC, CA1 and SUB, as the latter areas have been shown to integrate multimodal sensory information. To avoid confusion due to definitions of what constitutes a “primary sensory” region, we adopted a more neutral description throughout the manuscript. We also removed the term “gradual” to describe the transition of neural representations from olfactory cortical to hippocampal areas.

      (3) The analysis of odor-evoked responses is focused on a 400 ms window to exclude differences in sniffing behavior. This window spans 200 ms before and after the first inhalation after odor onset. Inhalation onset initiates neural odor responses - why do the authors include neural data before inhalation onset?

      The reason to include a brief time window prior to odor onset is to account for what are often called “partial” sniffs. In our experimental setup, odor delivery is not triggered by the animal’s inhalation. Therefore, it can happen that an animal has just begun to inhale when the stimulus is delivered. In this case, the animal is exposed to odorant molecules prior to the first complete inhalation after odor onset. We acknowledge that this limits the temporal resolution of our measurements, but it does not affect the comparison of sensory representations between different brain areas.

      It would also be interesting to explore the effect of sniffing behavior (see point 2) on odor-evoked neural activity.

      Thank you for your comment. We performed an additional analysis, including a GLM, to address this question (Figure S2C-D).

      Minor points:

      (4) Figure 2A represents raster plots for 2 neurons per area - it is unclear how to distinguish between the 2 neurons in the plots.

      Figure 2A shows one example neuron per brain area. Each neuron has two raster plots, which indicate responses to either a novel (orange) or a familiar (blue) stimulus. We have revised the figure caption for clarity.

      (5) Overall, axes should be kept consistent and labeled in more detail. For example, Figure 2H and I are difficult to compare, given that the y-axis changes and that decoding accuracies are difficult to estimate without additional marks on the y-axis.

      The axes are indeed different, because the chance-level decoding accuracy differs between those two figures. Decoding between novel and familiar odors has a chance level of 0.5, while the chance level for decoding odor identity is 0.1 (there are 10 odors from which to decode the identity).

      (6) Some parts of the discussion seem only loosely related to the data presented in this manuscript. For example, the statement that 'AON rather than aPCx should be considered as the primary sensory cortex in olfaction' seems out of context. Similarly, it would be helpful to provide data on the stability of subpopulations of neurons tuned to familiar odors, rather than simply speculate that they could be stable. The authors could summarize more speculative statements in an 'Ideas and Speculation' subsection.

      Thank you for your comment. We appreciate your perspective on our hypotheses. We have revised the discussion accordingly. Specifically, we removed the discussion of stable subpopulations, since we have not performed longitudinal tracking in this study.

      (7) The authors should try to reference relevant published work more comprehensively.

      Thank you for your comment. We attempted to include relevant published work without exceeding the limit for references but might have overlooked important contributions. We apologize to our colleagues whose relevant work might not have been cited.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The main contributions of this paper are: (1) a replication of the surprising prior finding that information about peripherally-presented stimuli can be decoded from foveal V1 (Williams et al 2008), (2) a new demonstration of cross-decoding between stimuli presented in the periphery and stimuli presented at the fovea, (3) a demonstration that the information present in the fovea is based on shape not semantic category, and (4) a demonstration that the strength of foveal information about peripheral targets is correlated with the univariate response in the same block in IPS.

      Strengths:

      The design and methods appear sound, and finding (2) above is new, and importantly constrains our understanding of this surprising phenomenon. The basic effect investigated here is so surprising that even though it has been replicated several times since it was first reported in 2008, it is useful to replicate it again.

      We thank the reviewer for their summary. While we agree with many points, we would like to respectfully push back on the notion that this work is a replication of Williams et al. (2008). What our findings share with those of Williams is a report of surprising decoding at the fovea without foveal stimulation. Beyond this similarity, we treat these as related but clearly separate findings, for the following reasons:

      (1) Foveal feedback, as shown by Williams et al. (2008) and others during fixation, was only observed during a shape discrimination task, specific to the presented stimulus. Control experiments without such a task (or a color-related task) did not show effects of foveal feedback. In contrast, in the present study, the participants’ task was merely to perform saccades towards stimuli, independently of target features. We thus show that foveal feedback can occur independently of a task related to stimulus features. This dissociation demonstrates that our study must be tapping into something different than reported by Williams.

      (2) In a related study, Kroell and Rolfs (2022, 2025) demonstrated a connection between foveal feedback and saccade preparation, including the temporal details of the onset of this effect before saccade execution, highlighting the close link of this effect to saccade preparation. Here we used a very similar behavioral task to capture this saccade-related effect in neural recordings and investigate how early it occurs and what its nature is. Thus, there is a clear motivation for this study in the context of eye movement preparation that is separate from the previous work by Williams.

      (3) Lastly, decoding in the experimental task was positively associated with activity in FEF and IPS, areas that have been reliably linked to saccade preparation. We have now also performed an additional analysis (see our response to Specific point 2 of Reviewer 2) showing that decoding in the control condition did not show the same association, further supporting the link of foveal feedback to saccade preparation. 

      Despite our emphasis on these critical differences between studies, covert peripheral attention, as required by the task in Williams et al., and saccade preparation in natural vision, as in our study, are tightly coupled processes. Indeed, the task in Williams et al. would, during natural vision, likely involve an eye movement to the peripheral target. While speculative, a parsimonious and ecologically valid explanation is that both our study and earlier studies involve eye movement preparation, although execution is suppressed in studies enforcing fixation (e.g., Williams et al., 2008). We now discuss this idea of a shared underlying mechanism more extensively in the revised manuscript (pg 8 ln 228-240).

      Weaknesses:

      (1) The paper, including in the title ("Feedback of peripheral saccade targets to early foveal cortex") seems to assume that the feedback to foveal cortex occurs in conjunction with saccade preparation. However, participants in the original Williams et al (2008) paper never made saccades to the peripheral stimuli. So, saccade preparation is not necessary for this effect to occur. Some acknowledgement and discussion of this prior evidence against the interpretation of the effect as due to saccade preparation would be useful. (e.g., one might argue that saccade preparation is automatic when attending to peripheral stimuli.)

      We agree that the effects Williams et al. showed were not sufficiently discussed in the first version of this manuscript. To engage more clearly with these findings, we now introduce saccade-related foveal feedback (foveal prediction) and foveal feedback during fixation separately in the introduction (pg 2 ln 46-59).

      We further added another section in the discussion called “Foveal feedback during saccade preparation” in which we discuss how our findings are related to Williams et al. and how they differ (pg 8 ln 211-240). 

      As described in our previous response, we believe that our findings go beyond those described by Williams et al. (2008) and others in significant ways. However, during natural vision, the paradigm used by Williams et al. (2008) would likely be solved using an eye movement. Thus, while participants in Williams et al. (2008) did not execute saccades, it appears plausible that they have prepared saccades. Given the fact that covert peripheral attention and saccade preparation are tightly coupled processes (Kowler et al., 1995, Vis Res; Deubel & Schneider, 1996, Vis Res; Montagnini & Castet, 2007, J Vis; Rolfs & Carrasco, 2012, J Neurosci; Rolfs et al., 2011, Nat Neurosci), their results are parsimoniously explained by saccade preparation (but not execution) to a behaviorally relevant target.

      (2) The most important new finding from this paper is the cross-decodability between stimuli presented in the fovea and stimuli presented in the periphery. This finding should be related to the prior behavioral finding (Yu & Shim, 2016) that when a foveal foil stimulus identical to a peripheral target is presented 150 ms after the onset of the peripheral target, visual discrimination of the peripheral target is improved, and this congruency effect occurred even though participants did not consciously perceive the foveal stimulus (Yu, Q., & Shim, W. M., 2016). Modulating foveal representation can influence visual discrimination in the periphery (Journal of Vision, 16(3), 15-15).

      We thank the reviewer for highlighting this highly relevant reference. In the revised version of the manuscript, we now put more emphasis on the finding of cross-decodability (pg 2 ln 60-61). We now also discuss Yu and Shim's (2016) finding, which supports our conclusion that foveal feedback and direct stimulus presentation share representational formats in early visual areas (pg 9 ln 277-279).
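
      The idea of cross-decodability discussed above can be illustrated with a minimal sketch: a classifier is trained on activity patterns from one condition (e.g., direct foveal presentation) and tested on patterns from another (e.g., feedback during saccade preparation). The nearest-centroid classifier, the function names, and the three-voxel toy patterns below are illustrative assumptions; they are not the authors' decoding pipeline or data.

```python
def centroid(patterns):
    """Mean pattern across trials (one value per voxel)."""
    n = len(patterns)
    return [sum(col) / n for col in zip(*patterns)]


def sq_dist(a, b):
    """Squared Euclidean distance between two voxel patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_decode(train, test):
    """Train a nearest-centroid classifier on `train` and score it on `test`.

    Both arguments map a stimulus label to a list of voxel patterns.
    Returns the proportion of test patterns assigned to the correct label.
    """
    centroids = {label: centroid(p) for label, p in train.items()}
    correct = total = 0
    for label, patterns in test.items():
        for p in patterns:
            predicted = min(centroids, key=lambda c: sq_dist(centroids[c], p))
            correct += predicted == label
            total += 1
    return correct / total


# Hypothetical toy data: patterns evoked by direct presentation (training set)
direct = {
    "A": [[1.0, 0.0, 0.2], [0.9, 0.1, 0.1]],
    "B": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.2]],
}
# Hypothetical weaker patterns from the feedback condition (test set)
feedback = {
    "A": [[0.6, 0.3, 0.1]],
    "B": [[0.2, 0.5, 0.2], [0.3, 0.6, 0.1]],
}

accuracy = cross_decode(direct, feedback)
print(accuracy)
```

      Above-chance accuracy in such a scheme is only expected if the two conditions share a representational format, which is the logic behind the cross-decoding argument.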

      (3) The prior literature should be laid out more clearly. For example, most readers will not realize that the basic effect of decodability of peripherally-presented stimuli in the fovea was first reported in 2008, and that that original paper already showed that the effect cannot arise from spillover effects from peripheral retinotopic cortex because it was not present in a retinotopic location between the cortical locus corresponding to the peripheral target and the fovea. (For example, this claim on lines 56-57 is not correct: "it remains unknown 1) whether information is fed back all the way to early visual areas".) What is needed is a clear presentation of the prior findings in one place in the introduction to the paper, followed by an articulation and motivation of the new questions addressed in this paper. If I were writing the paper, I would focus on the cross-decodability between foveal and peripheral stimuli, as I think that is the most revealing finding.

      We agree that the structure of the introduction did not sufficiently place our work in the context of prior literature. We have now expanded upon our Introduction section to discuss past studies of saccade- and fixation-related foveal feedback (pg 2 ln 49-59), laying out how this effect has been studied previously. We also removed the claim that "it remains unknown 1) whether information is fed back all the way to early visual areas", where our intention was to specifically focus on foveal prediction. We realize that this was not clear and hence removed this section. Instead, we now place a stronger focus on the cross-decodability finding (pg 2 ln 60-61).

      Reviewer #2 (Public review):

      Summary:

      This study investigated whether the identity of a peripheral saccade target object is predictively fed back to the foveal retinotopic cortex during saccade preparation, a critical prediction of the foveal prediction hypothesis proposed by Kroell & Rolfs (2022). To achieve this, the authors leveraged a gaze-contingent fMRI paradigm, where the peripheral saccade target was removed before the eyes landed near it, and used multivariate decoding analysis to quantify identity information in the foveal cortex. The results showed that the identity of the saccade target object can be decoded based on foveal cortex activity, despite the fovea never directly viewing the object, and that the foveal feedback representation was similar to passive viewing and not explained by spillover effects. Additionally, exploratory analysis suggested IPS as a candidate region mediating such foveal decodability. Overall, these findings provide neural evidence for the foveal cortex processing the features of the saccade target object, potentially supporting the maintenance of perceptual stability across saccadic eye movements.

      Strengths:

      This study is well-motivated by previous theoretical findings (Kroell & Rolfs, 2022), aiming to provide neural evidence for a potential neural mechanism of trans-saccadic perceptual stability. The question is important, and the gaze-contingent fMRI paradigm is a solid methodological choice for the research goal. The use of stimuli allowing orthogonal decoding of stimulus category vs stimulus shape is a nice strength, and the resulting distinctions in decoded information by brain region are clean. The results will be of interest to readers in the field, and they fill in some untested questions regarding pre-saccadic remapping and foveal feedback.

      We thank the reviewer for the positive assessment of our study.

      Weaknesses:

      The conclusions feel a bit over-reaching; some strong theoretical claims are not fully supported, and the framing of prior literature is currently too narrow. A critical weakness lies in the inability to test a distinction between these findings (claiming to demonstrate that "feedback during saccade preparation must underlie this effect") and foveal feedback previously found during passive fixation (Williams et al., 2008). Discussions (and perhaps control analysis/experiments) about how these findings are specific to the saccade target and the temporal constraints on these effects are lacking. The relationship between the concepts of foveal prediction, foveal feedback, and predictive remapping needs more thorough treatment. The choice to use only 4 stimuli is justified in the manuscript, but remains an important limitation. The IPS results are intriguing but could be strengthened by additional control analysis. Finally, the manuscript claims the study was pre-registered ("detailing the hypotheses, methodology, and planned analyses prior to data collection"), but on the OSF link provided, there is just a brief summary paragraph, and the website says "there have been no completed registrations of this project".

      We thank the reviewer for these helpful considerations. We agree that some of the claims were not sufficiently supported by the evidence, and in the revised manuscript, we added nuance to those claims (pg 8 ln 211-240). Furthermore, we now address more directly the distinction between foveal feedback during fixation and foveal feedback (foveal prediction) during saccade preparation. In particular, we now describe the literature about these two effects separately in the introduction (pg 2 ln 46-59), and we have added a new section in the discussion (“Foveal feedback during saccade preparation”) that more thoroughly explains why a passive fixation condition would have been unlikely to produce the same results we find (pg 8 ln 211-227). We also adapted the section about “Saccadic remapping or foveal prediction”, clearly delineating foveal prediction from feature remapping and predictive updating of attention pointers. As recommended by the reviewer, we conducted the parametric modulation analyses on the control condition, strengthening the claim that our findings are saccade-related. These results were added as Supplementary Figure 2 and are discussed on pg 7 ln 190-191 and pg 8 ln 224-227.

      Lastly, we would like to apologize for a mistake we made with the pre-registration. We realized that the pre-registration had indeed not been submitted. We have now submitted it without changing the pre-registration itself, which can be seen from the recent activity of the pre-registration (screenshot attached at the end). After consulting an open science expert at the University of Leipzig, we added a note about this mistake to the methods section of the revised manuscript (pg 10 ln 326-332). We could remove the reference to this pre-registration altogether, but would leave this to the discretion of the editor.

      Specifics:

      (1) In the eccentricity-dependent decoding results (Figure 2B), are there any statistical tests to support the results being a U-shaped curve? The dip isn't especially pronounced. Is 4 degrees lower than the further ones? Are there alternative methods of quantifying this (e.g., fitting it to a linear and quadratic function)?

      We statistically tested the U-shaped relationship using a weighted quadratic regression, which showed significant positive curvature for decoding between fovea and periphery in all early visual areas (V1: t(27) = 3.98, p = 0.008; V2: t(27) = 3.03, p = 0.02; V3: t(27) = 2.776, p = 0.025; one-sided). We now report these results in the revised manuscript (pg 5 ln 137-138).
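
      As a rough illustration of what testing for positive curvature involves, the sketch below fits y = b0 + b1*x + b2*x^2 by weighted least squares and returns a t statistic for b2; a positive b2 corresponds to a U shape. The eccentricities, accuracies, and weights are hypothetical, and the t statistic here uses the residual degrees of freedom of a single fitted curve. The group-level test reported above (t(27)) may have been set up differently.

```python
from math import sqrt


def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def weighted_quadratic_fit(x, y, w):
    """Weighted least squares for y = b0 + b1*x + b2*x**2.

    Returns (beta, t_b2), where t_b2 is the t statistic of the curvature term.
    """
    rows = [[1.0, xi, xi * xi] for xi in x]
    xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, rows)) for k in range(3)]
            for j in range(3)]
    xtwy = [sum(wi * r[j] * yi for wi, r, yi in zip(w, rows, y)) for j in range(3)]
    beta = solve_linear(xtwx, xtwy)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for yi, r in zip(y, rows)]
    dof = len(x) - 3
    sigma2 = sum(wi * e * e for wi, e in zip(w, resid)) / dof
    # Var(b2) = sigma2 * [inverse(X'WX)]_22; solving against e3 gives that column
    inv_col = solve_linear(xtwx, [0.0, 0.0, 1.0])
    return beta, beta[2] / sqrt(sigma2 * inv_col[2])


# Hypothetical data: decoding accuracy dips at intermediate eccentricities
ecc = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noise = [0.002, -0.001, 0.001, -0.002, 0.001, 0.0, -0.001]
acc = [0.6 - 0.05 * e + 0.01 * e * e + d for e, d in zip(ecc, noise)]
weights = [1.0, 1.0, 2.0, 2.0, 2.0, 1.0, 1.0]  # e.g., more voxels per bin

beta, t_b2 = weighted_quadratic_fit(ecc, acc, weights)
print(beta[2] > 0, t_b2 > 0)
```

      A one-sided test then asks whether the curvature estimate is reliably greater than zero; comparing the fit against a purely linear model, as the reviewer suggests, is a complementary check.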

      (2) In the parametric modulation analysis, the evidence for IPS being the only region showing stronger fovea vs peripheral beta values was weak, especially given the exploratory nature of this analysis. The raw beta value can reflect other things, such as global brain fluctuations or signal-to-noise ratio. I would also want to see the results of the same analysis performed on the control condition decoding results.

      We appreciate the reviewer’s suggestion and repeated the same parametric modulation analysis on the control condition to assess the influence of potential confounds on the overall beta values (Supplementary Figure 2). The results show a negative association between foveal decoding and FEF and IPS (likely because eye movements in the control condition lead to less foveal presentation of the stimulus) and a positive association with LO. Peripheral decoding was not associated with significant changes in any of the ROIs, indicating that global brain fluctuations alone are not responsible for the effects reported in the experimental condition. The results of this analysis thus show a specific positive association of IPS activity with the experimental condition, not the control condition, which is in line with the idea that the foveal feedback effect reported in this study may be related to saccade preparation.

      (3) Many of the claims feel overstated. There is an emphasis throughout the manuscript (including claims in the abstract) that these findings demonstrate foveal prediction, specifically that "image-specific feedback during saccade preparation must underlie this effect." To my understanding, one of the key aspects of the foveal prediction phenomenon that ties it closely to trans-saccadic stability is its specificity to the saccade target but not to other objects in the environment. However, it is not clear to what degree the observed findings are specific to saccade preparation and the peripheral saccade target. Should the observers be asked to make a saccade to another fixation location, or simply maintain passive fixation, will foveal retinotopic cortex similarly contain the object's identity information? Without these control conditions, the results are consistent with foveal prediction, but do not definitively demonstrate that as the cause, so claims need to be toned down.

      We fully agree with the reviewer and toned down claims about foveal prediction. We engage with the questions raised by the reviewer more thoroughly in the new discussion section “Foveal feedback during saccade preparation”.

      In addition, we agree that another condition in which subjects make a saccade towards a different location would have been a great addition that we also considered, but due to concerns with statistical power did not add. While including such a condition exceeds the scope of the current study, we included this limitation in the Discussion section (pg 10 ln 316) and hope that future studies will address this question.

      (4) Another critical aspect is the temporal locus of the feedback signal. In the paradigm, the authors ensured that the saccade target object was never foveated via the gaze-contingent procedure and a conservative data exclusion criterion, thus enabling the test of feedback signals to foveal retinotopic cortex. However, due to the temporal sluggishness of fMRI BOLD signals, it is unclear when the feedback signal arrives at the foveal retinotopic cortex. In other words, it is possible that the feedback signal arrives after the eyes land at the saccade target location. This possibility is also bolstered by Chambers et al. (2013)'s TMS study, where they found that TMS to the foveal cortex at 350-400 ms SOA interrupts the peripheral discrimination task. The authors should qualify their claims of the results occurring "during saccade preparation" (e.g., pg 1 ln 22) throughout the manuscript, and discuss the importance of temporal dynamics of the effect in supporting stability across saccades.

      We fully agree that the sluggishness of the fMRI signal presents an important challenge in investigating foveal feedback. We have now included this limitation in the discussion (pg 10 ln 306-318). We also clarify that our argument connects to previous studies investigating the temporal dynamics of foveal feedback using similar tasks (pg 10 ln 313-316). Specifically, in their psychophysical work, Kroell and Rolfs (2022, 2025) showed that foveal feedback occurs before saccade execution, with a peak around 80 ms before the eye movement.

      (5) Relatedly, the claims that result in this paradigm reflect "activity exclusively related to predictive feedback" and "must originate from predictive rather than direct visual processes" (e.g., lines 60-65 and throughout) need to be toned down. The experimental design nicely rules out direct visual foveal stimulation, but predictive feedback is not the only alternative to that. The activation could also reflect mental imagery, visual working memory, attention, etc. Importantly, the experiment uses a block design, where the same exact image is presented multiple times over the block, and the activation is taken for the block as a whole. Thus, while at no point was the image presented at the fovea, there could still be more going on than temporally-specific and saccade-specific predictive feedback.

      We agree that those claims could have misled the reader. Our intention was to state that the activation originates from feedback rather than direct foveal stimulation because of the nature of the design. We have now clarified these statements (pg 2 ln 65) and also included a discussion of other effects including imagery and working memory in the limitations section (pg 10 ln 306-313).

      (6) The authors should avoid using the terms foveal feedback and foveal prediction interchangeably. To me, foveal feedback refers to the findings of Williams et al. (2008), where participants maintained passive fixation and discriminated objects in the periphery (see also Fan et al., 2016), whereas foveal prediction refers to the neural mechanism hypothesized by Kroell & Rolfs (2022), occurring before a saccade to the target object and contains task irrelevant feature information.

      We agree, and we have now adopted a clearer distinction between these terms, referring to foveal prediction only when discussing the distinct predictive nature of the effect discovered by Kroell and Rolfs (2022). Otherwise we referred to this effect as foveal feedback.

      (7) More broadly, the treatment of how foveal prediction relates to saccadic remapping is overly simplistic. The authors seem to be taking the perspective that remapping is an attentional phenomenon marked by remapping of only attentional/spatial pointers, but this is not the classic or widely accepted definition of remapping. Within the field of saccadic remapping, it is an ongoing debate whether (/how/where/when) information about stimulus content is remapped alongside spatial location (and also whether the attentional pointer concept is even neurophysiologically viable). This relationship between saccadic remapping and foveal prediction needs clarification and deeper treatment, in both the introduction and discussion.

      We thank the reviewer for their remarks. We reformulated the discussion section on “Saccadic remapping or foveal prediction” to include the nuances about spatial and feature remapping laid out in the reviewer’s comment (pg 8-9 ln 241-269). We also put a stronger focus on the special role the fovea seems to be playing regarding the feedback of visual features (pg 8-9 ln 265-269).

      (8) As part of this enhanced discussion, the findings should be better integrated with prior studies. E.g., there is some evidence for predictive remapping inducing integration of non-spatial features (some by the authors themselves; Harrison et al., 2013; Szinte et al., 2015). How do these findings relate to the observed results? Can the results simply be a special case of non-spatial feature integration between the currently attended and remapped location (fovea)? How are the results different from neurophysiological evidence for facilitation of the saccade target object's feature across the visual field (Burrows et al., 2014)? How might the results be reconciled with a prior fMRI study that failed to find decoding of stimulus content in remapped responses (Lescroart et al., 2016)? Might this reflect a difference between peripheral-to-peripheral vs peripheral-to-foveal remapping? A recent study by Chiu & Golomb (2025) provided supporting evidence for peripheral-to-fovea remapping (but not peripheral-to-peripheral remapping) of object-location binding (though in the post-saccadic time window), and suggested foveal prediction as the underlying mechanism.

      We thank the reviewer for raising these intriguing questions. We now address them in the revised discussion. We argue that the findings by Harrison et al. (2013) and Szinte et al. (2015) of presaccadic integration of features across two peripheral locations can be explained by presaccadic updating of spatial attention pointers rather than remapping of feature information (pg 8 ln 248-253). The lack of evidence for periphery-to-periphery remapping (Lescroart et al., 2016) and the recent study by Chiu & Golomb (2025) showing object-location binding from periphery to fovea align well with our characterization of foveal processing as unique in predicting feature information of upcoming stimuli (pg 8-9 ln 265-269). Finally, we argue that the global (i.e., space-invariant) selection of task-irrelevant saccadic target features (Burrows et al., 2014) is well-established at the neural level, but does not suffice to explain the spatially specific nature of foveal prediction (pg 8 ln 220-224). We now include these studies in the revised discussion section.

      Reviewer #3 (Public review):

      Summary:

      In this paper, the authors used fMRI to determine whether peripherally viewed objects could be decoded from the foveal cortex, even when the objects themselves were never viewed foveally. Specifically, they investigated whether pre-saccadic target attributes (shape, semantic category) could be decoded from the foveal cortex. They found that object shape, but not semantic category, could be decoded, providing evidence that foveal feedback relies on low-mid-level information. The authors claim that this provides evidence for a mechanism underlying visual stability and object recognition across saccades.

      Strengths:

      I think this is another nice demonstration that peripheral information can be decoded from / is processed in the foveal cortex - the methods seem appropriate, and the experiments and analyses are carefully conducted, and the main results seem convincing. The paper itself was very clear and well-written.

      We thank the reviewer for this positive evaluation of our work. As discussed in our response to Reviewer 1, we now elaborate on the differences between previous work showing decoding of peripheral information from foveal cortex and the effect shown here. While there are important similarities between these findings, foveal prediction in our study occurs in a saccade condition and in the absence of a task that is specific to stimulus features.

      Weaknesses:

      There are a couple of reasons why I think the main theoretical conclusions drawn from the study might not be supported, and why a more thorough investigation might be needed to draw these conclusions.

      (1) The authors used a blocked design, with each object being shown repeatedly in the same block. This meant that the stimulus was entirely predictable on each block, which weakens the authors' claims about this being a predictive mechanism that facilitates object recognition - if the stimulus is 100% predictable, there is no aspect of recognition or discrimination actually being tested. I think to strengthen these claims, an experiment would need to have unpredictable stimuli, and potentially combine behavioural reports with decoding to see whether this mechanism can be linked to facilitating object recognition across saccades.

      We appreciate the reviewer’s point and would like to highlight that it was not our intention to claim a behavioral effect on object recognition. We believe that an ambiguous formulation in the original abstract may have been interpreted this way, and we thus removed this reference. We also speculated in our Discussion that a potential reason for foveal prediction could be a head start in peripheral object recognition, and in the revised manuscript we more clearly highlight that this is a potential future direction only.

      (2)  Given that foveal feedback has been found in previous studies that don't incorporate saccades, how is this a mechanism that might specifically contribute to stability across saccades, rather than just being a general mechanism that aids the processing/discrimination of peripherally-viewed stimuli? I don't think this paper addresses this point, which would seem to be crucial to differentiate the results from those of previous studies.

      We fully agree that this point had not been sufficiently addressed in the previous version of the manuscript. As described in our responses to similar comments from reviewers 1 and 2, we included an additional section in the Discussion (“Foveal feedback during saccade preparation”) to more clearly delineate the present study from previous findings of foveal feedback. Previous studies (Williams et al., 2008) only found foveal feedback during narrow discrimination tasks related to spatial features of the target stimulus, not during color-discrimination or fixation-only tasks, concluding that the observed effect must be related to the discrimination behavior. In contrast, we found foveal feedback (as evidenced by decoding of target features) during a saccade condition that was independent of the target features, suggesting a different role of foveal feedback than hypothesized by Williams et al. (2008).

      Recommendations for the authors:  

      Reviewer #2 (Recommendations for the authors):

      (A) Minor comments:

      (1)  The task should be clarified earlier in the manuscript.

      We now characterise the task in the abstract and clarified its description in the third paragraph, right after introducing the main literature.

      (2) Is there actually only 0.5 seconds between saccades? This feels very short/rushed.

      The inter-trial interval was 0.5 seconds, though effectively it varied because the target only appeared once participants fixated on the fixation dot. Note that this pacing is slower than the rate of saccades in natural vision (about 3 to 4 saccades per second). Participants did not report this paradigm as rushed.

      (3) Typo on pg2 ln64 (whooe).

      Fixed.

      (4)  Can the authors also show individual data points for Figures 3 and 4?

      We added individual data points for Figures 4 and S2.

      (5) The MNI coordinates on Figure 4A seem to be incorrect.

      We took out those coordinates.

      (6) Pg4 ln126 and pg6 ln194, why cite Williams et al. (2008)?

      We included this reference here to acknowledge that Williams et al. raised the same issues. We added a “cf.” before this reference to clarify this.

      (7) Pg7 ln207 Fabius et al. (2020) showed slow post-saccadic feature remapping, rather than predictive remapping of spatial attention.

      We have corrected this mistake.

      (8) The OSF link is valid, but I couldn't find a pre-registration.

      The issue with the OSF link has been resolved. The pre-registration had been set up but not published. We now published it without changing the original pre-registration (see the screenshot attached).

      (9) I couldn't access the OpenNeuro repository.

      The issue with the OpenNeuro link has been resolved.

      (B) Additional references you may wish to include:

      (1) Burrows, B. E., Zirnsak, M., Akhlaghpour, H., Wang, M., & Moore, T.  (2014). Global selection of saccadic target features by neurons in area v4. Journal of Neuroscience.

      (2) Chambers, C. D., Allen, C. P., Maizey, L., & Williams, M. A. (2013). Is delayed foveal feedback critical for extra-foveal perception?. Cortex.

      (3) Chiu, T. Y., & Golomb, J. D. (2025). The influence of saccade target status on the reference frame of object-location binding. Journal of Experimental Psychology. General.

      (4) Harrison, W. J., Retell, J. D., Remington, R. W., & Mattingley, J. B. (2013). Visual crowding at a distance during predictive remapping. Current Biology.

      (5) Lescroart, M. D., Kanwisher, N., & Golomb, J. D. (2016). No evidence for automatic remapping of stimulus features or location found with fMRI. Frontiers in Systems Neuroscience.

      (6) Moran, C., Johnson, P. A., Hogendoorn, H., & Landau, A. N. (2025). The representation of stimulus features during stable fixation and active vision. Journal of Neuroscience.

      (7) Szinte, M., Jonikaitis, D., Rolfs, M., Cavanagh, P., & Deubel, H. (2016). Presaccadic motion integration between current and future retinotopic locations of attended objects. Journal of Neurophysiology.

      We thank the reviewer for pointing out these references. We have included them in the revised version of the manuscript.

      Reviewer #3 (Recommendations for the authors):

      I just have a few minor points where I think some clarifications could be made.

      (1) Line 64 - "whooe" should be "whoose" I think.

      Fixed.

      (2) Around line 53 - you might consider citing this review on foveal feedback - https://doi.org/10.1167/jov.20.12.2

      We included the reference (pg 2 ln 55).

      (3) Line 129 - you mention a u-shaped relationship for decoding - I wasn't quite sure of the significance/relevance of this relationship - it would be helpful to expand on this / clarify what this means.

      We have expanded this section and added statistical tests of the u-shaped relationship in decoding using a weighted quadratic regression. We found significant positive curvature in all early visual areas between fovea and periphery (V1: t(27) = 3.98, p = 0.008; V2: t(27) = 3.03, p = 0.02; V3: t(27) = 2.776, p = 0.025). These findings support a u-shaped relationship. We now report these results in the revised manuscript (pg 5 ln 137-138).

      (4) Figure 1 - it would be helpful to indicate how long the target was viewed in the "stim on" panels - I assume it was for the saccade latency, but it would be good to include those values in the main text.

      We included that detail in the text (pg 3 ln 96-97).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2025-03206

      Corresponding author(s): Teresa M. Przytycka

      General Statements

      We thank all the reviewers for their time and their constructive criticism, based on which we have revised our manuscript. All review comments are in italics. Our responses are indicated in normal font, except the excerpts from the manuscript, which are shown within double quotes and in italics. The line numbers indicated here refer to those in the revised manuscript.

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This paper addresses the interesting question of how cell size may scale with organ size in different tissues. The approach is to mine data from the fly single cell atlas (FCA) which despite its name is a database of gene expression levels in single isolated nuclei. Using this data, they infer cell size based on ribosomal protein gene expression, and based on this approach infer that there are tissue and sex specific differences in scaling, some of which may be driven by differences in ribosomal protein gene expression.

      Response: Indeed, using the FCA dataset, we infer sex-specific differences in both cell size and cell number, which we validated with targeted experiments. We show that Drosophila cell types scale through distinct strategies (via cell size, cell number, or a mix of both) in an allometric rather than uniform fashion. We further propose that these scaling differences are driven, at least in part, by variation in translational activity, reflected in the expression of ribosomal proteins, translation elongation factors, and Myc.

      -----------------------------------------------------------

      I think the idea of mining this database is a clever one; however, there are a number of concerns about whether the existing data can really be used to draw the conclusions that are stated.

      Response: We are pleased to see that the reviewer found the question and our approach interesting.

      -----------------------------------------------------------

      One concern has to do with the assumption that RP (ribosome protein) expression is a proxy for cell size. It is well established that ribosome abundance scales with cell size, but is there reason to believe that ribosome nuclear gene EXPRESSION correlates with ribosome abundance?

      I'm not saying that this can't be true, but it seems like a big assumption that needs to be justified with some data. Maybe this is well known in the Drosophila literature, but in that case the relevant literature really needs to be cited.

      Response: To avoid any misunderstanding: we use sex-biased RP expression as an indicator of sex differences in cell size only within the same cell type or subtype, as defined by expression-based clustering in the FCA, not as a general estimator of cell size. This measure is applied strictly within the same clusters, never between different ones. To prevent overinterpretation, we replaced the term 'proxy' with 'indicator,' since the earlier wording might have implied that ribosomal gene expression was being used to estimate cell size more broadly.
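
      To make the within-cluster restriction concrete, the following sketch computes a female-versus-male log2 ratio of mean RP expression separately for each cluster and never compares values across clusters. The function name, the pseudocount, the cluster labels, and the expression values are hypothetical illustrations; this is not the authors' analysis code or FCA data.

```python
from collections import defaultdict
from math import log2


def rp_sex_bias(nuclei, pseudocount=1.0):
    """Return log2(female / male) mean RP expression for each cluster.

    `nuclei` is an iterable of (cluster, sex, rp_expression) tuples with
    sex given as "F" or "M". Clusters lacking one sex are skipped, because
    the indicator is only defined as a within-cluster contrast.
    """
    totals = defaultdict(lambda: {"F": [0.0, 0], "M": [0.0, 0]})
    for cluster, sex, rp_expr in nuclei:
        acc = totals[cluster][sex]
        acc[0] += rp_expr  # running sum of expression
        acc[1] += 1        # number of nuclei
    bias = {}
    for cluster, by_sex in totals.items():
        if by_sex["F"][1] == 0 or by_sex["M"][1] == 0:
            continue
        f_mean = by_sex["F"][0] / by_sex["F"][1]
        m_mean = by_sex["M"][0] / by_sex["M"][1]
        bias[cluster] = log2((f_mean + pseudocount) / (m_mean + pseudocount))
    return bias


# Hypothetical per-nucleus RP expression values
nuclei = [
    ("fat_body", "F", 30.0), ("fat_body", "F", 32.0), ("fat_body", "M", 15.0),
    ("muscle", "F", 20.0), ("muscle", "M", 20.0),
    ("male_only_cluster", "M", 10.0),
]

bias = rp_sex_bias(nuclei)
print(bias)
```

      A positive value indicates female-biased RP expression within that cluster. The magnitude says nothing about how cell size compares between clusters, which is the distinction the response draws.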

      We should have begun by providing more background on the well-established link between ribosomal protein gene dosage and cell growth. This context was missing from the introduction, so we have now added a full paragraph outlining what is known about this connection:

      Added at line 85:

      "Cell growth, which supports both cell enlargement and cell division, demands elevated protein synthesis, accomplished by boosting translation rates. Indeed, ribosome abundance is known to scale with cell size in many organisms (Schmoller and Skotheim 2015; Cadart and Heald 2022; Serbanescu et al. 2022). Long before it was known that DNA was the carrier of genetic information, Drosophila researchers had identified a large class of mutations known as "Minutes" (Schultz 1929). These were universally haplo-insufficient. A single wild type copy resulted in a tiny slowly growing fly, and the homozygous loss-of-function alleles were lethal. In clones, the Minute cells are clearly smaller and compete poorly with surrounding wild type cells. We now know that most of the Minute loci encode ribosomal proteins (Marygold et al. 2007). Similarly, the Drosophila diminutive locus, also characterized by small flies almost a century ago, is now known to encode the Myc oncogene (Gallant 2013). This is significant as Myc is a regulator of ribosomal protein encoding genes in metazoans, including Drosophila (Grewal et al. 2005). The ribosome is assembled in a specialized nuclear structure called the nucleolus (Ponti 2025). Across species, including Drosophila (Diegmiller et al. 2021) and C. elegans (Ma et al. 2018), nucleolar size scales with cell size and is broadly correlated with growth in cell size and/or cell number, processes that are directly relevant to sex-specific allometry. Collectively, these and many other studies offer compelling evidence that ribosomal biogenesis is positively associated with cell size and growth, underscoring the value of measuring ribosome biogenesis as a metric."

      We understand that the reviewer is asking whether reduced RP mRNA expression directly leads to reduced functional ribosome assembly. We do not have a definitive answer to that specific question. However, we directly measured translation in fat body cells (section: Female bias in ribosomal gene expression in fat body cells leads to sex-biased protein synthesis), and the results show a clear correlation between RP gene expression and biosynthetic activity, although we did not track every step from transcription to ribosome assembly to polysome loading across all cell types. Tracking those steps, for example by polysome profiling and related assays, would be an excellent direction for future work. Importantly, we did examine the nucleolus (Figure 4), where ribosome assembly occurs, and showed that nucleolar volume scales with RP gene expression. This strongly supports the presence of sex-specific differences in ribosome biogenesis.

      Added at line 115:

      "Building on the earlier studies noted above, as well as our direct measurements of translation bias in the fat body, nucleolar size, and cell size, we used sex-biased expression of ribosomal proteins as an indicator of sex differences in per-nucleus cell size."

      -----------------------------------------------------------

      Second, the interpretation of RP expression as a proxy for cell size seems potentially at odds with the fact that some cells are multi-nucleate. Those cells are big because of multiple nuclei, and so they might not show any increase in ribosome expression per nucleus. Presumably for multi-nucleate cells, RP expression, if it reflects anything at all, would be something to do with cell size PER nucleus.

      Response: Yes, this is a very important point, and it is why we chose multinucleated indirect flight muscles for our direct experimental analysis. We show that in indirect flight muscle cells, adult cell size is greatly influenced by the sex-specific number of nuclei per cell. Female muscle cells are larger and have more nuclei per cell. They also show higher expression of ribosomal protein-coding genes. Because the expression data come from the single-nucleus sequencing atlas, this already demonstrates what the reviewer is asking for: per nucleus, female muscle cells express more ribosomal protein-coding mRNAs.

      -----------------------------------------------------------

      *Third, it is well known that many tissues in Drosophila are polyploid or polytene. I don't know enough about the methodology used to produce the FCA to know whether this is somehow normalized. Otherwise, my hypothesis would be that nuclei showing higher RP expression might just be polyploid or polytene. You might say that this could be controlled by asking if all genes are similarly upregulated, but that isn't the case since at least in polytene chromosomes it is well known that only a small number of genes are expressed at a given time, while many are silent.*

      Response: Yes, this is an excellent point. As noted above, our study does not distinguish among the different potential causes of sex differences in ribosomal mRNA copy number, as these may vary across cell types. We now explicitly acknowledge this in the discussion (line 327). Importantly, even in cases where ribosomal gene expression bias primarily reflects differences in DNA content, this still represents a plausible mechanistic route linking ribosomal gene expression to increased nucleolar ribosome biogenesis and, ultimately, larger cell size. This possibility does not alter our main conclusions.

      -----------------------------------------------------------

      Overall, I think a lot more foundational work would need to be done in order to allow the inference of cell size from RP expression. In a way, it is a bit unfortunate that they chose to do this work in Drosophila where so many cells are polyploid, although I gather that even in humans some tissues have this issue, for example large neurons in the brain.

      Response: We acknowledge that we did not clearly reference some of the foundational work in the literature. To address this, we have expanded the introduction to provide additional background and context. We also clarify that our fat body experiment offers independent support for the relationship between ribosomal gene expression bias, nuclear size bias, and corresponding biases in protein synthesis, thereby reinforcing the use of sex-specific ribosomal gene expression as an indicator of sex-specific cell size. Importantly, we assess this bias only within clusters, not between them. These clusters are derived from gene-expression-based clustering and are therefore relatively homogeneous. For example, as discussed in our response to Reviewer #3, the fat body contains several clusters that correspond to expression-defined subtypes of fat body cells. Our previous terminology may have inadvertently implied that we were using ribosomal gene expression to estimate cell size more broadly, which was not our intention.

      As for the choice of organism, most of the authors are Drosophila researchers, and we benefit from the unique, highly replicated data from the whole head and whole body of both sexes. Such data are necessary for an unbiased estimate of the differences in nuclear number.

      -----------------------------------------------------------

      *Reviewer #1 (Significance (Required)):

      The idea that gene regulatory networks could "program" differences in scaling by changing levels of ribosomal protein gene expression is a tremendously important one if it can be established, because it would show a simple way for size scaling to be placed under control of developmental regulatory pathways. My original concern when I first looked at the abstract was going to be that yeah the results are interesting but a mechanism is not provided, but as I read it, that concern went away. Showing that RP gene expression, which could be programmed by various driving pathways, can affect allometric scaling, would be extremely impactful and really change how we think about scaling, by putting it into the framework of gene expression networks that control other aspects of development. It would not be necessary to show which pathways actually drive these expression differences; the fact that they are different would be interesting enough to make everyone want to read this paper. But as discussed above I am not, however, convinced by the evidence presented here. So while I think it would be very significant if true, I am not convinced that the conclusion is well supported. This doesn't mean I have a reason to think it is false, just that it's not well supported for the reasons I have given.*

      Response: We are grateful to the reviewer for this positive assessment of our findings despite lack of a specific mechanism. We also regret that our initial writing did not clearly situate our work within the foundational literature on the relationship between ribosomal biogenesis and scaling. The key contribution of our study is to demonstrate that sex-biased ribosomal biogenesis plays a role in allometric scaling, providing a basis for future mechanistic exploration. We hope that the revised manuscript now offers clear and compelling support for the conclusion that RP gene expression bias can influence allometric scaling.

      -----------------------------------------------------------

      I hasten to point out that I could be entirely wrong if the missing bits of logic (i.e. that RP expression matches ribosome abundance and that gene expression in the FCA dataset isn't influenced by ploidy of the nucleus) can be supplied. If suitable references can be provided to support these underlying assumptions, then in fact I think these concerns could be answered with very little effort. Otherwise, I think experiments would be needed to support these assumptions, and that might be non-trivial to do in a reasonable time frame. For that reason, in the next question I have put "cannot tell" for the time estimate.

      Response: While gene expression in some FCA cell types may indeed be influenced by ploidy, our analysis does not depend on distinguishing among the possible sources of gene expression bias, which may vary across cell types. Rather, our key point is that, regardless of its origin, an increase in ribosomal gene expression is associated with enhanced ribosome biogenesis in the nucleolus and, ultimately, larger cell size. Thus, our main conclusions do not rely on any specific mechanism underlying RP gene expression upregulation. We now include additional references supporting the relationship between RP expression bias and cell size bias. We also strengthen the link between ribosomal gene expression and biosynthetic activity by clarifying its relationship with sex-biased Myc expression and the strong correlation with expression bias of eEF1.

      We thank the reviewer for their thoughtful and constructive comments, which have prompted us to clarify both our reasoning and the relevant literature more fully.

      -----------------------------------------------------------

      *Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The authors analyzed the FlyAtlas single-nucleus dataset to identify sex differences in gene expression and cell numbers. This led them to focus on muscles, cardiomyocytes, and fat body cells. They then measured cell and nucleolus size across different tissues and showed that reducing Myc function decreases sex differences in fat body cells. Overall, the manuscript provides a characterization of dimorphic differences in cell and organ size across three tissues.*

      Response: This is a nice synopsis of the work.

      -----------------------------------------------------------

      Major Comments: The major claims of the manuscript are well supported by the reported experiments and analyses. Statistical analyses appear adequate.

      Response: We agree, and we are glad that the reviewer found our work well supported.

      -----------------------------------------------------------

      *Minor Comments: The following minor issues should be addressed through textual edits: In the Introduction:

      "Disruptions in proportionality, whether due to undergrowth or overgrowth, can lead to reduced fitness or diseases such as cancer." Could the authors provide a reference for this statement, particularly for the claim that disruptions in proportion*

      Response: We apologize for this omission. The following explanation is now included starting at line 39:

      "For example, scaled cell growth is a driver of symmetry in Myc-dependent scaling of bone growth in the skeleton by chondrocyte proliferation (Ota et al. 2007; Zhou et al. 2011). Increased nucleolus size is a well known marker of cancer progression in a histopathological setting (Pianese 1896; Derenzini et al. 1998; Elhamamsy et al. 2022)."

      -----------------------------------------------------------

      *The authors state:

      "This study offers a comprehensive, cellular-resolution analysis of sexual size dimorphism in a model organism, uncovering how differences in cell number and size contribute to sex-specific body plans."*

      The study cannot be considered comprehensive, as not all organs were examined.

      Response: Indeed, "comprehensive" is a loaded word, and we have removed it from the revised manuscript.

      -----------------------------------------------------------

      *The following sentence from the abstract is unclear:

      "By uncovering how a conserved developmental system produces sex-specific proportions through distinct cellular strategies..."*

      * What do the authors mean by a conserved developmental system? Do they refer to a commonly used developmental model, or to a developmental system that is evolutionarily conserved?*

      Response: We acknowledge that the use of the word 'conserved' was inappropriate, and we have therefore removed it from the statement.

      -----------------------------------------------------------

      *Reviewer #2 (Significance (Required)):

      The manuscript presents a relevant exploration of sex-specific differences in cell size and cell number in Drosophila males and females. The limitations of the study are clearly acknowledged in the "Limitations" section. The work does not provide mechanistic insight into the causes or functional consequences of the observed differences. Nonetheless, the study extends our understanding of sexual dimorphism and establishes a foundation for future investigations into the autonomous and systemic mechanistic factors that regulate these differences.*

      Response: Thank you.

      -----------------------------------------------------------

      *Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The manuscript by Pal and colleagues addresses an important question: the cellular mechanisms underlying sex differences in organ size. By leveraging single-nucleus transcriptomic data from the adult Drosophila Cell Atlas, the authors show that different cell types adopt distinct strategies to achieve sex differences in organ size, either by increasing cell size or by altering cell number. They then focus on three organs (the indirect flight muscles, the heart, and the fat body) and provide supporting evidence for their transcriptomic analyses.*

      Response: This is a nice summary of the study. Thank you.

      -----------------------------------------------------------

      This study tackles a highly relevant and often overlooked question, as our understanding of the molecular and cellular events driving sex differences remains incomplete. The work presents interesting observations; however, it is largely descriptive, establishing correlations without providing functional evidence or mechanistic insight.

      Response: We agree that this is an often overlooked problem that has been difficult to address experimentally without single-cell genomics. Our work aims to help fill this gap. While the paper does contain descriptive elements, we believe such characterization is important at the early stages of developing a new area of inquiry. The study explores a unique dataset and includes experimental validation to support key observations. We also propose how allometry may be shaped by cell division and cell size, drawing on well-established molecular mechanisms. Thus, the reviewer's comment regarding a lack of mechanistic insight likely pertains to the absence of a direct connection to the sex-determination pathway, which is beyond the scope of the current study.

      -----------------------------------------------------------

      Below are four main points that should be addressed before publication:

      1. Introduction and contextualisation of prior work

      The introduction does not adequately present the current state of knowledge. Several key studies are missing or insufficiently discussed. In particular, the following works should be included and integrated into the introduction:

      - PMID: 26710087 - shows that the sex determination gene transformer regulates male-female differences in Drosophila body size.
      - PMID: 28064166 - describes how differences in Myc gene dosage contribute to sex differences in body size.
      - PMID: 26887495 - demonstrates that the intrinsic sexual identity of adult stem cells can control sex-biased organ size through sex-biased proliferation.
      - PMID: 28976974 - reveals that Sxl modulates body growth through both tissue-autonomous and non-autonomous mechanisms.
      - PMID: 39138201 - shows that transformer drives sex differences in organ size and body weight.

      Incorporating and discussing these references would provide a more comprehensive and up-to-date framework for the study.

      Response: We agree that the literature suggested by the reviewer strengthens the introduction and improves the contextualization of prior work relevant to our study. Although much of it was previously included in the discussion section on cell-autonomous and hormonal regulation, it has now been moved to the introduction, along with the discussion of the papers suggested by the reviewer (beginning at line 58).

      "In Drosophila melanogaster, adult females are substantially larger than males (Fig. 1A1), yet both sexes develop from genetically similar zygotes and share most organs and cell types. In wild type flies, sex is determined by the number of X chromosomes in embryos, with XX flies developing as females and X(Y) flies developing as males due to the activation and stable expression of Sex-lethal only in XX flies (Erickson and Quintero 2007). While it is not entirely clear how sexually dimorphic size is regulated, the sex determination pathway is implicated in size regulation. Sex-reversed flies often show a size based on the X chromosome number rather than sexual morphology. Female Sex-lethal contributes to larger female size independently of sexual identity (Cline 1984), and Sex-lethal expression in insulin producing neurons in the brain also impacts body size (Sawala and Gould 2017). Female-specific Transformer protein is produced as a consequence of female-specific Sex-lethal and also contributes to increased female size (Rideout et al. 2015). This size scaling also applies to individual organs. For example, the Drosophila female gut is longer than the male gut due Transformer activity (Hudry et al. 2016). It has also been suggested that Myc dose (it is X-linked) is a regulator of body size (Mathews et al. 2017), although the failed dosage compensation model proposed has not been demonstrated."

      And again at line 74:

      "These studies show that size is regulated, but they do not address whether scaling is uniform or non-uniform and the mechanism for sexual size differences (SSD). The origins of SSD can, in principle, arise from differences in (i) gene expression, (ii) the presence of sex-specific cell types, (iii) the number of cell-specific nuclei, or (iv) the size (per nucleus) of those cells. Previous research in Drosophila has largely focused on gene expression in sex-specific organs like the gonads (Arbeitman et al. 2002; Parisi et al. 2004; Graveley et al. 2011; Pal et al. 2023), which are governed by a well-characterized sex-determination pathway (Salz and Erickson 2010; Clough and Oliver 2012; Raz et al. 2023) However, whether and how scaling differences in shared, non-sex-specific tissues are achieved via changes in cell size and number remains largely unexamined (Fig. 1A2). These studies show that size is regulated, but they do not address whether scaling is uniform or non-uniform and the mechanism for size differences."

      -----------------------------------------------------------

      2. Use of ribosomal gene expression as a proxy for cell size

      The authors use ribosomal gene expression levels as a proxy for cell size, but this assumption is not adequately justified. The cited references (refs. 20-22) focus on unicellular organisms (bacteria and yeast) or cleavage divisions in frog embryos, which are fundamentally different from adult Drosophila tissues. The authors should provide evidence that ribosome abundance scales with cell size across the distinct adult Drosophila cell types. Given that most adult fly tissues are post-mitotic, it is more likely that ribosomal gene expression reflects protein synthesis activity rather than cell size, particularly in secretory cell types.

      Response: Reviewer 1 raised a similar point, and we agree. We recognize that the term "proxy" may have been misleading. We use this measure only in the context of sex bias within homogeneous cell clusters, and not between clusters, even when such clusters share the same cell-type annotation. To avoid overinterpretation, we changed "proxy" to "indicator".

      In response to the reviewer's concern, we have expanded our discussion of the relevant supporting literature (additional text starting line 75). We have also directly measured translation in the fat body cells (section: Female bias in ribosomal gene expression in fat body cells leads to sex-biased protein synthesis), which clearly demonstrates a correlation between ribosomal protein gene expression and biosynthetic activity. Although we have not traced the chain of events from expression to ribosome assembly to polysome loading in all cell types, we did examine the nucleolus (Figure 4), where ribosomes are assembled, and we show that nucleolar volume scales with ribosomal protein gene expression. This provides strong evidence for sex-specific ribosome biogenesis contributing to cell size.

      Furthermore, the observation that ribosomal gene expression likely reflects protein synthesis activity is not at odds with increased cell size: biosynthesis increases in larger cells (Schmoller and Skotheim 2015). We have added a panel to Figure 4 showing the relationship between ribosomal gene expression bias and the average expression bias of Eukaryotic Elongation Factor 1 (eEF1).
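
      To make the within-cluster comparison concrete, the following minimal Python sketch computes a per-cluster sex bias as the log2 ratio of mean female to mean male RP expression. It is an illustration under stated assumptions, not the pipeline used in the study: the function name `rp_sex_bias`, the `(cluster, sex, rp_expression)` tuple layout, the `"F"`/`"M"` labels, and the pseudocount are all hypothetical.

```python
import math
from collections import defaultdict


def rp_sex_bias(nuclei, pseudocount=1e-9):
    """Return {cluster: log2(mean female RP expr / mean male RP expr)}.

    `nuclei` is an iterable of (cluster, sex, rp_expression) tuples, where
    sex is "F" or "M" and rp_expression is a normalized per-nucleus summary
    of ribosomal protein transcripts (hypothetical input layout).
    """
    # Per cluster and sex, accumulate [sum of expression, number of nuclei].
    totals = defaultdict(lambda: {"F": [0.0, 0], "M": [0.0, 0]})
    for cluster, sex, expr in nuclei:
        acc = totals[cluster][sex]
        acc[0] += expr
        acc[1] += 1

    bias = {}
    for cluster, by_sex in totals.items():
        (f_sum, f_n), (m_sum, m_n) = by_sex["F"], by_sex["M"]
        if f_n == 0 or m_n == 0:
            # Sex-specific cluster: no within-cluster comparison is possible.
            continue
        f_mean = f_sum / f_n
        m_mean = m_sum / m_n
        # The pseudocount (an assumption) avoids division by zero.
        bias[cluster] = math.log2((f_mean + pseudocount) / (m_mean + pseudocount))
    return bias
```

      A positive value indicates female-biased RP expression within that cluster. Because each ratio is formed inside a single cluster, the values describe sex bias only and are not meant to rank cell size across clusters.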

      -----------------------------------------------------------

      3. Relationship between Myc and sex-biased Rp expression

      The proposed link between Myc and sex-biased Rp expression is unclear. Panels D and E of Figure 1 show no consistent relationship: some cell types with strong Rp sex bias exhibit either high or low female Myc bias, or even a male bias. The linear regression in Figure 4I (R = 0.07, p = 0.59) confirms the lack of correlation. The authors should clarify this point and adopt a more cautious interpretation regarding Myc as a potential regulator of sex-biased Rp expression and cell size differences. Experimentally, using Myc hypomorph or heterozygous conditions would be more appropriate than complete knockdown to test its role.

      Response: Thank you for noting that the relationship between Myc expression bias and sex-biased RP expression required clarification. This response was prepared in consultation with Myc expert Dr. David Levens.

      We demonstrate that both Myc and RP gene expression exhibit an overall female bias in the body. The absence of a strong correlation across cell clusters does not invalidate this conclusion. Myc is a well-established master regulator of ribosome biogenesis, but its quantitative effects are complex. According to recent models of Myc-mediated gene regulation (Nie et al. 2012; Lin et al. 2012), Myc upregulates all actively transcribed genes. Because this regulation is global, the relationship between changes in Myc expression and corresponding changes in ribosomal protein gene expression depends on cell type. Moreover, Lorenzin et al. (2016) demonstrated that ribosomal protein genes saturate at relatively low levels of Myc, which helps explain why we observe a correlation in head cell clusters, where Myc expression is lower, but not in body clusters.

      Importantly, on average, the female-specific Myc expression bias is stronger in body cell clusters than in head cell clusters, consistent with the stronger female bias in ribosomal protein gene expression observed in the head relative to the body.

      To make this relationship more transparent, we combined the head and body clusters, which yielded a strong overall correlation (Fig. 4J, replacing the previous Fig. 4H).

      To further strengthen the evidence linking ribosomal gene expression to cell size, we also examined the relationship between ribosomal gene expression bias and the expression bias of Eukaryotic Elongation Factor 1 (eEF1), a key component of protein biosynthesis that acts during the elongation step of translation. The resulting correlation exceeds 0.9 (new Fig. 4H, added as an additional panel in Fig. 4).
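
      The cluster-level correlations referred to above can be illustrated with a plain Pearson coefficient over per-cluster bias values, computed separately for each group of clusters (for example head and body) and for the pooled set. This is a minimal sketch under assumptions: `pearson_r`, `correlations_by_group`, and the `(group, x_bias, y_bias)` input layout are hypothetical names, and the choice of Pearson rather than a rank-based statistic is made here for illustration and may not match the statistic used in the figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def correlations_by_group(records):
    """Return {group: r, ..., "pooled": r}.

    `records` is an iterable of (group, x_bias, y_bias) tuples, e.g. the
    per-cluster Myc bias and RP bias with group "head" or "body"
    (hypothetical input layout).
    """
    groups = {}
    for group, x, y in records:
        xs, ys = groups.setdefault(group, ([], []))
        xs.append(x)
        ys.append(y)
    result = {g: pearson_r(xs, ys) for g, (xs, ys) in groups.items()}
    # Pool all clusters regardless of group for the combined coefficient.
    all_x = [x for xs, _ in groups.values() for x in xs]
    all_y = [y for _, ys in groups.values() for y in ys]
    result["pooled"] = pearson_r(all_x, all_y)
    return result
```

      Pooling groups whose mean bias differs can yield a higher coefficient than either group alone, so reporting the per-group and pooled values side by side keeps the comparison transparent.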

      -----------------------------------------------------------

      4. Conclusions about fat body cell number

      I have concerns about drawing conclusions on sex differences in fat body cell number from single-nucleus transcriptomic data for two reasons:

      1- Drosophila fat body tissue is heterogeneous, comprising distinct subpopulations (e.g., visceral fat cells, subcuticular fat cells), some of which are sex-specific, such as fat cells associated with the spermathecae in females.

      Response: Thank you for giving us the opportunity to clarify our analysis of the FCA data. Our approach does account for subpopulations within the fat body as well as within other cell types. Based on gene expression profiles, we identify three fat body clusters, all of which are reported in Table S3. One small female-specific cluster (

      When all fat body clusters are combined into a single supercluster, this supercluster still shows a male bias. We have now clarified this point in the manuscript (line 113). Note that both subclusters of fat body are already shown in Fig. 1C and 1D.

      -----------------------------------------------------------

      2- Adult fat body cells can be multinucleated (PMID: 13723227). Apparent sex differences in nucleus number may reflect differences in specific subpopulations or degrees of multinucleation rather than true differences in cell number. To strengthen the conclusions, the analysis should be performed at the level of fat body subpopulations, distinguishing clusters where possible. Additionally, quantifying nuclei relative to actual cell number-as done for muscle tissue-would clarify whether observed sex differences reflect true variation in cell number or differences in nuclear content per cell.

      Response: Yes, some cells can be multinucleate. We specifically address this in the context of muscle cells, where multinucleation is prominent, and we also conducted experimental validation in this tissue. As noted above, our analysis is performed at the subpopulation level, since clusters are defined by expression similarity (Leiden resolution 4.0) rather than by annotation.

      Because our work relies on single-nucleus data, each nucleus is treated as an individual unit of analysis. Nevertheless, we observe genuine nuclear differences within each cluster. Importantly, the presence of multinucleated cells does not alter our conclusions; it simply represents one form of variation in cell number that can be thought of as a subcomponent of cell/nuclei number.
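
      Treating each nucleus as the unit of analysis, a per-cluster sex bias in nuclear number can be sketched as follows. This is an illustrative sketch only, not the analysis code behind the manuscript: the function name `nuclei_count_bias`, the `(cluster, sex)` input layout, and the normalization by the total number of nuclei sampled per sex are assumptions made for the example.

```python
import math
from collections import Counter


def nuclei_count_bias(nuclei):
    """Return {cluster: log2(female fraction / male fraction)}.

    `nuclei` is an iterable of (cluster, sex) pairs with sex "F" or "M".
    Each fraction is the share of that sex's nuclei assigned to the cluster,
    which corrects for unequal numbers of nuclei sampled per sex
    (an assumed normalization).
    """
    nuclei = list(nuclei)
    per_cluster = Counter(nuclei)                # (cluster, sex) -> count
    per_sex = Counter(sex for _, sex in nuclei)  # sex -> total nuclei

    bias = {}
    for cluster in {c for c, _ in nuclei}:
        f_count = per_cluster[(cluster, "F")]
        m_count = per_cluster[(cluster, "M")]
        if f_count == 0 or m_count == 0:
            # Sex-specific cluster: the ratio is undefined, so skip it.
            continue
        f_frac = f_count / per_sex["F"]
        m_frac = m_count / per_sex["M"]
        bias[cluster] = math.log2(f_frac / m_frac)
    return bias
```

      Under this convention a negative value marks a cluster with relatively more male nuclei. The measure counts nuclei, so on its own it does not separate a difference in cell number from a difference in nuclei per cell.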

      -----------------------------------------------------------

      Minor corrections/points: 1-The term body size in the title does not accurately reflect the content of the paper. I recommend replacing it with organ size to better align with the study's focus.

      Response: Thank you for the suggestion.

      -----------------------------------------------------------

      2-The term sexual size dimorphism is somewhat inaccurate in this context. Sex differences in size would be more appropriate. The term sexual dimorphism typically refers to traits that exhibit two distinct forms in males and females, such as primary or secondary sexual characteristics like sex organs or sex combs. In contrast, size is a quantitative trait that follows a normal distribution. Although the average female may be larger than the average male, the distributions overlap, making the term dimorphism imprecise.

      Response: Thank you for the suggestion.

      -----------------------------------------------------------

      3-In Figure 2E, there appears to be an inconsistency between the text, figure legend, and the data presented. The text and legend state that the total volume of dorsal longitudinal flight muscle cells was quantified, whereas the graph indicates measurements of nuclear size. This discrepancy should be clarified.

      Response: Thank you for pointing this out. The Y-axis label in the graph was incorrect and has now been corrected.

      -----------------------------------------------------------

      4-The authors proposed: "This increased biosynthetic activity in fat body cells may contribute to cell size differences, but also to the regulation of body size via production of factors that mediate body growth via interorgan communication". Please note that this hypothesis has already been tested functionally in PMID: 39138201 and was shown to be incorrect. Sex differences in body size are completely independent of fat body sexual identity or any intrinsic sex differences within fat cells.

      Response: We thank the reviewer for the opportunity to discuss why the data shown in PMID 39138201 (Hérault et al. 2024) do not rule out a model in which the fat body contributes to the sex-specific regulation of body size via interorgan communication. The main reason is that Hérault et al. use wing size as a proxy for body size. This is in contrast to prior studies, such as Rideout et al. (2015), in which pupal volume was used to measure body size directly and to show a non-autonomous effect of the sex determination gene transformer on body size. Pupal volume is a more precise readout of growth during the larval stages of development than adult wing area, which reflects the growth of a single organ. It is also important to note that the diets used to rear flies in Hérault et al. and Rideout et al. differ, which matters because females do not achieve their maximal size without high dietary protein levels (Millington et al. 2021). To ensure that these points are communicated to readers, we added text to this effect in the revised version of our manuscript.

      Added at line 254:

      "This increased biosynthetic activity in fat body cells may contribute to cell size differences, but also to the regulation of body size via production of factors that mediate body growth via interorgan communication (Colombani et al. 2003; Géminard et al. 2009; Rajan and Perrimon 2012; Sano et al. 2015; Koyama and Mirth 2016). Indeed, one study showed the sexual identity of the fat body influenced pupal volume, which is an accurate readout of larval growth (Rideout et al. 2015; Delanoue et al. 2010). While a recent study suggests that male-female differences in body size were regulated independently of fat body sexual identity (Hérault et al. 2024), this study measured the growth of a single organ, the wing, as a proxy for body size. Additional studies are therefore needed to resolve whether fat body protein synthesis plays an important role in regulating sex differences in body size."

      -----------------------------------------------------------

      *5-The authors state: "This demonstrate that Myc plays a key role in regulating the sex difference in nucleolar size." This is an overstatement given the functional data presented. The claim should be toned down to reflect the limited evidence.

      **Referee cross-commenting**

      I completely agree with the main comments of Reviewer 1, as they address the paper's core.*

      Response: We have addressed the comments of Reviewer 1 in our responses to that reviewer's comments above.

      -----------------------------------------------------------

      *Reviewer #3 (Significance (Required)):

      The main novelty and strongest aspect of this study is its use of single-nucleus transcriptomic data from the adult Drosophila Cell Atlas to investigate how different cell types adopt distinct strategies to generate sex differences in organ size, either by increasing cell size or by altering cell number. Previous studies have largely focused on specific tissues, whereas this work provides a comprehensive, organism-wide view that encompasses all tissues, enabling direct cross-comparison between organs. This represents a clear advance in the field, primarily from a technical perspective, by leveraging organism-wide single-cell transcriptomics. The main limitations lie in the lack of functional experiments and mechanistic insights. Moreover, the proposed mechanism (differences in Myc gene dosage or expression levels) is not entirely novel, as Myc dosage has previously been implicated in contributing to sex differences in body size (PMID: 28064166).*

      Response: We do include some functional testing in three tissues (flight muscle, heart, and fat body); however, providing mechanistic insights is beyond the scope of this paper. The paper suggested by the reviewer is one example of an attempt to provide such a mechanism, and probably not the only one. We hope that the rich data we have assembled in this paper will provide a resource for generating hypotheses and will stimulate further research.

      -----------------------------------------------------------

      References

      Cadart, Clotilde, and Rebecca Heald. 2022. "Scaling of Biosynthesis and Metabolism with Cell Size." Molecular Biology of the Cell 33 (9): pe5. https://doi.org/10.1091/mbc.E21-12-0627.

      Diegmiller, Rocky, Caroline A. Doherty, Tomer Stern, Jasmin Imran Alsous, and Stanislav Y. Shvartsman. 2021. "Size Scaling in Collective Cell Growth." Development (Cambridge, England) 148 (18): dev199663. https://doi.org/10.1242/dev.199663.

      Gallant, Peter. 2013. "Myc Function in Drosophila." Cold Spring Harbor Perspectives in Medicine 3 (10): a014324. https://doi.org/10.1101/cshperspect.a014324.

      Grewal, Savraj S., Ling Li, Amir Orian, Robert N. Eisenman, and Bruce A. Edgar. 2005. "Myc-Dependent Regulation of Ribosomal RNA Synthesis during Drosophila Development." Nature Cell Biology 7 (3): 295-302. https://doi.org/10.1038/ncb1223.

      Hérault, Chloé, Thomas Pihl, and Bruno Hudry. 2024. "Cellular Sex throughout the Organism Underlies Somatic Sexual Differentiation." Nature Communications 15 (1): 6925. https://doi.org/10.1038/s41467-024-51228-6.

      Lin, Charles Y., Jakob Lovén, Peter B. Rahl, et al. 2012. "Transcriptional Amplification in Tumor Cells with Elevated C-Myc." Cell 151 (1): 56-67. https://doi.org/10.1016/j.cell.2012.08.026.

      Lorenzin, Francesca, Uwe Benary, Apoorva Baluapuri, et al. 2016. "Different Promoter Affinities Account for Specificity in MYC-Dependent Gene Regulation." eLife 5 (July): e15161. https://doi.org/10.7554/eLife.15161.

      Ma, Tian-Hsiang, Po-Hsiang Chen, Bertrand Chin-Ming Tan, and Szecheng J. Lo. 2018. "Size Scaling of Nucleolus in Caenorhabditis Elegans Embryos." Biomedical Journal 41 (5): 333-36. https://doi.org/10.1016/j.bj.2018.07.003.

      Marygold, Steven J., John Roote, Gunter Reuter, et al. 2007. "The Ribosomal Protein Genes and Minute Loci of Drosophila Melanogaster." Genome Biology 8 (10): R216. https://doi.org/10.1186/gb-2007-8-10-r216.

      Millington, Jason W., George P. Brownrigg, Charlotte Chao, et al. 2021. "Female-Biased Upregulation of Insulin Pathway Activity Mediates the Sex Difference in Drosophila Body Size Plasticity." eLife 10 (January): e58341. https://doi.org/10.7554/eLife.58341.

      Nie, Zuqin, Gangqing Hu, Gang Wei, et al. 2012. "C-Myc Is a Universal Amplifier of Expressed Genes in Lymphocytes and Embryonic Stem Cells." Cell 151 (1): 68-79. https://doi.org/10.1016/j.cell.2012.08.033.

      Ponti, Donatella. 2025. "The Nucleolus: A Central Hub for Ribosome Biogenesis and Cellular Regulatory Signals." International Journal of Molecular Sciences 26 (9): 4174. https://doi.org/10.3390/ijms26094174.

      Rideout, Elizabeth J., Marcus S. Narsaiya, and Savraj S. Grewal. 2015. "The Sex Determination Gene Transformer Regulates Male-Female Differences in Drosophila Body Size." PLOS Genetics 11 (12): e1005683. https://doi.org/10.1371/journal.pgen.1005683.

      Schmoller, Kurt M., and Jan M. Skotheim. 2015. "The Biosynthetic Basis of Cell Size Control." Trends in Cell Biology 25 (12): 793-802. https://doi.org/10.1016/j.tcb.2015.10.006.

      Schultz, J. 1929. "The Minute Reaction in the Development of DROSOPHILA MELANOGASTER." Genetics 14 (4): 366-419. https://doi.org/10.1093/genetics/14.4.366.

      Serbanescu, Diana, Nikola Ojkic, and Shiladitya Banerjee. 2022. "Cellular Resource Allocation Strategies for Cell Size and Shape Control in Bacteria." The FEBS Journal 289 (24): 7891-906. https://doi.org/10.1111/febs.16234.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


      Referee #1

      Evidence, reproducibility and clarity

      This paper addresses the interesting question of how cell size may scale with organ size in different tissues. The approach is to mine data from the fly single cell atlas (FCA), which, despite its name, is a database of gene expression levels in single isolated nuclei. Using these data, the authors infer cell size based on ribosomal protein gene expression, and on this basis infer that there are tissue- and sex-specific differences in scaling, some of which may be driven by differences in ribosomal protein gene expression.

      I think the idea of mining this database is a clever one; however, there are a number of concerns about whether the existing data can really be used to draw the conclusions that are stated.

      One concern has to do with the assumption that RP (ribosomal protein) expression is a proxy for cell size. It is well established that ribosome abundance scales with cell size, but is there reason to believe that ribosome nuclear gene EXPRESSION correlates with ribosome abundance? I'm not saying that this can't be true, but it seems like a big assumption that needs to be justified with some data. Maybe this is well known in the Drosophila literature, but in that case the relevant literature really needs to be cited.

      Second, the interpretation of RP expression as a proxy for cell size seems potentially at odds with the fact that some cells are multi-nucleate. Those cells are big because of multiple nuclei, and so they might not show any increase in ribosome expression per nucleus. Presumably, for multi-nucleate cells, RP expression, if it reflects anything at all, would have something to do with cell size PER nucleus.

      Third, it is well known that many tissues in Drosophila are polyploid or polytene. I don't know enough about the methodology used to produce the FCA to know whether this is somehow normalized. Otherwise, my hypothesis would be that nuclei showing higher RP expression might just be polyploid or polytene. You might say that this could be controlled by asking if all genes are similarly upregulated, but that isn't the case, since at least in polytene chromosomes it is well known that only a small number of genes are expressed at a given time, while many are silent.

      Overall, I think a lot more foundational work would need to be done in order to allow the inference of cell size from RP expression. In a way, it is a bit unfortunate that they chose to do this work in Drosophila where so many cells are polyploid, although I gather that even in humans some tissues have this issue, for example large neurons in the brain.

      Significance

      The idea that gene regulatory networks could "program" differences in scaling by changing levels of ribosomal protein gene expression is a tremendously important one if it can be established, because it would show a simple way for size scaling to be placed under control of developmental regulatory pathways. My original concern when I first looked at the abstract was going to be that yeah, the results are interesting but a mechanism is not provided, but as I read it, that concern went away. Showing that RP gene expression, which could be programmed by various driving pathways, can affect allometric scaling would be extremely impactful and would really change how we think about scaling, by putting it into the framework of gene expression networks that control other aspects of development. It would not be necessary to show which pathways actually drive these expression differences; the fact that they are different would be interesting enough to make everyone want to read this paper. But, as discussed above, I am not convinced by the evidence presented here. So while I think it would be very significant if true, I am not convinced that the conclusion is well supported. This doesn't mean I have a reason to think it is false, just that it's not well supported for the reasons I have given.

      I hasten to point out that I could be entirely wrong if the missing bits of logic (i.e. that RP expression matches ribosome abundance and that gene expression in the FCA dataset isn't influenced by ploidy of the nucleus) can be supplied. If suitable references can be provided to support these underlying assumptions, then in fact I think these concerns could be answered with very little effort. Otherwise, I think experiments would be needed to support these assumptions, and that might be non-trivial to do in a reasonable time frame. For that reason, in the next question I have put "cannot tell" for the time estimate.

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Public Reviews:

      Reviewer #1 (Public Review):

      Summary:

      This research group has consistently performed cutting-edge research aiming to understand the role of hormones in the control of social behaviors, specifically by utilizing the genetically-tractable teleost fish, medaka, and the current work is no exception. The overall claim they make, that estrogens modulate social behaviors in males and females is supported, with important caveats. For one, there is no evidence these estrogens are generated by "neurons" as would be assumed by their main claim that it is NEUROestrogens that drive this effect. While indeed the aromatase they have investigated is expressed solely in the brain, in most teleosts, brain aromatase is only present in glial cells (astrocytes, radial glia). The authors should change this description so as not to mislead the reader. Below I detail more specific strengths and weaknesses of this manuscript.

      We thank the reviewer for this positive evaluation of our work and for the helpful comments and suggestions. Regarding the concern that the term “neuroestrogens” may be misleading, we addressed this in the previous revision by consistently replacing it throughout the manuscript with “brain-derived estrogens” or “brain estrogens.”

      In addition, the following sentence was added to the Introduction (line 61): “In teleost brains, including those of medaka, aromatase is exclusively localized in radial glial cells, in contrast to its neuronal localization in rodent brains (Forlano et al., 2001; Diotel et al., 2010; Takeuchi and Okubo, 2013).”

      Strength:

      Excellent use of the medaka model to disentangle the control of social behavior by sex steroid hormones 

      The findings are strong for the most part because deficits in the mutants are restored by the molecule (estrogens) that was no longer present due to the mutation 

      Presentation of the approach and findings are clear, allowing the reader to make their own inferences and compare them with the authors' 

      Includes multiple follow-up experiments, which leads to tests of internal replication and an impactful mechanistic proposal 

      Findings are provocative not just for teleost researchers, but for other species since, as the authors point out, the data suggest mechanisms of estrogenic control of social behaviors may be evolutionary ancient 

      We thank the reviewer again for their positive evaluation of our work.

      Weakness:

      As stated in the summary, the authors are attributing the estrogen source to neurons and there isn't evidence this is the case. The impact of the findings doesn't rest on this either

      As mentioned above, we addressed this in the previous revision by replacing “neuroestrogens” with “brain-derived estrogens” or “brain estrogens” throughout the manuscript. In addition, the following sentence was added to the Introduction (line 61): “In teleost brains, including those of medaka, aromatase is exclusively localized in radial glial cells, in contrast to its neuronal localization in rodent brains (Forlano et al., 2001; Diotel et al., 2010; Takeuchi and Okubo, 2013).”

      The d4 versus d8 esr2a mutants showed different results for aggression. The meaning and implications of this finding are not discussed, leaving the reader wondering

      This comment is the same as one raised in the first review (Reviewer #1’s comment 2 on weaknesses), which we already addressed in our initial revision. For the reviewer’s convenience, we provide the response below:

      Line 300: As the reviewer correctly noted, circles were significantly reduced in mutant males of the Δ8 line, whereas no significant reduction was observed in those of the Δ4 line. However, a tendency toward reduction was evident in the Δ4 line (P = 0.1512), and both lines showed significant differences in fin displays. Based on these findings, we believe our conclusion that esr2a<sup>−/−</sup> males exhibit reduced aggression remains valid. To clarify this point and address potential reader concerns, we have revised the text as follows: “esr2a<sup>−/−</sup> males exhibited significantly fewer fin displays (P = 0.0461 and 0.0293 for Δ8 and Δ4 lines, respectively) and circles (P = 0.0446 and 0.1512 for Δ8 and Δ4 lines, respectively) than their wild-type siblings (Fig. 5L; Fig. S8E), suggesting less aggression” was edited to read “esr2a<sup>−/−</sup> males from both the Δ8 and Δ4 lines exhibited significantly fewer fin displays than their wild-type siblings (P = 0.0461 and 0.0293, respectively). Circles followed a similar pattern, with a significant reduction in the Δ8 line (P = 0.0446) and a comparable but non-significant decrease in the Δ4 line (P = 0.1512) (Figure 5L, Figure 5—figure supplement 3E), showing less aggression.”

      Lack of attribution of previous published work from other research groups that would provide the proper context of the present study

      This comment is also the same as one raised in the first review (Reviewer #1’s comment 3 on weaknesses). In our previous revision, in response to this comment, we cited the relevant references (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015; Yong et al., 2017; Alward et al., 2020; Ogino et al., 2023) in the appropriate sections. We also added the following new references and revised the Introduction and Discussion accordingly:

      (2) Alward BA, Laud VA, Skalnik CJ, York RA, Juntti SA, Fernald RD. 2020. Modular genetic control of social status in a cichlid fish. Proceedings of the National Academy of Sciences of the United States of America 117:28167–28174. DOI: https://doi.org/10.1073/pnas.2008925117

      (39) O’Connell LA, Hofmann HA. 2012. Social status predicts how sex steroid receptors regulate complex behavior across levels of biological organization. Endocrinology 153:1341–1351. DOI: https://doi.org/10.1210/en.2011-1663

      (54) Yong L, Thet Z, Zhu Y. 2017. Genetic editing of the androgen receptor contributes to impaired male courtship behavior in zebrafish. Journal of Experimental Biology 220:3017–3021. DOI: https://doi.org/10.1242/jeb.161596

      There are a surprising number of citations not included; some of the ones not included argue against the authors' claims that their findings were "contrary to expectation"

      In our previous revision, we cited the relevant references (Hallgren et al., 2006; O’Connell and Hofmann, 2012; Huffman et al., 2013; Jalabert et al., 2015) in the Introduction. We also revised the text to remove phrases such as “contrary to expectation” and “unexpected.”

      The experimental design for studying aggression in males has flaws. A standard test like a resident-intruder test should be used.

      Following this comment, we have attempted additional aggression assays using the resident-intruder paradigm. However, these experiments did not produce consistent or interpretable results. As noted in our previous revision, medaka naturally form shoals and exhibit weak territoriality, and even slight differences in dominance between a resident and an intruder can markedly increase variability, reducing data reliability. Therefore, we believe that the approach used in the present study provides a more suitable assessment of aggression in medaka, regardless of territorial tendencies. We will continue to explore potential refinements in future studies and respectfully ask the reviewer to evaluate the present work based on the assay used here.

      While they investigate males and females, there are fewer experiments and explanations for the female results, making it feel like a small addition or an aside

      While we did not adopt this comment in our previous revision, we have carefully reconsidered the reviewers’ feedback and have now decided to remove the female data. This change allows us to present a more focused and cohesive story centered on males. The specific revisions are outlined below:

      Abstract

      Line 25: The text “, thereby revealing a previously unappreciated mode of action of brain-derived estrogens. We additionally show that female fish lacking Cyp19a1b are less receptive to male courtship and conversely court other females, highlighting the significance of brain-derived estrogens in establishing sex-typical behaviors in both sexes.” has been revised to “. Taken together, these findings reveal a previously unappreciated mode of action of brain-derived estrogens in shaping male-typical behaviors.”

      Results

      Line 88: The text “Loss of cyp19a1b function in these fish was verified by measuring brain and peripheral levels of sex steroids. As expected, brain estradiol-17β (E2) in both male and female homozygous mutants (cyp19a1b<sup>−/−</sup>) was significantly reduced to 16% and 50%, respectively, of the levels in their wild-type (cyp19a1b<sup>+/+</sup>) siblings (P = 0.0037, males; P = 0.0092, females) (Fig. 1, A and B). In males, brain E2 in heterozygotes (cyp19a1b<sup>−/−</sup>) was also reduced to 45% of the level in wild-type siblings (P = 0.0284) (Fig. 1A), indicating a dosage effect of cyp19a1b mutation. In contrast, peripheral E2 levels were unaltered in both cyp19a1b<sup>−/−</sup> males and females (Fig. S1, C and D), consistent with the expected functioning of Cyp19a1b primarily in the brain. Strikingly, brain levels of testosterone, as opposed to E2, increased 2.2-fold in cyp19a1b<sup>−/−</sup> males relative to wild-type siblings (P = 0.0006) (Fig. 1A). Similarly, brain 11KT levels in cyp19a1b<sup>−/−</sup> males and females increased 6.2- and 1.9-fold, respectively, versus wild-type siblings (P = 0.0007, males; P = 0.0316, females) (Fig. 1, A and B). These results show that cyp19a1b-deficient fish have reduced estrogen levels coupled with increased androgen levels in the brain, confirming the loss of cyp19a1b function. They also suggest that the majority of estrogens in the male brain and half of those in the female brain are synthesized locally in the brain. In addition, peripheral 11KT levels in cyp19a1b<sup>−/−</sup> males and females increased 3.7- and 1.8-fold, respectively (P = 0.0789, males; P = 0.0118, females) (Fig. S1, C and D), indicating peripheral influence in addition to central effects.” has been revised to “Loss of cyp19a1b function in these fish was verified by measuring brain and peripheral levels of sex steroids in males. 
As expected, brain estradiol-17β (E2) in homozygous mutants (cyp19a1b<sup>−/−</sup>) was significantly reduced to 16% of the levels in wild-type (cyp19a1b<sup>+/+</sup>) siblings (P = 0.0037) (Figure 1A). Brain E2 in heterozygotes (cyp19a1b<sup>+/−</sup>) was also reduced to 45% of wild-type levels (P = 0.0284) (Figure 1A), indicating a dosage effect of the cyp19a1b mutation. In contrast, peripheral E2 levels were unaltered in cyp19a1b<sup>−/−</sup> males (Figure 1B), consistent with the expected functioning of Cyp19a1b primarily in the brain. Strikingly, brain testosterone levels, as opposed to E2, increased 2.2-fold in cyp19a1b<sup>−/−</sup> males relative to wild-type siblings (P = 0.0006) (Figure 1A). Similarly, brain 11KT levels increased 6.2-fold (P = 0.0007) (Figure 1A). These results indicate that cyp19a1b-deficient males have reduced estrogen coupled with elevated androgen levels in the brain, confirming the loss of cyp19a1b function. They also suggest that the majority of estrogens in the male brain are synthesized locally in the brain. Peripheral 11KT levels also increased 3.7-fold in cyp19a1b<sup>−/−</sup> males (P = 0.0789) (Figure 1B), indicating peripheral influence in addition to central effects.”

      Line 211: “expression of vt in the pNVT of cyp19a1b<sup>−/−</sup> males was significantly reduced to 18% as compared with cyp19a1b<sup>+/+</sup> males (P = 0.0040), a level comparable to that observed in females” has been revised to “expression of vt in the pNVT of cyp19a1b<sup>−/−</sup> males was significantly reduced to 18% as compared with cyp19a1b<sup>+/+</sup> males (P = 0.0040).”

      The subsection entitled “cyp19a1b-deficient females are less receptive to males and instead court other females,” which followed line 311, has been removed.

      Discussion

      The two paragraphs between lines 373 and 374, which addressed the female data, have been removed.

      Materials and methods

      Line 433: “males and females” has been changed to “males”.

      Line 457: “focal fish” has been changed to “focal male”.

      Line 458: “stimulus fish” has been changed to “stimulus female”.

      Line 458: “Fig. 6, E and F, ” has been deleted.

      Line 460: “; wild-type males in Fig. 6, A to C” has been deleted.

      Line 466: The text “The period of interaction/recording was extended to 2 hours in tests of courtship displays received from the stimulus esr2b-deficient female and in tests of mating behavior between females, because they take longer to initiate courtship (12). In tests using an esr2b-deficient female as the stimulus fish, where the latency to spawn could not be calculated because these fish were unreceptive to males and did not spawn, the sexual motivation of the focal fish was instead assessed by counting the number of courtship displays and wrapping attempts in 30 min. The number of these mating acts was also counted in tests to evaluate the receptivity of females. In tests of mating behavior between two females, the stimulus female was marked with a small notch in the caudal fin to distinguish it from the focal female.” has been revised to “In tests using an esr2b-deficient female as the stimulus fish, the latency to spawn could not be calculated because the female was unreceptive to males and did not spawn. Therefore, the sexual motivation of the focal male was assessed by counting the number of courtship displays and wrapping attempts in 30 min. To evaluate courtship displays performed by stimulus esr2b-deficient females toward focal males, the recording period was extended to 2 hours, as these females take longer to initiate courtship (Nishiike et al., 2021). In all video analyses, the researcher was blind to the fish genotype and treatment.”

      Line 499: “brains dissected from males and females of the cyp19a1b-deficient line (analysis of ara, arb, vt, gal, npba, and esr2b) and males of the esr1-, esr2a-, and esr2b-deficient lines” has been revised to “male brains from the cyp19a1b-deficient line (analysis of ara, arb, vt, and gal) and from the esr1-, esr2a-, and esr2b-deficient lines.”

      Line 504: “After color development for 15 min (gal), 40 min (npba), 2 hours (vt), or overnight (ara, arb, and esr2b)” has been revised to “After color development for 15 min (gal), 2 hours (vt), or overnight (ara and arb).”

      Line 516: “Thermo Fisher Scientific, Waltham, MA” has been changed to “Thermo Fisher Scientific” to avoid redundancy.

      Line 565: The subsection entitled “Measurement of spatial distances between fish” has been removed.

      Line 585: “6/10 cyp19a1b<sup>+/+</sup>, 3/10 cyp19a1b<sup>+/−</sup>, and 6/10 cyp19a1b<sup>−/−</sup> females were excluded in Fig. 6B;” has been deleted.

      References

      The following references have been removed:

      Capel B. 2017. Vertebrate sex determination: evolutionary plasticity of a fundamental switch. Nature Reviews Genetics 18:675–689. DOI: https://doi.org/10.1038/nrg.2017.60

      Hiraki T, Nakasone K, Hosono K, Kawabata Y, Nagahama Y, Okubo K. 2014. Neuropeptide B is female-specifically expressed in the telencephalic and preoptic nuclei of the medaka brain. Endocrinology 155:1021–1032. DOI: https://doi.org/10.1210/en.2013-1806

      Juntti SA, Hilliard AT, Kent KR, Kumar A, Nguyen A, Jimenez MA, Loveland JL, Mourrain P, Fernald RD. 2016. A neural basis for control of cichlid female reproductive behavior by prostaglandin F2α. Current Biology 26:943–949. DOI: https://doi.org/10.1016/j.cub.2016.01.067

      Kimchi T, Xu J, Dulac C. 2007. A functional circuit underlying male sexual behaviour in the female mouse brain. Nature 448:1009–1014. DOI: https://doi.org/10.1038/nature06089

      Kobayashi M, Stacey N. 1993. Prostaglandin-induced female spawning behavior in goldfish (Carassius auratus) appears independent of ovarian influence. Hormones and Behavior 27:38–55. DOI: https://doi.org/10.1006/hbeh.1993.1004

      Liu H, Todd EV, Lokman PM, Lamm MS, Godwin JR, Gemmell NJ. 2017. Sexual plasticity: a fishy tale. Molecular Reproduction and Development 84:171–194. DOI: https://doi.org/10.1002/mrd.22691

      Munakata A, Kobayashi M. 2010. Endocrine control of sexual behavior in teleost fish. General and Comparative Endocrinology 165:456–468. DOI: https://doi.org/10.1016/j.ygcen.2009.04.011

      Nugent BM, Wright CL, Shetty AC, Hodes GE, Lenz KM, Mahurkar A, Russo SJ, Devine SE, McCarthy MM. 2015. Brain feminization requires active repression of masculinization via DNA methylation. Nature Neuroscience 18:690–697. DOI: https://doi.org/10.1038/nn.3988

      Shaw K, Therrien M, Lu C, Liu X, Trudeau VL. 2023. Mutation of brain aromatase disrupts spawning behavior and reproductive health in female zebrafish. Frontiers in Endocrinology 14:1225199. DOI: https://doi.org/10.3389/fendo.2023.1225199

      Stacey NE. 1976. Effects of indomethacin and prostaglandins on the spawning behaviour of female goldfish. Prostaglandins 12:113–126. DOI: https://doi.org/10.1016/s0090-6980(76)80010-x

      Figure 1

      Panel B, which originally showed steroid levels in female brains, has been replaced with steroid levels in the periphery of males, originally presented in Figure S1, panel C. Accordingly, the legend “(A and B) Levels of E2, testosterone, and 11KT in the brain of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (A) and females (B) (n = 3 per genotype and sex).” has been revised to “(A, B) Levels of E2, testosterone, and 11KT in the brain (A) and periphery (B) of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (n = 3 per genotype).”

      Figure 3

      The female data have been deleted from Figure 3. The revised Figure 3 is presented.

      The corresponding legend text has been revised as follows:

      Line 862: “males and females (n = 4 and 5 per genotype for males and females, respectively)” has been changed to “males (n = 4 per genotype)”.

      Line 864: “males and females (n = 4 except for cyp19a1b<sup>+/+</sup> males, where n = 3)” has been changed to “males (n = 3 and 4, respectively)”.

      Figure 6

      Figure 6 and its legend have been removed.

      Figure 1—figure supplement 1

      Panel C, showing male data, has been moved to Figure 1B, as described above, while panel D, showing female data, has been deleted. The corresponding legend “(C and D) Levels of E2, testosterone, and 11KT in the periphery of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (C) and females (D) (n = 3 per genotype and sex). Statistical differences were assessed by Bonferroni’s post hoc test (C and D). Error bars represent SEM. *P < 0.05.” has also been removed.

      Line 804: Following this change, the figure title has been updated from “Generation of cyp19a1b-deficient medaka and evaluation of peripheral sex steroid levels” to “Generation of cyp19a1b-deficient medaka.”

      The statistics comparing "experimental to experimental" and "control to experimental" isn't appropriate 

      This comment is the same as one raised in the first review (Reviewer #1’s comment 7 on weaknesses), which we already addressed in our initial revision. For the reviewer’s convenience, we provide the response below:

      The reviewer raised concerns about the statistical analysis used for Figures 4C and 4E, suggesting that Bonferroni’s test should be used instead of Dunnett’s test. However, Dunnett’s test is commonly used to compare treatment groups to a reference group that receives no treatment, as in our study. Since we do not compare the treated groups with each other, we believe Dunnett’s test is the most appropriate choice.

      Line 576: The reviewer’s concern may have arisen from the phrase “comparisons between control and experimental groups” in the Materials and methods. We have revised it to “comparisons between untreated and E2-treated groups in Figure 4C and D” for clarity.
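The distinction at issue in this exchange (many-to-one comparisons against a single reference group versus all pairwise comparisons) can be illustrated with a minimal sketch. The group names and the number of E2-treated groups below are hypothetical and are not taken from the manuscript; the sketch only counts the comparisons in each family and shows how a Bonferroni threshold tightens as the family grows. It does not compute Dunnett's test itself.

```python
from itertools import combinations

def many_to_one(control, treatments):
    """Comparison family of a Dunnett-style design: each treatment vs. the control."""
    return [(control, t) for t in treatments]

def all_pairs(groups):
    """Comparison family when every group is compared with every other group."""
    return list(combinations(groups, 2))

# Hypothetical design: one untreated group and three E2-treated groups.
control = "untreated"
treatments = ["E2_group_1", "E2_group_2", "E2_group_3"]

dunnett_family = many_to_one(control, treatments)
pairwise_family = all_pairs([control] + treatments)

# A Bonferroni correction divides alpha by the number of comparisons in the family.
alpha = 0.05
bonferroni_many_to_one = alpha / len(dunnett_family)
bonferroni_all_pairs = alpha / len(pairwise_family)

print(len(dunnett_family), len(pairwise_family))
print(bonferroni_many_to_one, bonferroni_all_pairs)
```

When only treatment-versus-reference contrasts are of interest, the family is smaller than the all-pairs family, which is the design situation the authors describe for their choice of Dunnett's test.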

      Reviewer #3 (Public Review):

      Summary:

      Taking advantage of the existence in fish of two genes coding for estrogen synthase, the enzyme aromatase, one mostly expressed in the brain (Cyp19a1b) and the other mostly found in the gonads (Cyp19a1a), this study investigates the role of brain-derived estrogens in the control of sexual and aggressive behavior in medaka. The constitutive deletion of Cyp19a1b markedly reduced brain estrogen content in males and to a lesser extent in females. These effects are accompanied by reduced sexual and aggressive behavior in males and reduced preference for males in females. These effects are reversed by adult treatment with E2, supporting a role for estrogens. The deletion of Cyp19a1b is associated with a reduced expression of the genes coding for the two androgen receptors, ara and arb, in brain regions involved in the regulation of social behavior. The analysis of the gene expression and behavior of mutants of estrogen receptors indicates that these effects are likely mediated by the activation of the esr1 and esr2a isoforms. These results provide valuable insight into the role of estrogens in social behavior in the most abundant vertebrate taxon; however, the conclusion of brain-derived estrogens awaits definitive confirmation.

      We thank this reviewer for their positive evaluation of our work and comments that have improved the manuscript.

      Strength:

      Evaluation of the role of brain "specific" Cyp19a1 in male teleost fish, which as a taxon are more abundant and yet proportionally less studied than the most common birds and rodents. Therefore, evaluating the generalizability of results from higher vertebrates is important. This approach also offers great potential to study the role of brain estrogen production in females, an understudied question in all taxa.

      Results obtained from multiple mutant lines converge to show that estrogen signaling, likely synthesized in the brain, drives aspects of male sexual behavior.

      The comparative discussion of the age-dependent abundance of brain aromatase in fish vs mammals and its role in organization vs activation is important beyond the study of the targeted species.

      The authors have made important corrections to tone down some of the conclusions, which are now more in line with the results.

      We thank the reviewer again for their positive evaluation of our work and the revisions we have made.

      Weaknesses:

      No evaluation of the mRNA and protein products of Cyp19a1b and ESR2a is presented, such that there is no proper demonstration that the mutation indeed leads to aromatase reduction. The conclusion that these effects depend on brain-derived estrogens is therefore only supported by measures of E2 with an EIA kit that is not validated. No discussion of these shortcomings is provided in the discussion, thus further weakening the conclusions of the manuscript.

      In response to this and other comments, we have now provided direct validation that the cyp19a1b mutation in our medaka leads to loss of function. Real-time PCR analysis showed that cyp19a1b transcript levels in the brain were reduced by approximately half in cyp19a1b<sup>+/−</sup> males and were nearly absent in cyp19a1b<sup>−/−</sup> males, consistent with nonsense-mediated mRNA decay

      In addition, AlphaFold 3-based structural modeling indicated that the mutant Cyp19a1b protein lacks essential motifs, including the aromatic region and heme-binding loop, and exhibits severe conformational distortion (see figure; key structural features are annotated as follows: membrane helix (blue), aromatic region (red), and heme-binding loop (orange)). 

      Results:

      Line 101: The following text has been added: “Loss of cyp19a1b function was further confirmed by measuring cyp19a1b transcript levels in the brain and by predicting the three-dimensional structure of the mutant protein. Real-time PCR revealed that transcript levels were reduced by half in cyp19a1b<sup>+/−</sup> males and were nearly undetectable in cyp19a1b<sup>−/−</sup> males, presumably as a result of nonsense-mediated mRNA decay (Lindeboom et al., 2019) (Figure 1C). The wild-type protein, modeled by AlphaFold 3, exhibited a typical cytochrome P450 fold, including the membrane helix, aromatic region, and heme-binding loop, all arranged in the expected configuration (Figure 1—figure supplement 1C). The mutant protein, in contrast, was severely truncated, retaining only the membrane helix (Figure 1—figure supplement 1C). The absence of essential domains strongly indicates that the allele encodes a nonfunctional Cyp19a1b protein. Together, transcript and structural analyses consistently demonstrate that the mutation generated in this study causes a complete loss of cyp19a1b function.”

      Materials and methods

      Line 438: A subsection entitled “Real-time PCR” has been added. The text of this subsection is as follows: “Total RNA was isolated from the brains of cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males using the RNeasy Plus Universal Mini Kit (Qiagen, Hilden, Germany). cDNA was synthesized with the SuperScript VILO cDNA Synthesis Kit (Thermo Fisher Scientific, Waltham, MA). Real-time PCR was performed on the LightCycler 480 System II using the LightCycler 480 SYBR Green I Master (Roche Diagnostics). Melting curve analysis was conducted to verify that a single amplicon was obtained in each sample. The β-actin gene (actb; GenBank accession number NM_001104808) was used to normalize the levels of target transcripts. The primers used for real-time PCR are shown in Supplementary file 2.”
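
      The normalization described in this subsection, together with the scaling in the Figure 1C legend (wild-type mean set to 1), can be sketched as follows. This is only an illustration under stated assumptions: the response does not say which quantification method was used, so the 2^-ΔCt calculation (which assumes roughly 100% amplification efficiency), the function names, and the Ct values below are all hypothetical and are not data from the study.

```python
from statistics import mean

def relative_expression(ct_target, ct_reference):
    """Reference-normalized expression for one sample: 2^-(Ct_target - Ct_reference).

    Assumes ~100% amplification efficiency for both primer pairs (an assumption,
    not something stated in the methods text).
    """
    return 2.0 ** -(ct_target - ct_reference)

def scale_to_control(groups, control):
    """Divide every value by the control-group mean so that the control mean equals 1."""
    control_mean = mean(groups[control])
    return {name: [v / control_mean for v in values] for name, values in groups.items()}

# Hypothetical (target Ct, actb Ct) pairs per genotype; NOT data from the study.
ct_pairs = {
    "+/+": [(24.0, 18.0), (24.2, 18.1)],
    "+/-": [(25.0, 18.0), (25.1, 18.0)],
    "-/-": [(30.0, 18.0), (30.5, 18.2)],
}

expression = {
    genotype: [relative_expression(target, ref) for target, ref in pairs]
    for genotype, pairs in ct_pairs.items()
}
scaled = scale_to_control(expression, control="+/+")

for genotype, values in scaled.items():
    print(genotype, round(mean(values), 3))
```

      With this scaling, each genotype's value reads directly as a fold change relative to the wild-type mean.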

      Line 448: A subsection entitled “Protein structure prediction” has been added. The text of this subsection is as follows: “Structural predictions of Cyp19a1b proteins were conducted using AlphaFold 3 (Abramson et al., 2024). Amino acid sequences corresponding to the wild-type allele and the mutant allele generated in this study were submitted to the AlphaFold 3 prediction server. The resulting models were visualized with PyMOL (Schrödinger, New York, NY), and key structural features, including the membrane helix, aromatic region, and heme-binding loop, were annotated.”

      References

      The following two references have been added:

      Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, Bodenstein SW, Evans DA, Hung CC, O'Neill M, Reiman D, Tunyasuvunakool K, Wu Z, Žemgulytė A, Arvaniti E, Beattie C, Bertolli O, Bridgland A, Cherepanov A, Congreve M, Cowen-Rivers AI, Cowie A, Figurnov M, Fuchs FB, Gladman H, Jain R, Khan YA, Low CMR, Perlin K, Potapenko A, Savy P, Singh S, Stecula A, Thillaisundaram A, Tong C, Yakneen S, Zhong ED, Zielinski M, Žídek A, Bapst V, Kohli P, Jaderberg M, Hassabis D, Jumper JM. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630:493–500. DOI: https://doi.org/10.1038/s41586-024-07487-w

      Lindeboom RGH, Vermeulen M, Lehner B, Supek F. 2019. The impact of nonsense-mediated mRNA decay on genetic disease, gene editing and cancer immunotherapy. Nature Genetics 51:1645–1651. DOI: https://doi.org/10.1038/s41588-019-0517-5

      Figure 1

      The real-time PCR results described above have been incorporated in Figure 1, panel C, with the corresponding legend provided below (line 788).

      (C) Brain cyp19a1b transcript levels in cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (n = 6 per genotype). Mean value for cyp19a1b<sup>+/+</sup> males was arbitrarily set to 1.

      The subsequent panels have been renumbered accordingly. The entirety of the revised Figure 1 is presented below.

      Figure 1—figure supplement 1

      The AlphaFold 3-generated structural models described above have been incorporated in Figure 1—figure supplement 1, panel C, with the corresponding legend provided below (line 811).

      (C) Predicted three-dimensional structures of wild-type (left) and mutant (right) Cyp19a1b proteins. Key structural features are annotated as follows: membrane helix (blue), aromatic region (red), and heme-binding loop (orange).

      The entirety of the revised Figure 1—figure supplement 1 is presented below.

      The information on the primers used for real-time PCR has been included in Supplementary file 2.

      The functional deficiency of esr2a was already addressed in the previous revision. For clarity, we have reproduced the relevant information here.

      A previous study reported that female medaka lacking esr2a fail to release eggs due to oviduct atresia (Kayo et al., 2019, Sci Rep 9:8868). Similarly, in this study, some esr2a-deficient females exhibited spawning behavior but were unable to release eggs, although the sample size was limited (Δ8 line: 2/3; Δ4 line: 1/1). In contrast, this was not observed in wild-type females (Δ8 line: 0/12; Δ4 line: 0/11). These results support the effective loss of esr2a function. To incorporate this information into the manuscript, the following text has been added to the Materials and methods (line 423): “A previous study reported that esr2a-deficient female medaka cannot release eggs due to oviduct atresia (Kayo et al., 2019). Likewise, some esr2a-deficient females generated in this study, despite the limited sample size, exhibited spawning behavior but were unable to release eggs (Δ8 line: 2/3; Δ4 line: 1/1), while such failure was not observed in wild-type females (Δ8 line: 0/12; Δ4 line: 0/11). These results support the effective loss of esr2a function.”

      Most experiments are weakly powered (low sample size).

      This comment is essentially the same as one raised in the first review (Reviewer #3’s comment 7 on weaknesses). We acknowledge the reviewer’s concern that the histological analyses were weakly powered due to the limited sample size. In our earlier revision, we responded as follows:

      Histological analyses were conducted with a relatively small sample size, as our previous experience suggested that interindividual variability in the results would not be substantial. Since significant differences were detected in many analyses, further increasing the sample size was deemed unnecessary.

      The variability of the mRNA content for a same target gene between experiments (genotype comparison vs E2 treatment comparison) raises questions about the reproducibility of the data (apparent disappearance of genotype effect).

      This comment is the same as one raised in the first review (Reviewer #3’s comment 8 on weaknesses), which we already addressed in our initial revision. For the reviewer’s convenience, we provide the response below:

      As the reviewer pointed out, the overall area of ara expression is larger in Figure 2J than in Figure 2F. However, the relative area ratios of ara expression among brain nuclei are consistent between the two figures, indicating the reproducibility of the results. Thus, this difference is unlikely to affect the conclusions of this study.

      Additionally, the differences in ara expression in pPPp and arb expression in aPPp between wild-type and cyp19a1b-deficient males appear less pronounced in Figures 2J and 2K than in Figures 2F and 2H. This is likely attributable to the smaller sample size used in the experiments for Figures 2J and 2K, resulting in less distinct differences. However, as the same genotype-dependent trends are observed in both sets of figures, the conclusion that ara and arb expression is reduced in cyp19a1b-deficient male brains remains valid.

      Conclusions:

      Overall, the claims regarding the role of estrogens originating in the brain in male sexual behavior are supported by converging evidence from multiple mutant lines. The evidence for a role of brain-derived estrogens in gene expression in the brain is weaker, as are the results in females.

      We appreciate the reviewer’s positive evaluation of our findings on male behavior. The concern regarding the role of brain-derived estrogens in gene expression has been addressed in our rebuttal, and the female data have been removed so that the analysis now focuses on males. The specific revisions for removing the female data are described in Response to reviewer #1’s comment 6 on weaknesses.

      Recommendations For The Authors:

      Reviewer #1 (Recommendations For The Authors):

      The manuscript is improved slightly. I am thankful the authors addressed some concerns, but for several concerns the referees raised, the authors acknowledged them yet did not make corresponding changes to the manuscript or disagreed that they were issues at all without explanation. All reviewers had issues with the imbalanced focus on males versus females and the male aggression assay. Yet, they did not perform additional experiments or even make changes to the framing and scope of the manuscript. If the authors had removed the female data, they may have had a more cohesive story, but then they would still be left with inadequate behavior assays in the males. If the authors don't have the time or resources to perform the additional work, then they should have said so. However, the work would be incomplete relative to the claims. That is a key point here. If they change their scope and claims, the authors avoid overstating their findings. I want to see this work published because I believe it moves the field forward. But the authors need to be realistic in their interpretations of their data. 

      In response to this and related comments, we have removed the female data and focused the manuscript on analyses in males. The specific revisions are described in Response to reviewer #1’s comment 6 on weaknesses. Additionally, we have validated that the cyp19a1b mutation in our medaka leads to loss of function (see Response to reviewer #3’s comment 1 on weaknesses), which further strengthens the reliability of our conclusions regarding male behavior.

      I agree with the reviewer who said we need to see validation of the absence of functional cyp19a1b in the brain. However, the results from staining for the protein and performing in situ could be quizzical. Indeed, there aren't antibodies that could distinguish between aromatase a and b, and it is not uncommon for expression of a mutated gene to be normal. One approach they could take is to measure aromatase activity, but they are *sort of* doing that by measuring brain E2. It's not perfect, but we teleost folks are limited in these areas. At the very least, they should show the predicted protein structure of the mutated aromatase alleles. It could show clearly that the tertiary structure is utterly absent, giving more support to the claim that their aromatase gene is non-functional.

      As noted above, we have further validated the loss of cyp19a1b function by measuring cyp19a1b transcript levels in the brain and predicting the three-dimensional structure of the mutant protein. These analyses confirmed that cyp19a1b function is indeed lost, thereby increasing the reliability of our conclusions. For further details, please refer to Response to reviewer #3’s comment 1 on weaknesses.

      With all of this said, the work is important, and it is possible that with a reframing of the impact of their work in the context of their findings, I could consider the work complete. I think with a proper reframing, the work is still impactful. 

      In accordance with this feedback, and as described above, we have reframed the manuscript by removing the female data and focusing exclusively on males. This revision clarifies the scope of our study and reinforces the support for our conclusions. For further details, please refer to Response to reviewer #1’s comment 6 on weaknesses.

      (1) Clearly state in the Figure 1 legend that each data point for male aggressive behaviors represents the total # of behaviors calculated over the 4 males in each experimental tank.

      In response to this comment, we have revised the legend of Figure 1K (line 797). The original legend, “(K) Total number of each aggressive act observed among cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, or cyp19a1<sup>−/−</sup> males in the tank (n = 6, 7, and 5, respectively),” has been updated to “(K) Total number of each aggressive act performed by cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males. Each data point represents the sum of acts recorded for the 4 males of the same genotype in a single tank (n = 6, 7, and 5 tanks, respectively).” This clarifies that each data point reflects the total behaviors of the 4 males within each tank.

      (2) The authors wrote under "Response to reviewer #1's major comment "...the development of male behaviors may require moderate neuroestrogen levels that are sufficient to induce the expression of ara and arb, but not esr2b, in the underlying neural circuitry": "This may account for the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study.".

      What is meant by the latter statement? What accounts for the lack of aggression? The lack of increase in esr2b? Please clarify. 

      Line 365: In response to this comment, “This may account for the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study.” has been revised to “Considering this, the lack of aggression recovery in E2-treated cyp19a1b-deficient males in this study may be explained by the possibility that the E2 dose used was sufficient to induce not only ara and arb but also esr2b expression in aggression-relevant circuits, which potentially suppressed aggression.”

      This revision clarifies that, while moderate brain estrogen levels are sufficient to promote male behaviors via induction of ara and arb, the E2 dose used in this study may have additionally induced esr2b in circuits relevant to aggression, potentially underlying the lack of aggression recovery.

      (3) This is a continuation of my comment/concern directly above. If the induction of ara and arb aren't enough, then how can, as the authors state, androgen signaling be the primary driver of these behaviors? 

      In response to this follow-up comment, we would like to clarify that, as described above, the lack of aggression recovery in E2-treated cyp19a1b-deficient males is not due to insufficient induction of ara and arb, but instead is likely because esr2b was also induced in aggression-relevant circuits, which may have suppressed aggression. Therefore, the concern that androgen signaling cannot be the primary driver of these behaviors is not applicable.

      (4) The authors' point about sticking with the terminology for the ar genes as "ara" and "arb" is not convincing. The whole point of needing a change to match the field of neuroendocrinology as a whole (that is, across all vertebrates) is researchers, especially those with high standing like the Okubo group, adopt the new terminology. Indeed, the Okubo group is THE leader in medaka neuroendocrinology. It would go a long way if they began adopting the new terminology of "ar1" and "ar2". I understand this may be laborious to a degree, and each group can choose to use their terminology, but I'd be remiss if I didn't express my opinion that changing the terminology could help our field as a whole. 

      We sincerely appreciate the reviewer’s thoughtful comments regarding nomenclature consistency in vertebrate neuroendocrinology. We understand the motivation behind the suggestion to adopt ar1 and ar2. However, we consider the established nomenclature of ara and arb to be more appropriate for the following reasons.

      First, adopting the ar1/ar2 nomenclature would introduce a discrepancy between gene and protein symbols. According to the NCBI International Protein Nomenclature Guidelines (Section 2B. Abbreviations and symbols; https://www.ncbi.nlm.nih.gov/genbank/internatprot_nomenguide/), the ZFIN Zebrafish Nomenclature Conventions (Section 2. PROTEINS: https://zfin.atlassian.net/wiki/spaces/general/pages/1818394635/ZFIN+Zebrafish+Nomenclature+Conventions), and the author guidelines of many journals (e.g., https://academic.oup.com/molehr/pages/Gene_And_Protein_Nomenclature), gene and protein symbols should be identical (with proteins designated in non-italic font and with the first letter capitalized). Maintaining consistency between gene and protein symbols helps avoid unnecessary confusion. The ara/arb nomenclature allows this, whereas ar1/ar2 does not.

      Second, the two androgen receptor genes in teleosts are paralogs derived from the third round of whole-genome duplication that occurred early in teleost evolution. For such duplicated genes, the ZFIN Zebrafish Nomenclature Conventions (Section 1.2. Duplicated genes) recommend appending the suffixes “a” and “b” to the approved symbol of the human or mouse ortholog. This convention clearly indicates that these genes are whole-genome duplication paralogs and provides an intuitive way to represent orthologous and paralogous relationships between teleost genes and those of other vertebrates. As a result, it has been widely adopted, and we consider it logical and beneficial to apply the same principle to androgen receptors.

      In light of these considerations, we respectfully maintain that the ara/arb nomenclature is more suitable for the present manuscript than the alternative ar1/ar2 system.

      (5) In the discussion please discuss these potentially unexpected findings.

      (a) gal was unaffected in female cyp19a1 mutants, but they exhibit mating behaviors towards females. Given gal is higher in males and these females act like females, what does this mean about the function of gal/its utility in being a male-specific marker (is it one??)? 

      (b) esr2b expression is higher in female cyp19a1 mutants. This is unexpected as well, given that esr2b is required for female-typical mating, is higher in females compared to males, and is increased by E2. Please explain...well, what this means for our idea of what esr2b expression tells us.

      We thank the reviewer for the insightful comments. As the female data have been removed from the manuscript, discussion of these findings in female cyp19a1b mutants is no longer necessary.

      Reviewer #3 (Recommendations For The Authors):

      The authors have addressed a number of the reviewers' comments; notably, they provided missing methodological information and rephrased the text. However, the authors have not addressed the main issues raised by the reviewers. Notably, it is regrettable that the reduced amount of brain aromatase cannot be confirmed; this seems to be the primary step when validating a new mutant. Even if protein products of the two genes may not be discriminated (which I can understand), it should be possible to evaluate the expression of a common messenger and/or peptide and confirm that aromatase expression is reduced in the brain. Since Cyp19a1b is relatively more abundant in the brain than Cyp19a1a, this would strengthen the conclusion and provide confidence that the mutant indeed does silence aromatase expression in the brain. Although these shortcomings are acknowledged in the rebuttal letter, they are not mentioned in the discussion. Doing so would make the manuscript more transparent and clearer.

      As noted in Response to reviewer #3’s comment 1 on weaknesses, we have validated the loss of Cyp19a1b function by measuring its transcript levels in the brain and predicting the three-dimensional structure of the mutant protein. These analyses confirmed that Cyp19a1b function is indeed lost, thereby increasing the reliability of our conclusions.

      FigS1 - panels C&D please indicate in which tissue were hormones measured. Blood?

      We thank the reviewer for pointing this out. In our study, “peripheral” refers to the caudal half of the body excluding the head and visceral organs, not blood. Accordingly, we have revised the figure legend and the description in the Materials and Methods section as follows:

      Legend for Figure 1B (line 787) now reads: “Levels of E2, testosterone, and 11KT in the brain (A) and peripheral tissues (caudal half of the body) (B) of adult cyp19a1b<sup>+/+</sup>, cyp19a1b<sup>+/−</sup>, and cyp19a1b<sup>−/−</sup> males (n = 3 per genotype).”

      Materials and methods (line 431): The sentence “Total lipids were extracted from the brain and peripheral tissues (from the caudal half) of” has been revised to “Total lipids were extracted from the brain and from peripheral tissues, specifically the caudal half of the body excluding the head and visceral organs, of.”

      Additional Alterations:

      We have reformatted the text and supporting materials to comply with the journal’s Author Guidelines. The following changes have been made:

      (1) Figures and supplementary files are now provided separately from the main text.

      (2) The title page has been reformatted without any changes to its content.

      (3) In-text citations have been changed from numerical references to the author–year format.

      (4) Figure labels have been revised from “Fig. 1,” “Fig. S1,” etc., to “Figure 1,” “Figure 1—figure supplement 1,” etc.

      (5) Table labels have been revised from “Table S1,” etc., to “Supplementary file 1,” etc.

      (6) Line 324: The typo “is” has been corrected to “are”.

      (7) Line 382: The section heading “Materials and Methods” has been changed to “Materials and methods” (lowercase “m”).

      (8) Line 383: The Key Resources Table has been placed at the beginning of the Materials and methods section.

      (9) Line 389: The sentence “Sexually mature adults (2–6 months) were used for experiments, and tissues were consistently sampled 1–5 hours after lights on.” has been revised to “Sexually mature adults (2–6 months) were used for experiments and assigned randomly to experimental groups. Tissues were consistently sampled 1–5 hours after lights on.”

      (10)  Line 393: The sentence “All fish were handled in accordance with the guidelines of the Institutional Animal Care and Use Committee of the University of Tokyo.” has been removed.

      (11)  Line 589: The following sentence has been added: “No power analysis was conducted due to the lack of relevant data; sample size was estimated based on previous studies reporting inter-individual variation in behavior and neural gene expression in medaka.”

      (12)  Line 598: The reference list has been reordered from numerical sequence to alphabetical order by author.

      (13)  In the figure legends, notations such as “A and B” have been revised to “A, B.”

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2025-03091

      Corresponding author(s): Chia-Tsen, Tsai, Liuh-Yow Chen

      1. General Statements [optional]

      We thank the reviewers for their valuable time and constructive feedback on our study, which ultimately improved our manuscript. Herein, we provide a detailed response to each of the reviewers' comments, supported by new data that have been integrated into both the main text and the supplementary figures.

      2. Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary: This manuscript builds upon the authors' prior findings that targeting COUP-TF2 to TRF1 induces ALT-associated phenotypes and G2-mediated synthesis in telomerase-immortalised BJT human fibroblasts. In this study, the authors show that telomere-coupled COUP-TF2 promotes H3K9me3 enrichment in these cells, and that this effect is blocked by TRIM28 depletion. Furthermore, TRIM28 depletion also suppresses the formation of ALT phenotypes in VA13 ALT cells. Given that TRIM28 has been implicated in regulating H3K9me3 deposition via SETDB1, and has been reported to co-purify with TR2 and TR4 (though not previously in the context of ALT telomeres), these findings add mechanistic depth to how heterochromatin regulators contribute to ALT activity. Overall, the manuscript's conclusions are generally supported by the presented data, but several aspects require clarification or additional experimental validation.

      The authors report a modest reduction in telomeric H3K9me3 following COUP-TF2 and TR4 depletion in U-2 OS and VA13 cells (Figure 1B). To strengthen the claim that these orphan receptors specifically regulate H3K9me3, the authors should 1) assess additional heterochromatic histone marks (e.g., H4K20me3) at telomeres, 2) normalize telomeric signals to both parental histone levels and input, and 3) evaluate whether global H3K9me3 levels also decrease upon receptor depletion.

      Response: We appreciate the reviewer's suggestion. To address the concern regarding specificity, we assessed H3K27me3 and H4K20me3 levels upon COUP-TF2/TR4 depletion and found no significant changes (Supplementary Fig. 1C). Furthermore, we reprocessed the telomeric ChIP data, normalizing to both input DNA and parental histone levels (Figure 1B). This refined analysis reinforces our original conclusion. Finally, Western blot analysis showed no significant changes in global H3 or H3K9me3 levels upon COUP-TF2/TR4 depletion (Figure 1A). Altogether, these results further support the specificity of COUP-TF2/TR4 for H3K9me3 at telomeres. We have revised the main text (page 3) and updated Figure 1A, 1B, and Supplementary Figure 1C for these changes.
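
      The double normalization mentioned in this response (to input DNA and to parental histone levels) can be illustrated with a short sketch. The response does not give the exact formula, so the calculation below is an assumption: each ChIP signal is first expressed as a percentage of input, and the H3K9me3 value is then divided by the total H3 value from the same sample. The function names, the 10% input fraction, and all signal values are hypothetical.

```python
def percent_input(ip_signal, input_signal, input_fraction):
    """ChIP signal as a percentage of total chromatin.

    input_fraction is the fraction of chromatin that was loaded as the input
    sample (for example 0.1 for a 10% input); this value is an assumption here.
    """
    return 100.0 * ip_signal * input_fraction / input_signal

def mark_per_histone(mark_ip, histone_ip, input_signal, input_fraction=0.1):
    """Histone-mark ChIP normalized first to input, then to the parental histone ChIP."""
    mark = percent_input(mark_ip, input_signal, input_fraction)
    histone = percent_input(histone_ip, input_signal, input_fraction)
    return mark / histone

# Hypothetical telomeric signal intensities (arbitrary units); NOT data from the study.
control = mark_per_histone(mark_ip=200.0, histone_ip=400.0, input_signal=1000.0)
knockdown = mark_per_histone(mark_ip=100.0, histone_ip=400.0, input_signal=1000.0)

print(control, knockdown, knockdown / control)
```

      Dividing by the parental histone signal separates a genuine loss of the modification from a general loss of nucleosomes at telomeres, which is the distinction the reviewer asked for.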

      Most experiments explore chromatin changes in telomerase-positive BJT fibroblasts (Figure 2, Figure 4D). It remains unclear whether similar manipulations in ALT cells yield consistent effects, which would give a broader context for ALT phenotype induction. Are ALT phenotypes similarly induced in ALT cells? Does altered chromatin status affect telomere length or telomerase recruitment/activity? Can these pathways drive ALT phenotypes in non-immortalised cells?

      Response: We appreciate the reviewer's suggestion and have explored chromatin changes in telomerase-negative BJ and IMR90 primary fibroblasts (Supplementary Fig. 2C, D). Consistent with the results in BJ-telomerase cells, we found that VP64-TRF1 decreased telomeric H3, H4, and H3K9me3 levels, whereas KRAB-TRF1 increased these marks. Moreover, expression of either VP64-TRF1 or KRAB-TRF1 was sufficient to induce APB formation and ATDs in BJ and IMR90 cells. These results indicate that chromatin changes at telomeres can drive ALT phenotypes in both primary and telomerase-immortalized fibroblasts.

          Additionally, regarding whether chromatin alteration affects telomere length or telomere regulation, we have explored telomere length changes in BJT cells expressing vector, TRF1, KRAB-TRF1, or VP64-TRF1. The telomere restriction fragment (TRF) assay showed that cells under all conditions maintained static telomere lengths through 30 days in culture (data shown below), suggesting that the chromatin alterations may not impact telomerase recruitment or activity. As this result is beyond the scope of the current study, the data are shown only here in the rebuttal letter for reference and are not included in the revised manuscript.
      
          Moreover, according to the reviewer's suggestion, we also carried out VP64-TRF1 or KRAB-TRF1 expression experiments in WI38-VA13/2RA cells that express high TERRA and have altered chromatin structures. Our data revealed that VP64-TRF1 suppresses telomere H3K9me3 and ALT activity, while KRAB-TRF1 increases both (Supplementary Figure 2E), suggesting an association of heterochromatin state with ALT activation in WI38-VA13/2RA cells.
      
          The observation that VP64-TRF1 reduces ALT activity in WI38-VA13/2RA cells contrasts with findings in BJT cells. It is worth noting that studies from the Azzalin and Lingner groups demonstrated that experimentally induced TERRA expression promotes ALT activity in ALT and non-ALT cells (PMID: 36122232, PMID: 40624280). Therefore, we propose that TERRA upregulation by VP64-TRF1 may contribute to the ALT induction observed in BJT cells (Supplementary Figure 2A, B), whereas the ability of VP64-TRF1 to suppress ALT activity in WI38-VA13/2RA cells could be attributed to the reduction of telomeric H3K9me3 and heterochromatin loss. Importantly, KRAB-TRF1 concurrently enhanced histone H3, H4, and H3K9me3 occupancy and ALT activity in both human fibroblasts and ALT cells. Altogether, these results support the notion that heterochromatin formation triggers ALT.
      
          We also examined TRIM28 recruitment to telomeres by telomere-ChIP and found that COUP-TF2LBD-TRF1 promotes TRIM28 telomere enrichment in BJ, IMR90, and U2OS cells, similar to BJT cells (Supplementary Fig. 5A-D). Moreover, in the ALT cell lines WI38-VA13/2RA, U2OS, and Saos-2, depletion of COUP-TF2 or TR4 reduced TRIM28 telomeric association (Figure 4A, B). Together, the data from human fibroblasts and ALT cells support a role of orphan NRs in recruiting TRIM28 to ALT telomeres.
      

      We acknowledge the reviewer's suggestions, which allowed us to clarify and strengthen the conclusions. The corresponding data are presented in Figure 4A-B and Supplementary Figure 2B-D and 5E-F, and the main text has been modified on pages 4-6 in the revised manuscript.

      When referring to Figure 3G, the authors state that telomeric H3K9me3 was abolished upon depleting TRIM28 from the U2OS and WI38-VA13/2RA cells. Abolished is a strong word for a 50% decrease, and this sentence should be revised. The reduction appears greater than that seen with COUP-TF2/TR4 depletion. Are the effects additive? If so, might TRIM28 act, at least in part, independently of COUP-TF2/TR4?

      Response: We appreciate the reviewer's comments. We have revised the manuscript on page 5, replacing "abolished" with "significantly reduced" to better describe the effect of TRIM28 depletion on telomeric H3K9me3. To further investigate the interplay between TRIM28 and orphan NRs in regulating telomeric H3K9me3, we conducted single and combined knockdown experiments in U2OS and WI38-VA13/2RA cells, followed by telomere-ChIP analysis (Supplementary Figures 4D, E). Our results showed that single depletion of either orphan NRs or TRIM28 leads to a similar decrease in telomeric H3K9me3, and that combined knockdown does not result in any further reduction. These findings support an epistatic interaction between orphan NRs and TRIM28 in the regulation of telomeric H3K9me3. We have expanded on this interpretation in the main text (page 6) and included the relevant data in Supplementary Figures 4D, E.

      VA13 cells consistently exhibit stronger effects than U-2 OS (e.g., Figures 1 and 3). This discrepancy could be linked to the high content of variant repeats in VA13 cells. The authors should assess whether variant repeat content underlies the differential response. Repeating key experiments in additional ALT lines with varied repeat compositions would be informative.

      Response: We appreciate the reviewer's suggestion and have extended our analyses to two additional ALT osteosarcoma cell lines, SAOS-2 and G292. In both lines, depletion of orphan NRs resulted in a consistent decrease in telomeric H3K9me3 levels (Supplementary Figures 1A, B). We also examined the contribution of TRIM28 to telomeric H3K9me3 in these cells. siRNA-mediated depletion of TRIM28 in SAOS-2 and G292 cells similarly caused a significant reduction in telomeric H3K9me3 and ALT phenotypes (Supplementary Figure 4A-C). Together, these results from four ALT cell lines confirm that orphan NRs and TRIM28 promote telomeric H3K9me3 formation in ALT cells. We have modified the main text on pages 3 and 5-6 for these results.

      In line with the previous point, it would be useful to show whether TRIM28 telomeric enrichment is affected by COUP-TF2/TR4 depletion in U2OS cells (Figure 4C). To improve confidence in these findings, the authors should perform telomeric ChIP assays, especially with the COUP-TF2^LBDΔAF2-TRF1 mutant construct.

      Response: Following the reviewer's suggestion, we performed telomere-ChIP assays to assess TRIM28 enrichment at telomeres upon expression of COUP-TF2LBD-TRF1 and its ΔAF2 mutant in U2OS cells. Consistent with our immunofluorescence results, telomere-ChIP revealed that COUP-TF2LBD-TRF1 expression promotes TRIM28 telomere enrichment, while the AF2 deletion mutant failed to recruit TRIM28 (Supplementary Figure 5D). We have modified the main text on page 6 for this result.

      The immunoprecipitation experiments showing TRIM28 association with orphan receptors should include benzonase treatment to rule out DNA-mediated co-association (Figure 4F-G).

      Response: We appreciate the reviewer's suggestion. To address the possibility of DNA-mediated interactions, we pre-incubated cell lysates with benzonase prior to Co-IP (Page 7). This treatment did not disrupt the association between TRIM28 and COUP-TF2 or TR4 in WI38-VA13/2RA and BJT cells (Supplementary Figures 5E-G), indicating a DNA-independent interaction. We have modified the main text on page 7 for this result.

      The study would benefit from a direct assessment of whether COUP-TF2LBDΔAF2-TRF1 fails to induce ALT phenotypes in BJT fibroblasts.

      Response: We thank the reviewer for this suggestion. As the role of the COUP-TF2 AF2 domain in ALT induction in BJT fibroblasts has recently been thoroughly investigated and published by our group (PMID: 38752489), we have directed the current study towards a more detailed mechanistic question. Specifically, we have carried out experiments to further demonstrate that COUP-TF2 recruits TRIM28 to telomeres via its AF2 domain in both human fibroblasts and ALT cells (Supplementary Figures 5A-D). On Page 6, we have modified the main text for these results and included a citation to our previous publication to provide the necessary background.

      The experiments performed in Figure 5E-H lack a vector-only + siCtrl control. In Figure 5E, the observation that APB formation is restored in siTRIM28 + Vector-treated cells is unexpected. The authors should address this finding and clarify whether this reflects biological noise or a compensatory effect.

      Response: We thank the reviewer for this suggestion. We have repeated the experiments with a revised design, ensuring a consistent vector background across all groups (Vector + siCtrl, Vector + siTRIM28, TRIM28 WT + siTRIM28, and TRIM28 ΔRBCC + siTRIM28) (Figure 5E-H). This improved design confirms that expression of wild-type TRIM28, but not TRIM28 ΔRBCC, restores APB formation, ATDS, ssTeloC, and telomeric H3K9me3 levels in TRIM28-depleted cells. The updated dataset also resolves the previous unexpected increase in APB formation in the siTRIM28 + Vector condition, which is now excluded. We have modified the main text accordingly on page 8.

      Reviewer #1 (Significance (Required)):

      This work offers valuable mechanistic insight into how COUP-TF2 and TRIM28 coordinate to regulate heterochromatin deposition and ALT phenotype formation. It adds to the growing understanding of chromatin-mediated telomere regulation. What remains unclear is how important this interaction is for ALT maintenance, as H3K9me3 is only moderately altered upon TRIM28 depletion in ALT cells. Depletion of TRIM28 has been shown previously to induce APB formation and telomere elongation in U-2 OS ALT cells (Wang et al., 2021), the opposite to what the authors observed here in VA13 cells (Figure 5E-H). Clarifying whether these differences are variant repeat-dependent, or reflect intrinsic features of specific ALT cell lines, would substantially elevate the study's impact.

      Response: We appreciate the reviewer's recognition of the significance of our work in elucidating the molecular basis of ALT regulation through COUP-TF2-TRIM28-mediated heterochromatin formation. In response to the reviewer's insightful comment regarding the importance of this interaction for ALT maintenance, we have expanded our study. We now include data from three additional primary human fibroblasts and a total of four ALT cancer cell lines (Figure 4, Supplementary Figure 4). These new data further strengthen the conclusion that TRIM28 promotes telomeric H3K9me3 and ALT-associated features. Furthermore, our rescue experiments support the model that the ALT-promoting function of TRIM28 in both fibroblasts and ALT cell lines is mediated through its physical interaction with COUP-TF2 (Supplementary Figure 5). We believe these results provide a solid foundation for demonstrating a cooperative role of COUP-TF2 and TRIM28 in ALT maintenance, and address the reviewer's concern regarding the generalizability of our findings.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary: This manuscript investigates the role of orphan nuclear receptors (ORs), specifically COUP-TF2 and TR4, in promoting H3K9me3 enrichment at ALT telomeres via recruitment of TRIM28 (KAP1). The authors propose that the AF2 domain of COUP-TF2, located in its ligand-binding domain (LBD), is sufficient to recruit TRIM28 to telomeres. This, in turn, promotes heterochromatinization and induces hallmarks of the Alternative Lengthening of Telomeres (ALT) pathway, including APB formation and telomeric DNA synthesis outside of S-phase. This study addresses one important and unresolved question in the field: by what mechanism is the heterochromatic state established at ALT telomeres? Another timely question, not addressed here, is: how is heterochromatin (specifically H3K9me3) functionally linked to ALT? The findings are potentially novel and mechanistically insightful. However, key elements of the study, particularly the central tethering experiments, require stronger quantification and clarity. Additional mechanistic tests and literature adjustments would also improve the manuscript.

      Major Concerns

      Central TRF1-COUP-TF2-LBD result lacks quantification and clarity: the tethering of COUP-TF2's LBD to telomeres via TRF1 is a core result of the paper. This experiment demonstrates that this domain is sufficient to induce weak H3K9me3 enrichment and ALT features (APBs and ATDS). However, the supporting ALT data are presented only in Supplementary Figures S1A and S1B, and are not quantified. These data should be quantified with appropriate statistics and moved to a main figure.

      Response: The current study builds upon our recent publication (PMID: 38752489), which comprehensively analyzed ALT induction (APBs, ATDS, C-circles, T-SCEs) by orphan NR-TRF1 expression (COUP-TF1, COUP-TF2, TR2, and TR4; full-length and LBD) in various human fibroblast cell lines. To avoid potential duplicate publication concerns, particularly regarding APB and ATDS results for COUP-TF2LBD-TRF1 in BJT cells, we have put the data with revised quantification results in Supplementary Figure 1D-E. We will follow the reviewer's suggestion and move this data to the main figures if the editors agree.

      Furthermore, the broader functional implication is not explored. Does this tethering induce a fully functional ALT pathway? For example, can telomerase knockout cells expressing TRF1-COUP-TF2-LBD maintain long-term proliferation? Such evidence would significantly strengthen the impact of the study.

      Response: While COUP-TF2LBD-TRF1 expression rapidly induces key ALT phenotypes, we acknowledge that this alone is insufficient to directly promote telomere lengthening and long-term proliferation of primary fibroblasts, as discussed in Gaela et al., 2024 (PMID: 38752489). However, our ongoing, unpublished studies indicate that COUP-TF2LBD-TRF1 can drive immortalization of primary BJ fibroblasts expressing SV40LT by promoting ALT-mediated telomere elongation (Attached Figure A-C; additional data not shown). These findings suggest that COUP-TF2 may cooperate with additional genetic or epigenetic alterations to facilitate ALT development. We appreciate the reviewer's recognition of this critical aspect. As our immortalization study is still in progress and will be the subject of a separate manuscript, we hope the reviewer understands that the data shown in this letter will not be included in the revised manuscript.

      Chromatin manipulation experiments lead to ambiguous conclusions: the authors propose that telomeric heterochromatin promotes ALT activity, but their own experiments (e.g., Figure 2) show that both heterochromatin-inducing (KRAB-TRF1) and euchromatin-inducing (VP64-TRF1) tethering can trigger ALT-like features. This makes it difficult to conclude that heterochromatin is specifically required.

      To clarify:

      -Did the authors express TRF1-VP64 in an ALT cell line? According to their model, this should suppress ALT activity.

      -More broadly, do chromatin alterations per se (regardless of direction) trigger ALT features? Clarifying these points is important for interpretation.

      Response: In response to the reviewer's suggestion, we expressed VP64-TRF1 and KRAB-TRF1 in WI38-2RA/VA13 cells to investigate telomere chromatin changes and ALT activity. Our data indeed revealed that VP64-TRF1 suppresses telomere H3K9me3 and ALT activity, while KRAB-TRF1 increases both (Supplementary Figure 2E), suggesting that heterochromatin triggers ALT activation.

      The observation that VP64-TRF1 reduces ALT activity in WI38-2RA/VA13 cells contrasts with findings in BJT cells. Of note, studies from the Azzalin and Lingner groups demonstrated that experimentally induced TERRA expression promotes ALT activity in ALT and non-ALT cells (PMID: 36122232, PMID: 40624280). Therefore, we propose that TERRA upregulation may contribute to the ALT induction observed in BJT cells (Figure 2A, Supplementary Figure 2A, B). Given the high basal TERRA expression, expression of VP64-TRF1 and KRAB-TRF1 did not result in a consistent change in TERRA levels (Supplementary Figure 2F). Thus, the ability of VP64-TRF1 to suppress ALT activity in WI38-2RA/VA13 cells could be attributed to the reduction of telomere H3K9me3 and heterochromatin loss. Altogether, our results support the hypothesis that heterochromatin formation, rather than euchromatin, triggers ALT.

      We thank the reviewer for the insightful comments, which have allowed us to resolve the ambiguity of our results and strengthen the notion that heterochromatin formation promotes ALT. We think that the heterochromatin features and high TERRA expression represent two independent, coexisting mechanisms within ALT cancer cells to guarantee ALT activation. We have modified the main text on pages 4-5 accordingly.

      TERRA downregulation contradicts current models: while TERRA upregulation is often observed in ALT cells and is thought to contribute to replication stress and recombination at telomeres, the authors show that TRF1-KAP1 expression induces ALT features while TERRA is downregulated. This observation is not addressed in the manuscript. The authors should at least discuss this discrepancy and propose whether this reflects a cell line-specific phenomenon or a decoupling between TERRA levels and ALT induction in this context.

      Response: We thank the reviewer for the comments. As mentioned above (Major Concerns 2), heterochromatin formation and TERRA expression are two mechanisms that can independently promote ALT. Unlike ALT cell lines, which have high TERRA levels, human BJ fibroblasts have low TERRA levels that do not induce ALT phenotypes. Thus, the effect of KRAB-TRF1 on ALT induction in BJ cells could be attributed to heterochromatin formation, but not to the reduction of TERRA. We have modified the main text on page 5 to clarify the result.

      Minor Comments

      Introduction (p. 3): The authors cite Episkopou et al. as showing increased H3K9me3 at ALT telomeres. This is incorrect; that paper suggests the opposite. The first study to clearly demonstrate H3K9me3 enrichment at ALT telomeres is Cubiles et al., 2018 and should be cited instead. Results (p. 5, first paragraph): The manuscript should cite Déjardin and Kingston, 2009 as the first to report COUP-TF2 and TR4 localization at ALT telomeres. The studies by Conomos et al., 2012 and Gaela et al., 2024 build on this prior evidence. Please also include this citation in the bibliography.

      Response: We appreciate the reviewer's careful reading and for pointing out these errors. The citation errors on pages 2 and 3 have now been corrected.

      Broader relevance of TRIM28-OR interaction: TRIM28 is a complex protein with roles in SUMOylation, heterochromatin formation, and transcriptional initiation/elongation regulation.

      The authors should explore whether similar COUP-TF2/TRIM28 interactions occur at other genomic loci. Public ChIP-seq data for COUP-TF2, TR4, and TRIM28 could be mined to investigate whether these factors co-occupy regulatory regions elsewhere in the genome, and how this relates to gene expression states.

      Response: We appreciate the reviewer's insightful suggestion regarding a potential genome-wide functional interaction between TRIM28 and COUP-TF2. To address this, we analyzed public ENCODE ChIP-seq data from K562 cells (TRIM28: ENCSR000BRW; COUP-TF2: ENCSR000BRS). This analysis revealed 3,326 co-binding sites for TRIM28 and COUP-TF2 (Attached Figure A). Interestingly, these co-binding sites were preferentially located within gene bodies (70.7%) and promoter regions (4.3%) (Attached Figures B-D), suggesting a potential cooperative role in gene regulation that aligns with our observation of physical interaction. While the finding is intriguing, a full exploration is beyond the scope of this manuscript, which focuses on ALT telomere regulation. We consider this an important insight and have briefly noted it in the discussion (p. 9), although the corresponding analyses are not included in the revised manuscript.
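      For readers unfamiliar with how co-binding sites are typically counted from two ChIP-seq peak sets, the following is a minimal sketch only. It assumes that a "co-binding site" means a peak from one factor that overlaps at least one peak from the other on the same chromosome; the response does not state the criterion or software actually used, and the function name and toy coordinates below are hypothetical.

```python
from collections import defaultdict


def count_overlapping_peaks(peaks_a, peaks_b):
    """Count peaks in peaks_a that overlap at least one peak in peaks_b.

    Each peak is a (chrom, start, end) tuple with half-open coordinates,
    so peaks that merely touch (end == start) do not count as overlapping.
    The overlap criterion is an assumption made for illustration.
    """
    # Group the second peak set by chromosome for per-chromosome comparison.
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    count = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(b_start < end and start < b_end
               for b_start, b_end in by_chrom.get(chrom, [])):
            count += 1
    return count


# Toy, made-up coordinates (not ENCODE data).
trim28_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
coup_tf2_peaks = [("chr1", 150, 250), ("chr2", 300, 400), ("chr1", 600, 700)]
n_cobound = count_overlapping_peaks(trim28_peaks, coup_tf2_peaks)
```

      On real peak files, a dedicated interval tool would normally replace this quadratic per-chromosome scan; the sketch only makes the overlap criterion explicit.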

      Reviewer #2 (Significance (Required)):

      This work contributes mechanistic insight into how heterochromatin is established at ALT telomeres, an important and timely question in telomere biology and cancer research. It offers a noncanonical recruitment mechanism for TRIM28, independent of KRAB-ZNFs, and highlights the functional role of orphan nuclear receptors in telomeric chromatin regulation. The study has potential implications for understanding ALT regulation and for identifying new intervention points in ALT-positive cancers. The work is conceptually interesting, but the conclusions are currently limited by insufficient quantification, some interpretative ambiguities, and a few overlooked references. Addressing the concerns listed above would significantly enhance the rigor and impact of the manuscript.

      Response: We appreciate the reviewer's recognition of the significance of our work in elucidating the molecular basis of ALT regulation through COUP-TF2-TRIM28-mediated heterochromatin formation. We also thank the reviewer for the valuable feedback, which has significantly strengthened our manuscript.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      1. General Statements

      In this study, we mechanistically define a new molecular interaction linking two of the cell's major morphological regulatory pathways: the Rho GTPase and Hippo signaling networks. Both pathways are required for life across huge swaths of the tree of life. They are required for the dynamic organization and reorganization of proteins, lipids, and genetic material that occurs in essential cellular processes such as division, motility and differentiation. For decades these pathways have been studied almost exclusively independently, although they are known to act in concert in cancer to drive cytoskeletal remodeling and morphological changes that promote proliferation and metastasis. Mechanistic insight into how they are coordinated, however, is lacking.

      Our data reveal a mechanistic model where coordination is mediated by the RhoA GTPase-activating protein ARHGAP18, which forms molecular interactions with both the tumor suppressor Merlin (NF2) and the transcriptional co-regulator YAP (YAP1). Using a combination of state-of-the-art super-resolution microscopy (STORM, SORA-confocal) in cultured human cells, biochemical pulldown assays with purified proteins, and analyses of tissue-derived samples, we characterize ARHGAP18's function from the molecular to the tissue level in both native and cancer model systems.

      Together, these findings establish a previously unrecognized molecular connection between the RhoA and Hippo pathways and culminate in a working model that integrates our current results with prior work from our group and decades of prior studies. This model provides a new conceptual framework for understanding how RhoA and Hippo signaling are coordinated to regulate cell morphology and tumor progression in human cells.

      In this substantially revised manuscript, we have addressed all comments from the expert reviewers, as described point-by-point below. A shared major comment from the reviewers was the request for direct evidence of the proposed mechanistic model. To address these constructive comments, we have added new experiments, new quantification, new text, and new control data, and have added two expert authors along with super-resolution mouse tissue imaging data for the study of endogenous ARHGAP18 in its native condition. We believe that these additions greatly enhance the manuscript and collectively address the overall message of the reviewers' comments.

      2. Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This manuscript describes a dual mechanism by which ARHGAP18 regulates the actin cytoskeleton. The authors propose that in addition to the known role for ARHGAP18 in regulating Rho GTPases, it also affects the cytoskeleton through regulation of the Hippo pathway transcriptional regulator YAP. ARHGAP18 knockout Jeg3 cells were generated and show a clear loss of basal stress-fiber-like F-actin bundles. The authors further characterize the effects of ARHGAP18 knockout and overexpression. It is also discovered that ARHGAP18 binds to the Hippo pathway regulator Merlin and to YAP. Ultimately it is concluded that ARHGAP18 regulates the F-actin cytoskeleton through dual regulation of RHO GTPases and of YAP. While the phenotype of the ARHGAP18 knockout and the association of ARHGAP18 with Merlin and YAP is interesting, I found the authors' conclusion that these phenotypes are due to ARHGAP18 regulation of both RHO and YAP to be based on largely correlative evidence and sometimes lacking in controls or tests for significance. In addition, the authors often make overly strong conclusions based on the experimental evidence. In some instances, the rationale for how the experimental results support the conclusion is insufficiently articulated, making evaluation challenging. In general, although the authors have some interesting observations, more definitive experiments with proper controls and statistical tests for significance and reproducibility are needed to justify their overall conclusions.

      We appreciate the reviewer's constructive comments and have added substantial new data and quantifications to address these concerns. We have focused these new data on directly testing the proposed mechanisms, adding controls, and performing quantitative analysis with statistical testing. Additionally, we have edited our language to make our rationale clearer and to present our conclusions as a more moderate assessment of our experimental results. Below we respond to the specific comments made by the reviewer, followed by a list of additional editorial changes we've made based on the reviewer's overarching comments on clarity and rationale.

      Specific Comments

      1) The authors make a big point about the effects of ARHGAP18 on myosin light chain phosphorylation. However, this result is not quantified and tested for statistical significance and reproducibility.

      We thank the reviewer for their comments on our western blot quantification. The original submission included quantification of the RhoA downstream signaling readouts pCofilin/Cofilin and pLIMK/LIMK. We had withheld the pMLC and MLC quantification because that result was previously published, with quantification, reproducibility, and statistical significance, in our group's prior manuscript on ARHGAP18 in eLife in 2024 (Fig. 4E of https://doi.org/10.7554/eLife.83526). However, these prior results lacked the new overexpression data. We recognize the need to add these data to this manuscript as requested by the reviewer.

      To address the reviewer's comment, we have added quantification of pMLC/MLC (Fig. 1F).

      2) Along similar lines in Figure 2C they state that overexpression of ARHGAP18 causes cells to invade over the top of their neighbors. This might be true and interesting, but only a single cell is shown and there is no quantification or controls for simply overexpressing something in that cell. The authors also conclude from this image that the overexpression phenotype is independent of its GAP activity on Rho. It is not clear how this conclusion is made based on the data. It would seem like a more definitive experiment would be to see if a similar phenotype was induced by an ARHGAP18 mutant deficient in GAP activity.

      Based on the reviewer's comment, we recognize that the qualitative statements made about Figure 2C (now Figure 3) should have been more quantitative. We have added a control of WT Jeg3 cells expressing an empty FLAG vector to show that WT cells do not invade over the top of each other (Fig. 3F). Additionally, we have added the quantification found in Fig. 3E, which shows the percentage of invasive versus non-invasive cells in WT and ARHGAP18-overexpressing cells. We have clarified our conclusions to make clear that these data do not directly test whether the invasive phenotype derives from a Rho-independent mechanism. The text now states the following conclusion alongside others, which can be seen in our tracked changes:

      "These data support the conclusion that ARHGAP18 acts to regulate basal and junctional actin. However, it was not clear whether this activity occurred through a Rho-independent or a Rho-dependent mechanism."

      We have added new data of cells expressing an ARHGAP18 mutant deficient in GAP activity, which is explained in detail in the following response below.

      3) In Figure 3 the authors compare gene expression profiles of ARHGAP18 knockout cells to wild-type cells. They see lots of differences in focal adhesion and cytoskeletal proteins and conclude that this supports their conclusion that ARHGAP18 is not just acting through RHO. The rationale for this is not clear. In addition, they observe changes in expression profiles consistent with changes in YAP activity. They conclude that the effects are direct. This very well might be true. However, RHO is a potent regulator of YAP activity and the results seem quite consistent with ARHGAP18 acting through RHO to affect YAP.

      We thank the reviewer for their comment and believe the revised manuscript now presents direct evidence to support the conclusions, through edited text and the incorporation of new data.

      First, the reviewer highlighted that we were not clear in our rationale and explanation of the conclusions made from our RNAseq data in the new Figure 4 (previously Figure 3). We agree with the reviewer that the RNAseq data alone are not sufficient rationale for the conclusion that ARHGAP18 is acting through YAP directly. In the revised manuscript, the conclusion is now based on the combination of our multi-faceted investigation of the relationship between ARHGAP18 and YAP (most importantly, new Figure 5). We would also emphasize that our RNAseq analysis is more robust and specific than a descriptive assay reporting many differences in cytoskeletal proteins. We recruited an outside RNAseq expert collaborator, Dr. Yongho Bae, to perform state-of-the-art IPA analysis and a grueling manual curation of the top hit genes to identify the predominant signaling pathways linking the loss of ARHGAP18 to known YAP translational products. We've provided a supplemental table listing each citation supporting the identified YAP pathway associations from this manual curation. We have also added a new discussion paragraph on the RNAseq data to clarify our specific RNAseq results and analysis. In the revised manuscript, we have moderated our language in the results text regarding the RNAseq data to reflect the reviewer's suggestion:

      "Our RNAseq data alone could not independently confirm if the alterations to transcriptional signaling and expression of actin cytoskeleton proteins were through a Rho-dependent or Rho-independent mechanism."

      Second, in this comment and the one above, the reviewer highlights the need for a new experiment to directly test the Rho-independent effects of ARHGAP18, which we now provide in the new Figure 5. For these new data, we applied an experimental design suggested by reviewer 2 regarding the same concern. In short, we produced and expressed a point mutant variant, ARHGAP18(R365A), which abolishes the Rho GAP activity while leaving the remainder of the protein intact. This construct allows us to directly test the effects of ARHGAP18 independent of its RhoA GAP activity. We find that the GAP-deficient ARHGAP18 is able to fully rescue basal focal adhesions, indicating that the basal actin phenotype is at least in part regulated through a Rho-independent mechanism.

      We believe the revised manuscript, when taken in totality, provides the definitive proof requested by the reviewer. Specifically, the combination of Figure 5, where we show new data using the ARHGAP18(R365A) variant, and the result that ARHGAP18 forms a stable complex with YAP (Fig. 6G) or Merlin (Fig. 6A), is supportive of direct Rho-independent molecular interactions between YAP, Merlin, and ARHGAP18.

      4) In Figure 4A showing Merlin binding to ARHGAP18 there is no control for the amount of Merlin sticking to the column as was done in Figure 4F for binding experiments with YAP. This makes it difficult to determine the significance of the observed binding.

      We have performed the requested control experiment and added the results to Figure 6A.

      5) The images in Figure 4C show YAP being maintained in the nucleus more in ARHGAP18 knockout cells compared to wild-type. However, the images only show a few cells and YAP localization can be highly variable depending on where you look in a field. Images with more cells and some sort of quantification would bolster this result.

      We have provided quantification (Figure 6D) of what was originally Figure 4C (now Figure 6C).

      Reviewer #1 (Significance (Required)):

      While the phenotype of the ARHGAP18 knockout and the association of ARHGAP18 with Merlin and YAP is interesting, I found the authors' conclusion that these phenotypes are due to ARHGAP18 regulation of both RHO and YAP to be based on largely correlative evidence and sometimes lacking in controls or tests for significance. In addition, the authors often make overly strong conclusions based on the experimental evidence. In some instances, the rationale for how the experimental results support the conclusion is insufficiently articulated, making evaluation challenging. In general, although the authors have some interesting observations, more definitive experiments with proper controls and statistical tests for significance and reproducibility are needed to justify their overall conclusions.

      In the above comments, we detail the specific definitive experiments, proper controls, and statistical tests for significance, requested by the reviewer, which we believe greatly strengthen our manuscript.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      This manuscript investigates the Rho effector, ARHGAP18, in Jeg3 cells, a trophoblastic cell line. It presents a number of new pieces of data, which increase our understanding of the importance of this GAP in cell function and explain at a molecular level previous results of other workers in the field. ARHGAP18 was originally given the name "conundrum" and continues to stand apart from the majority of other GAP proteins and their functions. Hence the data here is significant and of high standard.

      The data is clear, and the images are of high quality and extremely impressive in their resolution. It is significant and adds a further layer to our understanding of the regulation of cell migration, particularly in the formation and resolution of microvilli.

      We appreciate the reviewer's comments and supportive insights.

      The data is based on the use of the cell line Jeg3. Even the authors' previous publication in eLife is based only on this cell line. They need to show the conclusions are general and not specific to this line of cells. As an extension of this, is the ARHGAP18 function shown here only in transformed cells? Do the same mechanisms operate in normal cells, which respond to activation to proliferate or migrate?

      We respectfully point out that the critical experiments of the prior eLife publication were validated in DLD-1 colorectal cells and not Jeg3 cells alone (Figure 1-figure supplement 2). Our newly independent lab, established just over a year ago, is unable to perform a full expansion of the manuscript using untransformed cells. However, we agree with the reviewer's perspective and wish to address the comment to the best of our current capability. To address the reviewer's suggestions, we have recruited Dr. Christine Schaner Tooley, an expert in mouse model system studies. In the revised manuscript, we've added new super-resolution SORA confocal images of endogenous ARHGAP18's localization in intact intestinal villi tissue and apical junctions of WT mice (Fig. 1A-C). These data indicate that endogenous ARHGAP18 is enriched (but not exclusively localized) at the apical plasma membranes of normal WT epithelial cells. This localization, where both Merlin and Ezrin are present at apical membranes/junctions under normal conditions, is a major component of the working model proposed in Fig. 7. These data also indicate that ARHGAP18 is capable of entering the nucleus in WT cells, another critical aspect of our proposed model. Collectively, our previously published DLD-1 studies and our new studies using WT mouse tissue samples support the conclusion that at least some of ARHGAP18's functions described in this manuscript are not limited to Jeg3 cells.

      In endothelial cells, Lovelace et al 2017 showed localization to microtubules and that depletion of ARHGAP18 resulted in microtubule instability. The authors may like to comment on the differences. Is this a cell type difference or RhoA versus RhoC difference?

      In our previous publication (Lombardo et al., 2024, eLife), we validated the finding that ARHGAP18 forms a complex with microtubules, as we detected tubulin in the ARHGAP18 pulldown experiment (Figure 1- Source Data). However, our data indicate that in Jeg3 cells ARHGAP18 does not localize to the same microtubule-associated spheres observed in the Lovelace publication. We now comment on the shared conclusions and differences between this manuscript and Lovelace et al., 2017 in the discussion section.

      "In endothelial cells, ARHGAP18 has been reported to localize to microtubules and plays a role in maintaining proper microtubule stability (Lovelace et al., 2017). In our epithelial cell culture models and WT mouse intestine, we have been unable to detect ARHGAP18 at microtubules, suggesting ARHGAP18 may have additional functions in various cell types."

      On pages 7,9 they conclude that MLC and basal and junctional actin are regulated through a GAP independent mechanism. The best way to show this is with overexpression of a GAP mutant.

      We appreciate the reviewer's insight and have produced and expressed a GAP mutant, ARHGAP18(R365A), in our cells, directly testing our conclusion that ARHGAP18 has a GAP-independent function. These data are now presented in revised Figure 5 and explained further in response to reviewer #1.

      There is a huge amount of data presented in Figure 3, but their 2 genes which they focus on, LOP1 and CORO1A, are discussed but no actual data presented in support.

      We now validate CORO1A by qPCR in Figure 4J.

      Reviewer #2 (Significance (Required)):

      The data is significant and adds a further layer to our understanding of the regulation of cell migration, particularly in the formation and resolution of microvilli. This manuscript will be of significance to a basic science audience in the field of RhoGTPases and cell migration.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The study by Murray et al explores the effects of ARHGAP18 on the actin cytoskeleton, Rho effector kinases, non-muscle myosin, and transcription. Using super resolution microscopy, they show that in ARHGAP18 KO cells there is a mixed and unexpected cytoskeleton phenotype where myosin phosphorylation appears to be increased, but actin is disorganised with reduced stress fibres, diminished focal adhesions and augmented invasiveness. They conclude that the underlying mechanisms are likely independent from RhoA. Next, they perform RNAseq using the KO cells and identify an array of dysregulated genes, including those that play crucial roles in microvilli (related to previously published findings). Analysis of the data identifies gene expression changes that are relevant for altered focal adhesion (integrins). Further analysis reveals that a large cohort of the dysregulated genes are YAP targets. They then show that in ARHGAP18 KO cells YAP nuclear localization, as detected by immunostaining, is augmented; and demonstrate that immobilized ARHGAP18 protein can bind the Hippo regulator merlin as well as YAP itself.

      Major comments:

      1. The premise of the study (that ARHGAP18 is a RhoA effector or may act independently of RhoA) remains unproven.

      We have added new evidence of direct RhoA independent activity for ARHGAP18 described in the above comments. Specifically, we've added data using a RhoA-GAP dead variant of ARHGAP18 in Figure 5, which we believe addresses this comment.

      At several places (including in the title) the authors refer to ARHGAP18 as a Rho effector, which would suggest that it is downstream from Rho, but the basis for this is not clear. In fact, their own previous study suggested that ARHGAP is a RhoA regulator, rather than an effector. In general, the connection of the described effects to RhoA remains unclear, and is not addressed in this study. The authors seem to go back and forth in their conclusions regarding the connection between ARHGAP18 and RhoA. For example, the first section of results is finished by stating (line 194): "These data support the conclusion that ARHGAP18 acts to regulate basal and junctional actin through Rho-independent mechanism". But the next section starts by stating (line 198): "We hypothesized that the invasive and cytoskeletal phenotypes observed at the basal surface of cells devoid of ARHGAP18 may be a result of changes in regulation at the transcriptional level either directly through RhoA signaling or through an additional mechanism specific to ARHGAP18". The paper would be strengthened by adding data that show whether the effects are indeed downstream from RhoA or RhoA independent. If there is no sufficient demonstration that ARHGAP18 is downstream of RhoA and is an effector, this needs to be stated explicitly, and the wording should be changed.

      We now provide new data in Figure 5, which directly tests the RhoA independent functions of ARHGAP18 as recommended by the reviewer. Our understanding of the term effector is 'a molecule that activates, controls, or inactivates a process or action.' Based on this understanding, we used the term to convey ARHGAP18's functional role within the feedback loop, rather than to imply that it acts exclusively downstream.

      We seek to clarify our perspective on the reviewer's assertion that we go "back and forth" as to whether ARHGAP18 functions in a Rho-dependent or Rho-independent manner. It was our intent to propose a model where ARHGAP18 acts in two separate circuits that regulate cell signaling. The first circuit involves ARHGAP18's canonical RhoA GAP activity, which involves ERMs and LOK/SLK, and is limited to the apical plasma membrane. This first signaling circuit was characterized in our prior eLife manuscript (Lombardo et al., 2024) and in an earlier JCB manuscript (Zaman and Lombardo et al., 2021). In this newly revised manuscript, we provide a partial mechanistic characterization of the second circuit, which we freely admit is much more complex and will likely require additional study to fully characterize.

      As both circuits operate as signaling feedback loops, we find the terms 'upstream' and 'downstream' to be of limited value, and we attempt to avoid their use when possible. We retain their use only when referring to the Hippo and ROCK signaling cascades, where these designations are well established. We suggest that the conceptual inconsistencies of Conundrum/ARHGAP18 may have arisen from the tendency to view it in strictly binary terms as upstream or downstream. Here, we propose a third possibility that ARHGAP18 functions as both, participating in a negative feedback loop.

      We have added data testing whether the effects are Rho independent, and we have edited the discussion text in response to the reviewer's comments to clarify the molecular function of ARHGAP18:

      "Additionally, focal adhesions and basal actin bundles are restored to WT levels when the ARHGAP18(R365A) GAP-ablated mutant is expressed in ARHGAP18 KO cells (Fig. 5A, B). These results represent the strongest argument that ARHGAP18 functions in additional pathways to RhoA/C alone. Our data suggest that at least one of the alternative pathways is through ARHGAP18's interaction with YAP and Merlin. From these data we conclude that ARHGAP18 has important functions both in RhoA signaling, through its GAP activity, and in Hippo signaling, through its GAP-independent binding partners."

      The study is descriptive and contains a series of observations that are not connected. Because of this, the study's conclusions are not well supported, and key mechanistic insight is limited. The study feels like a set of separate observations that remain incompletely worked out and have a preliminary feel to them. The model in the last figure also seems to contain hypotheses based on the observations, several of which remain to be proven.

      We present our revised manuscript, in which we've more clearly outlined our rationale and conclusions, as detailed in the above responses, to emphasize the overall connectivity of the study. We have also updated the title of Figure 7 to read "Theoretical Model of ARHGAP18's coordination of RhoA and Hippo signaling pathways in Human epithelial cells" to make it clear that we are presenting a working model, which has elements that will require additional investigation. Throughout the manuscript, we highlight the unknown elements that remain to be tested and other outstanding questions. Thus, we do not aim to characterize this complex signaling coordination completely. Instead, this manuscript represents the third iteration in our systematic advances to describe this entirely new signaling pathway. We agree that, despite three separate manuscripts (this one included) to date, this work represents an early stage in understanding the system; many additional studies will be needed to characterize this signaling system fully. Figure 7 is presented as a working model that results from a thoughtful combination of our collective data and that of other researchers, derived from numerous species across decades of study. We firmly believe that proposing such integrative models is valuable for advancing the field. We also recognize the importance of clearly indicating which aspects remain hypothetical. We now explicitly note in several places within the discussion which components of the model will require further validation and experimental confirmation. For example, regarding our theoretical mechanism in Figure 7 we state:

      "Validation of the direct mechanism by which YAP/TAZ transcriptional changes drive basal actin changes in ARHGAP18 KO cells will require further investigation based on predictions from RNAseq results."

      Addressing any possible connection between key effects of ARHGAP18 KO (changes in actin, focal adhesion, integrins, Yap and merlin binding) could strengthen the manuscript. One such specific question is whether the changes in integrin expression (RNAseq) are indeed connected to the actin alterations and reduction in focal adhesions (Fig 1). Staining for these integrins to show they are indeed altered, and/or manipulating any of them to reproduce changes, could provide an exciting addition.

      We attempted to stain cells for integrins by purchasing three separate antibodies. However, despite extensive optimization and careful selection of the specific integrins using our RNAseq results, we were unable to get any of these antibodies to work in any cell type or condition. We believe that there is a technical challenge to staining for integrins due to their transmembrane and extracellular components, which we were unable to overcome. As an attempt to address the reviewer's comment, we alternatively stained cells for paxillin, which directly binds the cytoplasmic tails of integrins (Figs. 3 and 5).

      Some of the experimental findings are not convincing or lack controls. Fig 1: some of the western blots are not convincing or poor quality. [...] On the same figure, the quality of LIM kinase blots is poor. [...] The signal is weak, and the blot does not appear to support the quantification. The last condition (expression of flag-ARHGAP18) results in a large drop in pLIMK and pcofilin on the blot, which is not reflected by the graph. Addition of a better blot and the use of strong positive or negative control would boost confidence in these data.

      In response to this and other reviewers' comments, we have added new western data and quantification to Figure 1. We now focus on MLC/pMLC data as we believe these data highlight the potential Rho-independent mechanism of ARHGAP18, and we were able to greatly improve the quality of the blots through careful optimization. We hope the reviewer finds these blots and quantifications (Fig. 1E and F) more convincing.

      We note that phospho-specific Western blotting presents considerably greater technical challenges than conventional blotting. We believe that an attractive-looking blot does not always correlate with quality or reproducibility, and we have focused on taking extraordinarily careful steps in the blotting of our phospho-specific antibodies, which at times comes at the cost of the blot's appearance. For example, all phospho-specific antibodies are run using two-color fluorescent markers to blot against both the total protein and the phospho-protein on the same blot. This approach often leads to blots that have reduced signal to noise compared to chemiluminescent Westerns. Additionally, we use phospho-specific blocking buffer reagents which do not contain phosphate-based buffers or agents that attract non-specific phospho-staining signals. These blocking buffers are not as effective as non-fat milk in PBS at blocking the background signal; however, they are ultimately cleaner for phospho-specific primary antibodies. We use carefully optimized protocols, from cell treatment to lysis, transfer, and antibody incubation, including methods developed by laboratories where the corresponding author of the manuscript was trained. Nonetheless, despite these efforts, we have now removed the LIMK and cofilin data because we deemed them unnecessary for the main conclusions of this manuscript and were unable to improve their quality to satisfy the reviewer.

      The changes in pMLC on the western blots are very small, and for any conclusion, these studies require quantification. Further, the expression levels of Flag-ARHGAP18 need to be shown to support the statement that the protein is expressed, and indeed overexpressed under these conditions (vs just re-expressed).

      In continuation of the above comment, we have made significant effort to improve the quality of our pMLC western blots and now provide quantification in Figure 1. We also now provide the Flag-ARHGAP18 signal as requested by the reviewer.

      Fig 4: the differences in YAP nuclear localization under the various conditions are not well visible. Quantitation of nuclear/cytosolic signal ratio should be provided. Please provide a rationale and more context for using serum starvation and re-addition. What is the expected effect? Serum removal and addition is referred to as nutrient removal and re-addition, but this is inaccurate, as it does not equal nutrient removal, since serum contains a variety of other important components, e.g. growth factors too.

      We have provided new quantification of the nuclear/cytosolic signal ratio in Figure 6D. We have explained our rationale for the study through the following new text:

      "Merlin is activated and localized to junctions upon signaling, promoting growth and proliferation; among these signals is the availability of growth factors and other components of serum (Bretscher et al., 2002). We hypothesized that, since ARHGAP18 formed a complex with Merlin, ARHGAP18 may localize to junctions under conditions which promote Merlin activation."

      We have changed our use of "nutrient removal" to "serum removal".

      The binding between ARHGAP18 and merlin is interesting, but a key limitation is the use of expressed proteins. Can the binding be shown for the endogenous proteins (IP, colocalization)? Another important unaddressed question is the relevance of this binding, and the relation of this to altered YAP nuclear localization.

      Our data in Fig. 6G show binding of a resin-bound human ARHGAP18 to endogenous YAP from human cells, as suggested by the reviewer. In Fig. 6A, we have chosen to use GFP-Merlin because Merlin shares approximately 60% sequence identity with Ezrin, Radixin, and Moesin (ERMs). Their similarity is such that Merlin was named for Moesin-Ezrin-Radixin-Like Protein. In our experience, nearly all Merlin or ERM antibodies have some cross-contaminating signal. Thus, a major concern is that if we were to blot for endogenous Merlin in the pull-down experiment, we may see a band that could in fact be ERMs. To avoid this, we tagged Merlin with GFP to ensure that the product pulled down by ARHGAP18 was Merlin, not an ERM. Regarding the ARHGAP18-resin bound column, our homemade ARHGAP18 antibody is polyclonal. We have extensive experience in pulldown assays and have found that the binding of a polyclonal antibody to the bait protein can produce less accurate results, as the binding site for the antibody is unknown and can sterically hinder attachment of target proteins like Merlin. In our experience, attachment to a flag-tag, which is expressed after a flexible linker at the N- or C-terminus, allows us to overcome this limitation, and this is the approach we've used in this manuscript.

      Minor comments:

      Introduction line 99: "When localized to the nucleus, YAP/TAZ promotes the activation of cytoskeletal transcription factors associated with cell proliferation and actin polymerization" Please clarify what you mean by this statement, which is inaccurate in its present form. Did you mean effects on transcription factors that control cytoskeletal proteins, or do you mean that Yap/Taz affect these proteins? Please also provide a reference for this.

      We've altered the sentence as suggested by the reviewer, which now reads the following:

      "When localized to the nucleus, YAP/TAZ promotes transcriptional changes associated with cell proliferation and actin polymerization."

      The full mechanism for how YAP/TAZ promotes proliferation and actin polymerization is a currently debated issue. We do not think introducing the various current proposed models is required for this manuscript, and we simply intend to convey that when in the nucleus, YAP/TAZ promotes transcriptional changes that drive actin polymerization and cell proliferation.

      -What is the cell confluence in these experiments? For epithelial cells confluence affects actin structure. Please comment on similarity of confluency across experimental conditions?

      All cellular experiments are paired, with WT and ARHGAP18 KO cells plated at the same time under identical conditions. For imaging, we plate all cells onto glass coverslips in a 6-well dish so that each condition is literally in the same cell culture plate and gets identical treatment. In our prior eLife paper studying ARHGAP18, we showed that ARHGAP18 KO cells and WT cells divide at a similar rate and have similar proliferation characteristics. The epithelial cell cultures are maintained for experiments at around 70-80% confluency. For the focal adhesion staining experiments, the confluency is slightly lower, between 50-60%, to capture the focal adhesions towards the leading edge. We have added the following new text to further describe these methods: "Cell cultures for experiments were maintained at 70%-80% confluency. For focal adhesion experiments, the cell cultures were maintained at 50%-60% confluency."

      -Fig 2 legend: please indicate that the protein detected was non-muscle myosin heavy chain (distinct from the light chain detected in Fig 1).

      We have altered the legend of the original Figure 2 (new Figure 3).

      -Line 339-340: please check the syntax of this sentence

      -Western blot quantification: the comparison of experiments with samples run on different gels/blots requires careful normalization and experimental consistency. Please describe how this was achieved.

      We have added the following new text to further describe these methods:

      "For blots which required quantification of antibodies that were only rabbit primaries (e.g., pMLC/MLC antibodies listed above), samples were loaded onto a single gel and transferred onto a single membrane at the same time. After transfer, the membrane was cut in half and subsequent steps were done in parallel. All quantified blots were checked for equal loading using either anti-tubulin as a housekeeping protein or total protein as detected by Coomassie staining."

      Reviewer #3 (Significance (Required)):

      Rho signalling is a central regulator of an array of normal and pathological cell functions, and our understanding of the context dependent regulation of this key pathway remains very incomplete. Therefore, new knowledge on the role of specific regulators, such as ARHGAP18, is of interest to a very broad range of researchers. A further exciting aspect of this protein is that, despite indications by many studies that it acts as a GAP (inhibitor) for Rho proteins, there are findings in the literature that suggest that its manipulation can affect actin in an unexpected (opposite) manner. These point to possible Rho-independent roles, and warrant further in-depth exploration.

      One of the strengths of the study is that it explores possible roles of ARHGAP18 beyond RhoA and describes some new and interesting observations, which advance our knowledge. The authors use some excellent tools (e.g. ARHGAP KO cells and re-expression) and approaches (e.g. super resolution microscopy to analyze actin changes, RNAseq and bioinformatics to find genes that may be downstream from ARHGAP18). A key limitation of the study, however, is that it is not clear whether the observed findings are indeed independent from RhoA. A further limitation is that potential causal relationships between the described findings are not studied, and therefore the findings are in some cases overinterpreted, and limited mechanistic insights are provided. In some cases the exclusive use of expressed proteins is also a limitation. Finally, some of the experiments also need improvement.

      Reviewer expertise: RhoA signalling, guanine nucleotide exchange factors, epithelial biology, cell migration, intercellular junctions.

      In the above comments, we detail the new experimental data addressing reviewer 3's listed key limitations. We've added new data using the Rho GAP-deficient ARHGAP18(R365A) variant, which allows for the direct characterization of ARHGAP18's Rho-independent activity. We have introduced new data in WT cells studying endogenous proteins to address the limitations of expressed proteins. Finally, we have moderated our language to address overinterpretation. Collectively, we believe that our revised manuscript addresses the reviewer's constructive comments.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Dear editor and reviewers,

      We sincerely thank you for your thoughtful comments and constructive suggestions, which have greatly improved the quality and clarity of our manuscript. In response, we have implemented all requested changes, which are highlighted in yellow throughout the revised text, and updated several figures accordingly. Furthermore, we have performed all additional experiments recommended by the reviewers and incorporated the new data into the manuscript. To enhance clarity, we have also included a schematic representation of our proposed model in an additional figure, providing a concise visual summary of our findings.

      We hope that these revisions fully address all concerns raised by the reviewers and meet all the expectations for publication.

      Below, we answer the reviewers point by point (in blue).


      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this paper, the authors address the important question of the role of centrosomes during neuronal development. They use Drosophila as an in vivo model. The field is somewhat unclear on the role and importance of centrosomes during neuronal development, although the current data would suggest they are dispensable for axon specification and growth. Early studies in cultured mammalian neurons showed that centrosomes are active and that their microtubules can be cut and transported into the neurites. But a study then showed that centrosomes in these cultured neurons are deactivated relatively early during neuronal development in vitro and that ablating centrosomes even when they are active had no obvious effect on axon specification and growth. Consistent with this, a study in Drosophila provided evidence that centrosomes were not active or necessary in different types of neurons. More recently, a study showed that centrosomal microtubules are dispensable for axon specification and growth in mice in vivo but are required for neuronal migration in the cerebral cortex. However, another study has linked the generation of acetylated microtubules at centrosomes with axon development. In this current study, the authors examine the effect of centrosome loss on various motor and sensory neurons and muscles mainly by examining mutants in essential centriole duplication genes. They associate axonal routing and morphology defects with centrosome loss and provide some evidence that centrosomes could still be active in the developing neurons. Overall, they conclude that centrosomes are active during at least early neuronal development and that this activity is important for proper axonal morphology and routing.

      While I think this study addresses a very interesting and important question, I think as it stands the data is not sufficient to be conclusive on a role for centrosomes during neuronal development. My biggest concern is that most phenotypes have not yet been shown to be cell autonomous, as whole animal mutants have been analysed rather than analysing the effect of cell-specific depletion, and the evidence for active centrosomes needs to be strengthened. If the authors can provide stronger evidence for a role of centrosomes in axonal development then the paper will certainly be of interest to a broad readership.

      We thank the reviewer for the clear and concise summary and fully agree that our study addresses a critical gap in understanding. Centrosomes have long been implicated in morphogenesis, yet their precise contribution to nervous system development has remained unclear. Our findings provide compelling evidence that centrosomes are indispensable for proper nervous system formation and that their absence also triggers muscular defects, highlighting their broader role in tissue organization.

      We acknowledge that the original manuscript lacked some key details; therefore, we have now strengthened our conclusions with additional experiments. Specifically, we demonstrate that these effects are cell-autonomous by using two independent RNAi lines targeted to a subset of motor neurons. Furthermore, we present new data showing that neuronal centrosomes remain active during the early stages of axonal development, emphasising their functional relevance in morphogenesis. All new experiments, figures, and corresponding text revisions are detailed below.

      Major comments

      1) The sas-6 transallelic combination shows only 17% embryonic lethality compared to 50% embryonic lethality with sas-4 mutants. Given that both mutants should result in the same degree of centrosome loss (this should be quantified in sas-6 mutants) it would suggest that either sas-4 has other roles away from centrosomes or that the sas-4 mutant chromosome used in the experiment has other mutations that affect viability. The effect of picking up "second-site lethal" mutations on mutant chromosomes is common and so I would not be surprised if this is the reason for the difference in phenotypes. This can be addressed either by "cleaning up" the sas-4 mutant chromosome by backcrossing to wild-type lines, allowing recombination to occur and replace the potential second site mutations, or by using transallelic combinations of sas-4, as they did for sas-6. The "easier" option may just be to analyse all the phenotypes with the sas-6 transallelic combination.

      We appreciate this comment, as it brought to light an issue with the CRISPR line Sas-6-Δa. Upon reanalysing all the data, we determined that this line is embryonic lethal both in homozygosis and when combined with the deficiency uncovering the genomic region, Df(3R)BSC794. In contrast, Sas-6-Δb homozygotes are viable. The inconsistency between these results raised concerns about whether the Δa and Δb Sas-6 mutants carry deletions confined to the Sas-6 coding region. Although this would not hinder our cell biology analysis, it could represent a problem in viability tests. To address this, we repeated all analyses using Sas-6-Δb homozygotes and Sas-6-Δb combined with Df(3R)BSC794. These new results are more consistent and indicate that approximately 50% of Sas-6/Def individuals hatch as adults. Fig. 3 was redone and the manuscript text changed in view of these results.

      2) Using "whole animal" mutants for assessing neuronal morphology is risky due to non-cell-autonomous effects. The authors have carried out some phenotypic analysis of neurons depleted of Sas-4 by cell-specific RNAi, but I feel they need to do this for all of their analysis. This includes embryonic lethality measures, quantification of centrosome numbers, and all axonal phenotypes in Sas-4 RNAi neurons. It would also be prudent to use 2 distinct RNAi lines to help ensure any phenotypes are not off-target effects (and this may help clarify why the authors see some additional phenotypes with RNAi). Indeed, there are relatively weak phenotypes in muscles when using RNAi compared to the mutants and these potential non-cell-autonomous effects could then have a knock-on effect on neuronal morphology. If the authors were concerned that RNAi is not very efficient (explaining any potential weaker phenotypes than in mutants) the authors could examine the effectiveness of RNAi lines by analysing protein depletion by western blotting or mRNA depletion by rt-qPCR (although this has to be done in a different cell type due to the difficulty in obtaining a neuronal extract).

      We have now added a new panel to supplementary Figure 1, showing how the expression of a different Sas-4 RNAi line (2) induces similar nervous system phenotypes when expressed only in aCC, pCC and RP2 pioneer neurons (Sup. Fig. 1 M-O).

      3) When analysing centriole presence or absence it is a good idea to stain with two different centriole markers e.g. Asl and Plp. This helps rule out unspecific staining. It is clear from the images that similar sized foci can be observed outside of the cells (see Figure 5A for example), so clearly some of the foci that appear to be within the cells may also be unspecific staining.

      In a new supplementary figure, we now show that Asl and Plp colocalize and quantify how often we find this colocalization in neurons (Suppl. Fig. 3). In addition, we apologise for the confusion: foci appear outside the marked cells because these are wholemount embryonic stainings, and the anti-Plp antibody marks all centrosomes in all cells in the embryo.

      4) The evidence for active centrosomes is not that convincing. Acetylated tubulin is associated with stable MTs, which are not normally organised by "active" centrosomes that nucleate dynamic microtubules. Moreover, it is plausible that centriole foci happen to overlap with the acetylated tubulin staining by chance. This would explain why not all centrosomes colocalise with acetylated tubulin signal. The authors could better test centrosome activity by performing live imaging with EB1-GFP. If centrosomes are active, it is very easy to observe the many comets produced by the centrosomes.

      We appreciate the reviewer’s comment and agree that acetylated tubulin alone is not an ideal marker for centrosome activity. To address this, we performed live imaging of aCC neurons expressing EB1-GFP together with Asl-Tomato. This was technically challenging because we were imaging only two neurons per segment in live embryos, under significant limitations in fluorescence detection and timing. Despite these constraints, we were able to clearly observe EB1 comets emerging from the centrosome and moving toward the cell periphery, providing direct evidence of microtubule nucleation from centrosomes in neurons.

      Importantly, we complemented this with a microtubule depolymerization/polymerization assay, which provides unequivocal evidence that polymerization initiates at the centrosome. After depolymerization, we observed microtubule regrowth from the centrosome, confirming its role as an active microtubule-organizing centre in these neurons. Together, we hope that these results are enough to demonstrate that neuronal centrosomes are functionally active during early axonal development. These experiments are presented in Figure 6 and corresponding text in the manuscript.

      5) If the authors believe that centrosomes have a role in axon pathfinding in sensory neurons, they should show that these centrosomes are active, at least during early stages (again using EB1-GFP imaging).

      We appreciate the reviewer’s suggestion and agree that EB1-GFP imaging would be the most direct way to assess centrosome activity in sensory neurons. However, performing time-lapse imaging in these neurons is technically very demanding due to their location and accessibility in live embryos, and we did not attempt this approach. Instead, we now provide new evidence showing that sensory neuron centrosomes colocalize with both α-tubulin and γ-tubulin. This strongly supports that these centrosomes are associated with microtubule nucleation machinery and are as likely as motor neuron centrosomes to be active during early stages of axon development. These new data have been included in the revised manuscript (see Figure 5 and corresponding text).

      6) The authors mention in the discussion that "increased JNK activity, can result in axonal wiggliness (Karkali et al, 2023)". I therefore wonder whether centrosome loss may induce JNK activation (the stress response), as this would then indicate an indirect effect of centrosome loss on axonal structure rather than a direct influence of centrosome-generated microtubules. The authors could assess whether the DNK-JNK pathway is activated in neurons lacking centrosomes by expressing UAS-Puc-GFP and quantifying the nuclear signal.

      In a new supplementary figure, we now show, using a reporter for JNK signalling as requested, that Sas-4 neurons do not activate the JNK pathway (Suppl. Fig. 4).

      7) In Figure 5, the authors claim that they find "a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo". I don't think this is a strong correlation. The difference in centriole number between embryos with no defects and those with defects is very small. In contrast, the difference between centriole numbers in control (no defects) and mutant (no defects) is very large. So, there does not appear to be a strong correlation between centrosome number and phenotype.

      We agree and we have corrected this sentence to better explain the results.

      Minor comments

      1) I don't understand Figure 3C - why do the % of surviving homozygotes and heterozygotes add up to 100%? Should the grey boxes not relate to dead and the white to surviving?

      Thank you for pointing this out. Figures 1B and 3C represent only the surviving individuals. The grey boxes correspond to surviving homozygotes, and the white boxes correspond to surviving heterozygotes. The percentages add up to 100% only at embryonic stages because all embryos reach late embryonic stages. The grey and white boxes reflect the proportion of these two genotypes among the survivors, not the total number of embryos including those that died. We have changed the text to convey this.

      2) "In mouse fibroblasts, myoblasts and endothelial cells, centrosome orientation is important for nuclear positioning and cell migration(Chang et al, 2015; Gomes et al, 2005; Kushner et al, 2014)." Do you mean "centrosome position"?

      Yes, text changed, thank you for spotting it.

      3) In the introduction, the authors mention Meka et al. when saying the centrosomal microtubules are important for axonal development, but they should also discuss the counter argument from Vinopal et al., 2023 (Neuron) that showed how centrosomes were required for neuronal migration but not axon growth, which was instead mediated by Golgi-derived microtubules.

      Done, thank you very much.

      4) Lines 228-230 - repeated sentence

      Corrected, thank you very much.

      5) Additionally, we did not detect centrioles in the quadrant opposite the axon exit point (Fig. 2B n=75) - this data is not in Fig 2B

      Correct, it is in figure 4B, thank you very much.

      6) "This significant decrease in the humber of centrioles further supports the critical role of Sas-4 in pioneer neurons of the ventral nerve cord (VNC) during Drosophila embryogenesis". It rather highlights that Sas-4 is required for centriole formation in these neurons. Also, humber = number.

      We agree, and have changed the text, thank you very much.

      7) Result title: Non-ciliated sensory neurons have centrioles. This is kind of obvious. A better title may be "axon phenotypes correlate with centriole numbers in sensory neurons" but unfortunately I don't think there is good evidence for this (see major point above).

      We agree and have changed the title. We now believe we have strong evidence to support it, and we hope the additional data presented in the revision convincingly demonstrate this point.

      Reviewer #1 (Significance (Required)):

      As mentioned above, the advance will be important if more evidence is provided. In this case, the paper will be interesting to a broad readership. But currently the paper is limited by the lack of evidence for centrosome function and activity in the neurons.

      We hope that Reviewer 1 now considers that the manuscript is no longer limited and that it shows convincing evidence for centrosome function and activity in embryonic neurons.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary: In this manuscript, Gonzalez et al. examine the potential function of centrosomes in the neurons and muscle cells of Drosophila embryos. By studying various mutant and RNAi lines in which centriole duplication has been disrupted, they conclude that the loss of centrioles disrupts axonal pathfinding and muscle integrity.

      Major points: 1. Throughout the manuscript, the phenotypes presented are often quite subtle. For this reason, I would really recommend that these experiments are scored blind. Perhaps the authors did this, but I didn't see any mention of this.

      All our phenotypic analyses are performed blind. We apologize for not having originally included this information in the Methods section; it has now been added. Embryos are stained using colorimetric methods (DAB) to label the nervous system, while balancer chromosomes are marked with a fluorescent antibody. This approach allows us to assess and quantify phenotypes using white light without knowing whether the embryos are homozygous mutants or heterozygous, which can only be detected by changing the channels to fluorescence.

      2. The authors conclude that neurons have active centrioles that function as centrosomes (Figure 6), but the data here is confusing. The authors state that in these cells they observe acetylated MTs extending from the centrosomes and these colocalised with g-tubulin. But the authors don't show the overlap between centrosomes, g-tubulin and MTs, as they stain for these separately. This is problematic, as it was not clear from these images that the majority of the MTs really are extending from the centrosome: the centrosome may just associate or be close by to these MT cables (Figure 6A,B). Moreover, the authors show that only a fraction of the centrosomes in these cells associate with g-tubulin, so presumably in cells where the centrosomes lack g-tubulin they would not expect the centrosomes to be associated with the MTs-but they do not show that this is the case. Perhaps the authors can't test this, but an alternative would be to show that these MT arrays are absent in Sas-4 mutants. This would give more confidence that these MTs arise from the centrosomes.

      We agree that the initial data based on acetylated microtubules and γ-tubulin colocalization were not sufficient to conclude that microtubules originate from the centrosome, as these markers can only suggest association. To address this, we have now included additional experiments that provide direct evidence of centrosome activity.

      First, we performed live imaging of aCC neurons expressing EB1-GFP together with Asl-Tomato. Despite the technical challenges of imaging only two neurons per segment in live embryos under strict fluorescence and timing constraints, we were able to clearly observe EB1 comets emerging from the centrosome and moving toward the cell periphery. This demonstrates active microtubule nucleation from centrosomes rather than mere proximity to microtubule bundles.

      Second, we carried out a microtubule depolymerization/polymerization assay, which provides unequivocal evidence that polymerization initiates at the centrosome. After depolymerization, microtubules regrew from the centrosome, confirming its role as an active microtubule-organizing center. These experiments go beyond colocalization and directly address the concern that centrosomes might simply be adjacent to microtubule cables.

      Regarding the suggestion to use Sas-4 mutants, while we did not perform this experiment, the regrowth assay combined with EB1 imaging strongly supports that these microtubules originate from the centrosome. All new data are presented in Figure 6 and the corresponding text in the revised manuscript.

      3. The authors show that muscle cell integrity is compromised by centriole-loss (Figure 2). This is very surprising as it is widely believed that centrosomes are non-functional in muscle cells, and the MTs are instead organised around the nuclear envelope. I'm not aware of the situation in Drosophila muscle cells, but the authors should ideally try to examine if the centrioles are functioning as centrosomes in these cells. At the very least they should discuss how they think centriole-loss is influencing the muscle integrity when it is widely believed they are inactive in these cells.

      We do not claim that centrosomes are active in muscle cells at these developmental stages. The observed muscle defects could result from earlier processes such as cell division, migration, or muscle fusion. We agree that this is an intriguing observation; however, pursuing this question further would go beyond the scope of the current manuscript. As requested by the reviewer, we have now expanded the discussion to consider how centriole loss might impact muscle integrity.

      4. Regardless of the strength of the supporting data, I think the authors should tone down their conclusions. The title and abstract led me to believe that centriole loss would cause significant problems in axonal pathfinding and muscle integrity. In all the mutant specimens examined (and certainly the low magnification views shown in Figure 1D'-F', Figure 1I'-K' and Figure 2D'-F') the mutants look very similar to the WT. Many readers may not get past the title and abstract, so the authors should make it clearer that these defects are very subtle.

      We have changed the text to convey this idea.

      Minor points: 1. In Figures 4 and 5, CP309 staining is relied on to identify centrioles, but there is quite a background of non-specific dots, making it hard to be certain what is a centriole and what isn't. For example, in Figure 5D' there are lots of dots within some of the cells - are any of these centrioles? How can the authors be certain which dot is a centriole in some of the cells shown in Figure 5C'? Is it possible to use a second marker and only count as centrioles dots that are recognised by both antibodies?

      We thank the reviewer for this suggestion and agree that using a second marker improves confidence in centriole identification. In a new supplementary figure (Supplementary Fig. 3), we now show that Asl and Plp colocalize in neurons and provide a quantification of the frequency of this colocalization. This dual labelling confirms the identity of centrioles and addresses the concern about non-specific background.

      We also apologize for any confusion regarding the presence of foci outside the marked cells. These images are whole-mount embryonic stainings, and the anti-Plp antibody labels all centrosomes in all cells of the embryo, which explains the additional foci observed.

      2. In the abstract the authors state that traditionally centrosomes have been considered to be non-essential in terminally differentiated cells. I don't think this is correct. In the standard "textbook" view of a cell, the centrosome is normally positioned in the centre of the cell organising an extensive array of MTs that are thought to play an important role in organising intracellular transport, the positioning and movement of organelles and the maintenance and establishment of cell polarity. I don't think it is only recent evidence that suggests they play vital roles in terminally differentiated cells.

      We thank the reviewer for this correction and we have changed the text accordingly.

      3. Line 162 the authors state that in the RNAi knockdown lines they observe several additional phenotypes, but then in the same sentence (Line 164) they say that these defects were also observed in the original mutant and mutant/Df lines.

      We apologise for this confusion; we have rearranged the sentence for clarity.

      4. The sentences in Lines 281-287 don't reference any of the Figures, so it seems the authors are just stating these results without presenting any data (e.g. "Significantly, we also found a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo"). If they've tested this correlation, they should show it.

      We have rearranged the sentences for better understanding.

      5. In Figure 7 I did not understand how the authors measured tortuosity (wiggliness) and could see no description in the methods. This is important as, again, the defect seems quite subtle, but perhaps I am not understanding which bits of the axon are being measured. Is it just the small bit of the axons close to the asterisks that is being measured, or the whole FasII track?

      We have now added another quantification and additional descriptions in the methods section.

      Reviewer #2 (Significance (Required)):

      The potential function of centrosomes in axonal outgrowth is quite controversial, so this study is potentially of considerable interest.

      However, several aspects of the data presented here were confusing or not terribly convincing. In its present state, I don't think the main conclusions are strongly enough supported by the data.

      We hope that Reviewer 2 now considers that the manuscript is no longer confusing and that it shows convincing evidence for centrosome function and activity in embryonic neurons.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The manuscript of González et al. entitled "Centriole Loss in Embryonic Development Disrupts Axonal Pathfinding and Muscle Integrity" deals with the role of centrosomes in shaping axonal morphology. To this aim the AA analysed Drosophila Sas-4 mutants that are reported to develop until adult stage without centrioles. Remarkably, the AA observe that 50% of the homozygous mutant embryos fail to hatch as larvae. The present observations suggest that centrosome loss results in axonemal shaping defects and muscle developmental abnormalities. Finally, the AA show the presence of functional centrosomes in neurons. In my opinion, the manuscript is interesting because it shows unexpected findings. However, to justify these new findings the AA are required to improve some experimental observations.

      We thank the reviewer for his summary of our work and for considering it interesting. We have taken into account all the comments and believe that these have helped improve our manuscript.

      Major: Abstract- It is unclear in which phenotypic condition the observations of centrosome loss or centrosome presence have been found. Please better explain. l.36. embryos, larvae, adult, from Sas4 or controls? If mutants, the observations are very interesting since Sas4 would be without centrioles. Indeed, Basto et al., show that chemosensory neurons do not develop an axoneme in the absence of centrioles, but extend dendrites toward the sensory bristle.

      We have made clear which observations refer to wild-type and which to Centriole Loss (CL) conditions. CL conditions refer to mutant and downregulation conditions, whereas targeted downregulation refers to RNAi downregulation only in neurons.

      I do not think the use of "centriole" in the main title is appropriate, since the centrioles would be localized by true centriolar antigens rather than by centrosomal antigens. This problem occurs throughout the text and some figures where the AA image centrioles by centrosomal material. Only in Fig. 5A do the AA properly look at Asl localization. The other pictures of presumptive centrioles or centriole quantification report CP309 dots. This localization does not unequivocally reveal centrioles, since CP309 is essentially required for centrosome-mediated Mt nucleation. There are differentiated Drosophila tissues in which centrioles are present, but inactivated, and unable to recruit pericentriolar material. Mt are nucleated by ncMTOCs that contain centrosomal material and gamma-tubulin. Thus, the centrosomal antigens do not colocalize with centrioles.

      We have changed centrioles to centrosomes in the title and most sections in the manuscript. We have also included an extra control showing that Asl and Plp colocalize, and we quantify how often this colocalization occurs in neurons (Suppl. Fig. 3). Asl is a reliable and widely used marker for centrioles, as it localizes specifically to the centriole structure (Varmark H, Llamazares S, Rebollo E, Lange B, Reina J, Schwarz H, Gonzalez C. Asterless is a centriolar protein required for centrosome function and embryo development in Drosophila. Curr Biol. 2007 Oct 23;17(20):1735-45. doi: 10.1016/j.cub.2007.09.031. PMID: 17935995).

      Minor: l. 58. The early arrest is mainly due to a checkpoint control. In double mutants for Sas4 and P53 the embryos survive longer, even if their further development is arrested.

      We thank the reviewer for this comment, and we have changed the text accordingly.

      1. Previous works, also quoted by the AA, reported that in mature neurons the centrosomes are inactivated, whereas the present manuscript describes functional centrosomes in the Drosophila motor and peripheral nervous system. This is an intriguing observation that needs a better explanation in the Discussion section.

      We thank the reviewer for this comment, and we have changed the discussion accordingly.

      l.143-145. I understand that 50% of the Sas4 embryos that reach the adult stage have centrioles. Is it correct? But if it is so, how the AA explain the absence of centrioles in sensory neurons of adult flies as reported by Basto et al. ?

      According to our results, they already have fewer centrioles than controls at embryonic stages. In addition, as reported in Basto et al., they continue losing centrioles during larval stages and metamorphosis, which explains why centrioles are not detected at adult stages.

      l.215. It is unclear for me why the AA analyse Sas6 flies, unless explain the mutant phenotype.

      We analysed Sas-6 mutants to strengthen our conclusions from Sas-4 and to exclude the possibility that the observed phenotypes arise from a centrosome-independent function of Sas-4. Using Sas-6 mutants was one of the additional steps we took to confirm that the effects are specifically due to centrosome loss.

      l.221. How the centrioles have been quantified? What antibody, the AA used.

      We have quantified centrosomes using antibodies against Plp (CP309) and Asl-YFP expression.

      l.244. and Fig 4C,D. I see high background with CP309. As reported previously I think better to use antibodies against centriolar proteins, such as Sas6, Ana1, Asl, or Sas4 ( if centrioles are present in 50% of mutants as the AA claim, the antibody could be also useful). In addition, I can see some CP309 spots in Fig 4E,F. Are they centrioles?

      Indeed, as we report, Sas-4 mutant embryos are not totally devoid of centrosomes. We also apologise for the confusion about the foci outside the marked cells in control embryos: these are whole-mount embryonic stainings, and the anti-Plp antibody marks all centrosomes in all cells of the embryo, not just in the neurons.

      l.270 and Fig. 5A and Fig.5 C-E. Why the AA localize Cp309 and not Asl (Fig. 5A) to detect centrioles?

      In a new supplementary figure, we now show that Asl and Plp colocalize, and we quantify how often this colocalization occurs in neurons (Suppl. Fig. 3). We can therefore use CP309 in neurons to the same effect as Asl.

      L295-296. I cannot see Mts, but only a diffuse staining. I am expecting to see distinct Mt bundles.

      The MT bundles are now easier to see in the new experiment in Fig. 5F-I, where we performed MT depolymerisation/repolymerisation. Nevertheless, we must stress that these analyses are done on whole-mount embryonic stainings.

      326-327. How the AA explain this different lethality, even if both the proteins are involved in centriole assembly?

      We have now redone all the viability and mutant phenotype analyses using the Sas-6 CRISPR mutant over the Deficiency, which is a better way to assess the phenotype.

      335-337. In my opinion the quoted publications are not relevant.

      We believe that these references back up our hypothesis because:

      • Metzger et al. 2012 stress the importance of nuclear position in muscle development in Drosophila.
      • Loh et al. 2023 relate centrosomes to nuclear migration in Drosophila.
      • Tillery et al. 2018 is a review describing MTs in muscle development in Drosophila.

      358-359. Does maternal contribution persist after gastrulation?

      While bulk degradation occurs by midblastula transition, some stable maternal products persist beyond gastrulation. In our case, if centrioles are formed due to the maternal contribution, they will only be diluted by cell division, which explains why we can detect centrioles at late embryonic stages.

      l.366. This is an intriguing point, but as previously observed I have some problem with centriole localization. References. Please uniform Journal abbreviations and control page numbers.

      We hope we have clarified this problem with the new experiments showing MT repolymerization from the centrosomes in neurons.

      Reviewer #3 (Significance (Required)):

      The manuscript is potentially interesting for people working in cell and molecular biology and development. However, the paper needs additional work to be suitable for publication.

      We hope that Reviewer 3 considers that the additional work and revision make this manuscript suitable for publication.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #2

      Evidence, reproducibility and clarity

      Summary: In this manuscript, Gonzalez et al. examine the potential function of centrosomes in the neurons and muscle cells of Drosophila embryos. By studying various mutant and RNAi lines in which centriole duplication has been disrupted, they conclude that the loss of centrioles disrupts axonal pathfinding and muscle integrity.

      Major points:

      1. Throughout the manuscript, the phenotypes presented are often quite subtle. For this reason, I would really recommend that these experiments are scored blind. Perhaps the authors did this, but I didn't see any mention of this.
      2. The authors conclude that neurons have active centrioles that function as centrosomes (Figure 6), but the data here is confusing. The authors state that in these cells they observe acetylated MTs extending from the centrosomes and these colocalised with g-tubulin. But the authors don't show the overlap between centrosomes, g-tubulin and MTs, as they stain for these separately. This is problematic, as it was not clear from these images that the majority of the MTs really are extending from the centrosome: the centrosome may just associate or be close by to these MT cables (Figure 6A,B). Moreover, the authors show that only a fraction of the centrosomes in these cells associate with g-tubulin, so presumably in cells where the centrosomes lack g-tubulin they would not expect the centrosomes to be associated with the MTs-but they do not show that this is the case. Perhaps the authors can't test this, but an alternative would be to show that these MT arrays are absent in Sas-4 mutants. This would give more confidence that these MTs arise from the centrosomes.
      3. The authors show that muscle cell integrity is compromised by centriole-loss (Figure 2). This is very surprising as it is widely believed that centrosomes are non-functional in muscle cells, and the MTs are instead organised around the nuclear envelope. I'm not aware of the situation in Drosophila muscle cells, but the authors should ideally try to examine if the centrioles are functioning as centrosomes in these cells. At the very least they should discuss how they think centriole-loss is influencing the muscle integrity when it is widely believed they are inactive in these cells.
      4. Regardless of the strength of the supporting data, I think the authors should tone down their conclusions. The title and abstract led me to believe that centriole loss would cause significant problems in axonal pathfinding and muscle integrity. In all the mutant specimens examined (and certainly the low magnification views shown in Figure 1D'-F', Figure 1I'-K' and Figure 2D'-F') the mutants look very similar to the WT. Many readers may not get past the title and abstract, so the authors should make it clearer that these defects are very subtle.

      Minor points:

      1. In Figures 4 and 5, CP309 staining is relied on to identify centrioles, but there is quite a background of non-specific dots, making it hard to be certain what is a centriole and what isn't. For example, in Figure 5D' there are lots of dots within some of the cells - are any of these centrioles? How can the authors be certain which dot is a centriole in some of the cells shown in Figure 5C'? Is it possible to use a second marker and only count as centrioles dots that are recognised by both antibodies?
      2. In the abstract the authors state that traditionally centrosomes have been considered to be non-essential in terminally differentiated cells. I don't think this is correct. In the standard "textbook" view of a cell, the centrosome is normally positioned in the centre of the cell organising an extensive array of MTs that are thought to play an important role in organising intracellular transport, the positioning and movement of organelles and the maintenance and establishment of cell polarity. I don't think it is only recent evidence that suggests they play vital roles in terminally differentiated cells.
      3. Line 162 the authors state that in the RNAi knockdown lines they observe several additional phenotypes, but then in the same sentence (Line 164) they say that these defects were also observed in the original mutant and mutant/Df lines.
      4. The sentences in Lines 281-287 don't reference any of the Figures, so it seems the authors are just stating these results without presenting any data (e.g. "Significantly, we also found a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo"). If they've tested this correlation, they should show it.
      5. In Figure 7 I did not understand how the authors measured tortuosity (wiggliness) and could see no description in the methods. This is important as, again, the defect seems quite subtle, but perhaps I am not understanding which bits of the axon are being measured. Is it just the small bit of the axons close to the asterisks that is being measured, or the whole FasII track?

      Significance

      The potential function of centrosomes in axonal outgrowth is quite controversial, so this study is potentially of considerable interest.

      However, several aspects of the data presented here were confusing or not terribly convincing. In its present state, I don't think the main conclusions are strongly enough supported by the data.

    3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



      Referee #1

      Evidence, reproducibility and clarity

      In this paper, the authors address the important question of the role of centrosomes during neuronal development. They use Drosophila as an in vivo model. The field is somewhat unclear on the role and importance of centrosomes during neuronal development, although the current data would suggest they are dispensable for axon specification and growth. Early studies in cultured mammalian neurons showed that centrosomes are active and that their microtubules can be cut and transported into the neurites. But a study then showed that centrosomes in these cultured neurons are deactivated relatively early during neuronal development in vitro and that ablating centrosomes even when they are active had no obvious effect on axon specification and growth. Consistent with this, a study in Drosophila provided evidence that centrosomes were not active or necessary in different types of neurons. More recently, a study showed that centrosomal microtubules are dispensable for axon specification and growth in mice in vivo but are required for neuronal migration in the cerebral cortex. However, another study has linked the generation of acetylated microtubules at centrosomes with axon development. In this current study, the authors examine the effect of centrosome loss on various motor and sensory neurons and muscles mainly by examining mutants in essential centriole duplication genes. They associate axonal routing and morphology defects with centrosome loss and provide some evidence that centrosomes could still be active in the developing neurons. Overall, they conclude that centrosomes are active during at least early neuronal development and that this activity is important for proper axonal morphology and routing.

      While I think this study addresses a very interesting and important question, I think as it stands the data is not sufficient to be conclusive on a role for centrosomes during neuronal development. My biggest concern is that most phenotypes have not yet been shown to be cell autonomous, as whole animal mutants have been analysed rather than analysing the effect of cell-specific depletion, and the evidence for active centrosomes needs to be strengthened. If the authors can provide stronger evidence for a role of centrosomes in axonal development then the paper will certainly be of interest to a broad readership.

      Major comments

      1. The sas-6 transallelic combination shows only 17% embryonic lethality compared to 50% embryonic lethality with sas-4 mutants. Given that both mutants should result in the same degree of centrosome loss (this should be quantified in sas-6 mutants) it would suggest that either sas-4 has other roles away from centrosomes or that the sas-4 mutant chromosome used in the experiment has other mutations that affect viability. The effect of picking up "second-site lethal" mutations on mutant chromosomes is common and so I would not be surprised if this is the reason for the difference in phenotypes. This can be addressed either by "cleaning up" the sas-4 mutant chromosome by backcrossing to wild-type lines, allowing recombination to occur and replace the potential second site mutations, or by using transallelic combinations of sas-4, as they did for sas-6. The "easier" option may just be to analyse all the phenotypes with the sas-6 transallelic combination.
      2. Using "whole animal" mutants for assessing neuronal morphology is risky due to non-cell-autonomous effects. The authors have carried out some phenotypic analysis of neurons depleted of Sas-4 by cell-specific RNAi, but I feel they need to do this for all of their analysis. This includes embryonic lethality measures, quantification of centrosome numbers, and all axonal phenotypes in Sas-4 RNAi neurons. It would also be prudent to use 2 distinct RNAi lines to help ensure any phenotypes are not off-target effects (and this may help clarify why the authors see some additional phenotypes with RNAi). Indeed, there are relatively weak phenotypes in muscles when using RNAi compared to the mutants and these potential non-cell-autonomous effects could then have a knock-on effect on neuronal morphology. If the authors were concerned that RNAi is not very efficient (explaining any potential weaker phenotypes than in mutants) the authors could examine the effectiveness of RNAi lines by analysing protein depletion by western blotting or mRNA depletion by rt-qPCR (although this has to be done in a different cell type due to the difficulty in obtaining a neuronal extract).
      3. When analysing centriole presence or absence it is a good idea to stain with two different centriole markers e.g. Asl and Plp. This helps rule out unspecific staining. It is clear from the images that similar sized foci can be observed outside of the cells (see Figure 5A for example), so clearly some of the foci that appear to be within the cells may also be unspecific staining.
      4. The evidence for active centrosomes is not that convincing. Acetylated tubulin is associated with stable MTs, which are not normally organised by "active" centrosomes that nucleate dynamic microtubules. Moreover, it is plausible that centriole foci happen to overlap with the acetylated tubulin staining by chance. This would explain why not all centrosomes colocalise with acetylated tubulin signal. The authors could better test centrosome activity by performing live imaging with EB1-GFP. If centrosomes are active, it is very easy to observe the many comets produced by the centrosomes.
      5. If the authors believe that centrosomes have a role in axon pathfinding in sensory neurons, they should show that these centrosomes are active, at least during early stages (again using EB1-GFP imaging).
      6. The authors mention in the discussion that "increased JNK activity, can result in axonal wiggliness (Karkali et al, 2023)". I therefore wonder whether centrosome loss may induce JNK activation (the stress response), as this would then indicate an indirect effect of centrosome loss on axonal structure rather than a direct influence of centrosome-generated microtubules. The authors could assess whether the DNK-JNK pathway is activated in neurons lacking centrosomes by expressing UAS-Puc-GFP and quantifying the nuclear signal.
      7. In Figure 5, the authors claim that they find "a correlation between axonal guidance phenotypes and the numbers of centrioles per embryo". I don't think this is a strong correlation. The difference in centriole number between embryos with no defects and those with defects is very small. In contrast, the difference between centriole numbers in control (no defects) and mutant (no defects) is very large. So, there does not appear to be a strong correlation between centrosome number and phenotype.

      Minor comments

      1. I don't understand Figure 3C - why do the % of surviving homozygotes and heterozygotes add up to 100%? Should the grey boxes not relate to dead and the white to surviving?
      2. "In mouse fibroblasts, myoblasts and endothelial cells, centrosome orientation is important for nuclear positioning and cell migration(Chang et al, 2015; Gomes et al, 2005; Kushner et al, 2014)." Do you mean "centrosome position"?
      3. In the introduction, the authors mention Meka et al. when saying the centrosomal microtubules are important for axonal development, but they should also discuss the counter argument from Vinopal et al., 2023 (Neuron) that showed how centrosomes were required for neuronal migration but not axon growth, which was instead mediated by Golgi-derived microtubules.
      4. Lines 228-230 - repeated sentence
      5. Additionally, we did not detect centrioles in the quadrant opposite the axon exit point (Fig. 2B n=75) - this data is not in Fig 2B
      6. "This significant decrease in the humber of centrioles further supports the critical role of Sas-4 in pioneer neurons of the ventral nerve cord (VNC) during Drosophila embryogenesis". It rather highlights that Sas-4 is required for centriole formation in these neurons. Also, humber = number.
      7. Result title: Non-ciliated sensory neurons have centrioles. This is kind of obvious. A better title may be "axon phenotypes correlate with centriole numbers in sensory neurons" but unfortunately I don't think there is good evidence for this (see major point above).

      Significance

      As mentioned above, the advance will be important if more evidence is provided. In this case, the paper will be interesting to a broad readership. But currently the paper is limited by the lack of evidence for centrosome function and activity in the neurons.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public Reviews):

      Summary:

      Argunşah et al. describe and investigate the mechanisms underlying the differential response dynamics of barrel vs septa domains of the whisker-related primary somatosensory cortex (S1). Upon repeated stimulation, the authors report that the response ratio between multi- and single-whisker stimulation increases in layer (L) 4 neurons of the septal domain, while remaining constant in barrel L4 neurons. This difference is attributed to the short-term plasticity properties of interneurons, particularly somatostatin-expressing (SST+) neurons. This claim is supported by the increased density of SST+ neurons found in L4 of the septa compared to barrels, along with a stronger response of (L2/3) SST+ neurons to repeated multi- vs single-whisker stimulation. The role of the synaptic protein Elfn1 is then examined. Elfn1 KO mice exhibited little to no functional domain separation between barrel and septa, with no significant difference in single- versus multi-whisker response ratios across barrel and septal domains. Consistently, a decoder trained on WT data fails to generalize to Elfn1 KO responses. Finally, the authors report a relative enrichment of S2- and M1-projecting cell densities in L4 of the septal domain compared to the barrel domain.

      Strengths:

      This paper describes and aims to study a circuit underlying differential response between barrel columns and septal domains of the primary somatosensory cortex. This work supports the view that barrel and septal domains contribute differently to processing single versus multi-whisker inputs, suggesting that the barrel cortex multiplexes sensory information coming from the whiskers in different domains.

      We thank the reviewer for the very neat summary of our findings that barrel cortex multiplexes converging information in separate domains.

      Weaknesses:

      While the observed divergence in responses to repeated SWS vs MWS between the barrel and septal domains is intriguing, the presented evidence falls short of demonstrating that short-term plasticity in SST+ neurons critically underpins this difference. The absence of a mechanistic explanation for this observation limits the work’s significance. The measurement of SST neurons’ response is not specific to a particular domain, and the Elfn1 manipulation does not seem to be specific to either stimulus type or a particular domain.

      We appreciate the reviewer’s perspective. Although further research is needed to understand the circuit mechanisms underlying the observed phenomenon, we believe our data suggest that altering the short-term dynamics of excitatory inputs onto SST neurons reduces the divergent spiking dynamics in barrels versus septa during repetitive single- and multi-whisker stimulation. Future work could examine how SST neurons, whose somata reside in barrels and septa, respond to different whisker stimuli and the circuits in which they are embedded. At this time, however, the authors believe there is no alternative way to test how the short-term dynamics of excitatory inputs onto SST neurons, as a whole, contribute to the temporal aspects of barrel versus septa spiking.

      The study's reach is further constrained by the fact that results were obtained in anesthetized animals, which may not generalize to awake states.

      We appreciate the reviewer’s concern regarding the generalizability of our findings from anesthetized animals to awake states. Anesthesia was employed to ensure precise individual whisker stimulation (and multi-whisker in the same animal), which is challenging in awake rodents due to active whisking. While anesthesia may alter higher-order processing, core mechanisms, such as short and long term plasticity in the barrel cortex, are preserved under anesthesia (Martin-Cortecero et al., 2014; Mégevand et al., 2009).

      The statistical analysis appears inappropriate, with the use of repeated independent tests, dramatically boosting the false positive error rate.

      Thank you for your feedback on our analysis using independent rank-based tests for each time point in wild-type (WT) animals. To address concerns regarding multiple comparisons and temporal dependencies (for Figures 1F and 4D for now; we will add more in our revision), we performed a repeated measures ANOVA for WT animals (13 Barrel, 8 Septa, 20 time points), which revealed a significant main effect of Condition (F(1,19) = 16.33, p < 0.001) and a significant Condition-Time interaction (F(19,361) = 2.37, p = 0.001). Post-hoc tests confirmed significant differences between Barrel and Septa at multiple time points (e.g., p < 0.0025 at times 3, 4, 6, 7, 8, 10, 11, 12, 16, 19 after Bonferroni post-hoc correction), supporting a differential multi-whisker vs. single-whisker ratio response in WT animals. In contrast, a repeated measures ANOVA for knock-out (KO) animals (11 Barrel, 7 Septa, 20 time points) showed no significant main effect of Condition (F(1,14) = 0.17, p = 0.684) or Condition-Time interaction (F(19,266) = 0.73, p = 0.791), indicating that the Barrel-Septa difference observed in WT animals is absent in KO animals.
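
      The arithmetic behind the multiple-comparisons concern and the corrected threshold quoted above can be sketched as follows. This is only an illustration, not the analysis code used in the study: it assumes 20 tests (one per time point), a nominal alpha of 0.05, and, for the family-wise error estimate, that the tests are independent. The function names are hypothetical.

```python
# Illustrative sketch only (hypothetical helper names, not the study's analysis code).
# Assumes 20 tests, one per time point, at a nominal alpha of 0.05.

def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests uncorrected tests,
    under the simplifying assumption that the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


n_time_points = 20
alpha = 0.05

fwer = familywise_error_rate(alpha, n_time_points)
threshold = bonferroni_threshold(alpha, n_time_points)

print(f"Uncorrected family-wise error rate: {fwer:.3f}")
print(f"Bonferroni per-test threshold: {threshold:.4f}")
```

      Under these assumptions the per-test threshold works out to 0.05 / 20 = 0.0025, which matches the p < 0.0025 criterion reported for the post-hoc comparisons.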

      Furthermore, the manuscript suffers from imprecision; its conclusions are occasionally vague or overstated. The authors suggest a role for SST+ neurons in the observed divergence in SWS/MWS responses between barrel and septal domains. However, this remains speculative, and some findings appear inconsistent. For instance, the increased response of SST+ neurons to MWS versus SWS is not confined to a specific domain. Why, then, would preferential recruitment of SST+ neurons lead to divergent dynamics between barrel and septal regions? The higher density of SST+ neurons in septal versus barrel L4 is not a sufficient explanation, particularly since the SWS/MWS response divergence is also observed in layers 2/3, where no difference in SST+ neuron density is found.

      Moreover, SST+ neuron-mediated inhibition is not necessarily restricted to the layer in which the cell body resides. It remains unclear through which differential microcircuits (barrel vs septum) the enhanced recruitment of SST+ neurons could account for the divergent responses to repeated SWS versus MWS stimulation.

      We fully appreciate the reviewer’s comment. We currently do not provide any evidence on the contribution of SST neurons in the barrels versus septa in layer 4 to the response divergence of spiking observed in SWS versus MWS. We only show that these neurons differentially distribute in the two domains in this layer. It is certainly known that there is molecular and circuit-based diversity of SST-positive neurons in different layers of the cortex, so it is plausible that this includes cells located in the two domains of vS1, something which has not been examined so far. Our data on their distribution are one piece of information that SST neurons may have a differential role in inhibiting barrel stellate cells versus septal ones. Morphological reconstructions of SST neurons in L4 of the somatosensory barrel cortex have shown that their dendrites and axons project locally and may be confined to individual domains, even though this was not specifically examined (Fig. 3 of Scala F et al., 2019). The same study also showed that L4 SST cells receive excitatory input from local stellate cells, and it is known that they are also directly excited by thalamocortical fibers (Beierlein et al., 2003; Tan et al., 2008); both of these inputs facilitate.

      As shown in our supplementary figure, the divergence is also observed in L2/3 where, as the reviewer also points out, we do not have a differential distribution of SST cells, at least based on a columnar analysis extending from L4. There are multiple scenarios that could explain this “discrepancy” that one would need to examine further in future studies. One straightforward one is that the divergence in spiking in L2/3 domains may be inherited from L4 domains, on which L4 SST neurons act. Another is that even though L2/3 SST neurons are not biased in their distribution, their input-output function is, something which one would need to examine by detailed in vitro electrophysiological and perhaps optogenetic approaches in S1. Despite the distinctive differences that have been found between the L4 circuitry in S1 and V1 (Scala F et al., 2019), recent observations indicate that small but regular patches of V1 marked by the absence of muscarinic receptor 2 (M2) have high temporal acuity (Ji et al., 2015), and selectively receive input from SST interneurons (Meier et al., 2025). Regions lacking M2 have distinct input and output connectivity patterns from those that express M2 (Meier et al., 2021; Burkhalter et al., 2023). These findings, together with ours, suggest that SST cells preferentially innervate and regulate specific domains (columns) in sensory cortices.

      Regardless of the mechanism, the Elfn1 knock-out mouse line almost exclusively affects the incoming excitability onto SST neurons (see also reply to comment below), hence what can be supported by our data is that changing the incoming short-term synaptic plasticity onto these neurons brings the spiking dynamics between barrels and septa closer together.

      The Elfn1 KO mouse model seems too unspecific to suggest the role of the short-term plasticity in SST+ neurons in the differential response to repeated SWS vs MWS stimulation across domains. Why would Elfn1-dependent short-term plasticity in SST+ neurons be specific to a pathway, or a stimulation type (SWS vs MWS)? Moreover, the authors report that Elfn1 knockout alters synapses onto VIP+ as well as SST+ neurons (Stachniak et al., 2021; previous version of this paper)-so why attribute the phenotype solely to SST+ circuitry? In fact, the functional distinctions between barrel and septal domains appear largely abolished in the Elfn1 KO.

      Previous work by others and us has shown that globally removing Elfn1 selectively removes a synaptic process from the brain without altering brain anatomy or structure. This allows us to study how the temporal dynamics of inhibition shape activity, as opposed to inhibition from particular cell types. We will nevertheless update the text to discuss more global implications for SST interneuron dynamics and include a reference to VIP interneurons that contain Elfn1.

      When comparing SWS to MWS, we find that MWS replaces the neighboring excitation which would normally be preferentially removed by short-term plasticity in SST interneurons, thus providing a stable control comparison across animals and genotypes. On average, VIP interneurons failed to show modulation by MWS. We were unable to measure a substantial contribution of VIP cells to this process and also note that the Elfn1 expressing multipolar neurons comprise only ~5% of VIP neurons (Connor and Peters, 1984; Stachniak et al., 2021), a fraction that may be lost when averaging from 138 VIP cells. Moreover, the effect of Elfn1 loss on VIP neurons is quite different and marginal compared to that of SST cells, suggesting that the primary impact of Elfn1 knockout is mediated through SST+ interneuron circuitry. Therefore, even if we cannot rule out that these 5% of VIP neurons contribute to barrel domain segregation, we are of the opinion that their influence would be very limited if any.

      Reviewer #2 (Public Reviews):

      Summary:

      Argunsah and colleagues demonstrate that SST-expressing interneurons are concentrated in the mouse septa and differentially respond to repetitive multi-whisker inputs. Identifying how a specific neuronal phenotype impacts responses is an advance.

      Strengths:

      (1)  Careful physiological and imaging studies.

      (2)  Novel result showing the role of SST+ neurons in shaping responses.

      (3)  Good use of a knockout animal to further the main hypothesis.

      (4)  Clear analytical techniques.

      We thank the reviewer for their appreciation of the study.

      Weaknesses:

      No major weaknesses were identified by this reviewer. Overall, I appreciated the paper but feel it overlooked a few issues and had some recommendations on how additional clarifications could strengthen the paper. These include:

      (1) Significant work from Jerry Chen on how S1 neurons that project to M1 versus S2 respond in a variety of behavioral tasks should be included (e.g. PMID: 26098757). Similarly, work from Barry Connor’s lab on intracortical versus thalamocortical inputs to SST neurons, as well as excitatory inputs onto these neurons (e.g. PMID: 12815025) should be included.

      We thank the reviewer for these valuable resources that we overlooked. We will include Chen et al. (2015), Cruikshank et al. (2007) and Gibson et al. (1999) to contextualize S1 projections and SST+ inputs, strengthening the study’s foundation as well as Beierlein et al. (2003) which nicely show both local and thalamocortical facilitation of excitatory inputs onto L4 SST neurons, in contrast to PV cells. The paper also shows the gradual recruitment of SST neurons by thalamocortical inputs to provide feed-forward inhibition onto stellate cells (regular spiking) of the barrel cortex L4 in rat.

      (2) Using Layer 2/3 as a proxy to what is happening in layer 4 (~line 234). Given that layer 2/3 cells integrate information from multiple barrels, as well as receiving direct VPm thalamocortical input, and given the time window that is being looked at can receive input from other cortical locations, it is not clear that layer 2/3 is a proxy for what is happening in layer 4.

      We agree with the reviewer that what we observe in L2/3 is not necessarily what is taking place in L4 SST-positive cells. The data on L2/3 was included to show that these cells, as a population, can show divergent responses when it comes to SWS vs MWS, which is not seen in L2/3 VIP neurons. Regardless of the mechanisms underlying it, our overall data support that SST-positive neurons can change their activation based on the type of whisker stimulus and when the excitatory input dynamics onto these neurons change due to the removal of Elfn1 the recruitment of barrels vs septa spiking changes at the temporal domain. Having said that, the data shown in Supplementary Figure 3 on the response properties of L2/3 neurons above the septa vs above the barrels (one would say in the respective columns) do show the same divergence as in L4. This suggests that a circuit motif may exist that is common to both layers, involving SST neurons that sit in L4, L5 or even L2/3. This implies that despite the differences in the distribution of SST neurons in septa vs barrels of L4 there is an unidentified input-output spatial connectivity motif that engages in both L2/3 and L4. Please also see our response to a similar point raised by reviewer 1.

      (3) Line 267, when discussing distinct temporal response, it is not well defined what this is referring to. Are the neurons no longer showing peaks to whisker stimulation, or are the responses lasting a longer time? It is unclear why PV+ interneurons which may not be impacted by the Elfn1 KO and receive strong thalamocortical inputs, are not constraining activity.

      We thank the reviewer for their comment and will clarify the statement.

      This convergence of response profiles was further clear in stimulus-aligned stacked images, where the emergent differences between barrels and septa under SWS were largely abolished in the KO (Figure 4B). A distinction between directly stimulated barrels and neighboring barrels persisted in the KO. In addition, the initial response continued to differ between barrel and septa and also septa and neighbor (Figure 4B). This initial stimulus selectivity potentially represents distinct feedforward thalamocortical activity, which includes PV+ interneuron recruitment that is not directly impacted by the Elfn1 KO (Sun et al., 2006; Tan et al., 2008). PV+ cells are strongly excited by thalamocortical inputs, but these exhibit short-term depression, as does their output, contrasting with the sustained facilitation observed in SST+ neurons. These findings suggest that in WT animals, activity spillover from principal barrels is normally constrained by the progressive engagement of SST+ interneurons in septal regions, driven by Elfn1-dependent facilitation at their excitatory synapses. In the absence of Elfn1, this local inhibitory mechanism is disrupted, leading to longer responses in barrels, delayed but stronger responses in septa, and persistently stronger responses in unstimulated neighbors, resulting in a loss of distinction between the responses of barrel and septa domains that normally diverge over time (see Author response image 1 below).

      Author response image 1.

      (A) Barrel responses are longer following whisker stimulation in KO. (B) Septal responses are slightly delayed but stronger in KO. (C) Unstimulated neighbors show longer persistent responses in KO.

       

      (4) Line 585 “the earliest CSD sink was identified as layer 4…” were post-hoc measurements made to determine where the different shank leads were based on the post-hoc histology?

      Post hoc histology was performed on plane-aligned brain sections, which allowed us to detect barrels and septa and so confirm the insertion domain of each recorded shank. Layer specificity of each electrode could therefore not be confirmed by histology, as we did not have coronal sections in which to measure electrode depth.

      (5) For the retrograde tracing studies, how were the M1 and S2 injections targeted (stereotaxically or physiologically)? How was it determined that the injections were in the whisker region (or not)?

      During the retrograde virus injection, the location of M1 and S2 injections was determined by stereotaxic coordinates (Yamashita et al., 2018). After acquiring the light-sheet images, we were able to post hoc examine the injection site in 3D and confirm that the injections were successful in targeting the regions intended. Although it would have been informative to do so, we did not functionally determine the whisker-related M1 and whisker-related S2 region in this experiment.

      (6) Were there any baseline differences in spontaneous activity in the septa versus barrel regions, and did this change in the KO animals?

      Thank you for this interesting question. Our previous study found that there was a reduction in baseline activity in L4 barrel cortex of KO animals at postnatal day (P)12, but no differences were found at P21 (Stachniak et al., 2023).

      Reviewer #3 (Public Reviews):

      Summary:

      This study investigates the functional differences between barrel and septal columns in the mouse somatosensory cortex, focusing on how local inhibitory dynamics, particularly involving Elfn1-expressing SST⁺ interneurons, may mediate temporal integration of multiwhisker (MW) stimuli in septa. Using a combination of in vivo multi-unit recordings, calcium imaging, and anatomical tracing, the authors propose that septa integrate MW input in an Elfn1-dependent manner, enabling functional segregation from barrel columns.

      Strengths:

      The core hypothesis is interesting and potentially impactful. While barrels have been extensively characterized, septa remain less understood, especially in mice, and this study's focus on septal integration of MW stimuli offers valuable insights into this underexplored area. If septa indeed act as selective integrators of distributed sensory input, this would add a novel computational role to cortical microcircuits beyond what is currently attributed to barrels alone. The narrative of this paper is intellectually stimulating.

      We thank the reviewer for finding the study intellectually stimulating.

      Weaknesses:

      The methods used in the current study lack the spatial and cellular resolution needed to conclusively support the central claims. The main physiological findings are based on unsorted multi-unit activity (MUA) recorded via low-channel-count silicon probes. MUA inherently pools signals from multiple neurons across different distances and cell types, making it difficult to assign activity to specific columns (barrel vs. septa) or neuron classes (e.g., SST⁺ vs. excitatory).

      The recording radius (~50-100 µm or more) and the narrow width of septa (~50-100 µm or less) make it likely that MUA from "septal" electrodes includes spikes from adjacent barrel neurons.

      The authors do not provide spike sorting, unit isolation, or anatomical validation that would strengthen spatial attribution. Calcium imaging is restricted to SST⁺ and VIP⁺ interneurons in superficial layers (L2/3), while the main MUA recordings are from layer 4, creating a mismatch in laminar relevance.

      We thank the reviewer for pointing out the possibility of contamination in septal electrodes. Importantly, it may not have been highlighted, although reported in the methods, but we used an extremely high threshold (7.5 std, in methods, line 583) for spike detection in order to overcome the issue raised here, which restricts such spatial contaminations. Since the spike amplitude decays rapidly with distance, at high thresholds, only nearby neurons contribute to our analysis, potentially one or two. We believe that this approach provides a very close approximation of single unit activity (SUA) in our reported data. We will include a sentence earlier in the manuscript to make this explicit and prevent further confusion.
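
      To illustrate why a high detection threshold limits the number of contributing neurons, a minimal threshold-crossing sketch is given below. It is not the detection pipeline used in the study: the function name is hypothetical, the noise level is estimated with a plain standard deviation of the whole trace, and crossings are detected on the absolute deviation from the mean. All of these are assumptions made for illustration, with only the 7.5 std multiplier taken from the response above.

```python
import statistics

# Illustrative sketch only (hypothetical function, not the study's pipeline).
# Assumptions: noise is estimated as the population standard deviation of the
# whole trace, and a "spike" is the first sample of any run of samples whose
# absolute deviation from the mean exceeds n_std standard deviations.

def detect_threshold_crossings(signal, n_std=7.5):
    """Return the indices at which the trace first crosses the threshold."""
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    threshold = n_std * std

    crossings = []
    was_above = False
    for index, value in enumerate(signal):
        is_above = abs(value - mean) > threshold
        if is_above and not was_above:
            crossings.append(index)  # record only the onset of each event
        was_above = is_above
    return crossings


# Toy trace: low-amplitude background with one large nearby "spike" and one
# smaller event standing in for a more distant unit.
trace = [1.0 if i % 2 == 0 else -1.0 for i in range(1000)]
trace[500] = 100.0   # large event, e.g. a unit close to the electrode
trace[800] = 10.0    # smaller event, e.g. a unit farther away

print(detect_threshold_crossings(trace, n_std=7.5))
```

      In this toy trace only the large event exceeds the threshold, while the smaller one is rejected, which is the behaviour the high threshold is meant to achieve.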

      Regarding the point on calcium imaging being performed on L2/3 SST and VIP cells instead of L4: Reviewers 1 and 2 both raised the same issue, and we responded as follows. As shown in our supplementary figure, the divergence is also observed in L2/3 where we do not have a differential distribution of SST cells, at least based on a columnar analysis extending from L4. There are multiple scenarios that could explain this “discrepancy” that one would need to examine further in future studies. One straightforward one is that the divergence in spiking in L2/3 domains may be inherited from L4 domains, on which L4 SST neurons act. Another is that even though L2/3 SST neurons are not biased in their distribution, their input-output function is, something which one would need to examine by detailed in vitro electrophysiological and perhaps optogenetic approaches in S1. Despite the distinctive differences that have been found between the L4 circuitry in S1 and V1 (Scala F et al., 2019), recent observations indicate that small but regular patches of V1 marked by the absence of muscarinic receptor 2 (M2) have high temporal acuity (Ji et al., 2015), and selectively receive input from SST interneurons (Meier et al., 2025). Regions lacking M2 have distinct input and output connectivity patterns from those that express M2 (Meier et al., 2021; Burkhalter et al., 2023). These findings, together with ours, suggest that SST cells preferentially innervate and regulate specific domains (columns) in sensory cortices.

      Furthermore, while the role of Elfn1 in mediating short-term facilitation is supported by prior studies, no new evidence is presented in this paper to confirm that this synaptic mechanism is indeed disrupted in the knockout mice used here.

      We thank Reviewer #3 for noting the absence of new evidence confirming Elfn1’s disruption of short-term facilitation in our knockout mice. We acknowledge that our study relies on strong previously published data demonstrating that Elfn1 mediates short-term synaptic facilitation of excitatory inputs onto SST+ interneurons (Sylwestrak and Ghosh, 2012; Tomioka et al., 2014; Stachniak et al., 2019, 2023). These studies consistently show that Elfn1 knockout abolishes facilitation in SST+ synapses, leading to altered temporal dynamics, which we hypothesize underlies the observed loss of barrel-septa response divergence in our Elfn1 KO mice (Figure 4). Nevertheless, to address the point raised, we will clarify in the revised manuscript (around lines 245-247 and 271-272) that our conclusions are based on these established findings, stating: “Building on prior evidence that Elfn1 knockout disrupts short-term facilitation in SST+ interneurons (Sylwestrak and Ghosh, 2012; Tomioka et al., 2014; Stachniak et al., 2019, 2023), we attribute the abolished barrel-septa divergence in Elfn1 KO mice to altered SST+ synaptic dynamics, though direct synaptic measurements were not performed here.”

      Additionally, since Elfn1 is constitutively knocked out from development, the possibility of altered circuit formation-including changes in barrel structure and interneuron distribution, cannot be excluded and is not addressed.

      We thank Reviewer #3 for raising the valid concern that constitutive Elfn1 knockout could potentially alter circuit formation, including barrel structure and interneuron distribution. To address this, we will clarify in the revised manuscript (around line ~271 and in the Discussion) that in our previous studies that included both whole-cell patch-clamp in acute brain slices ranging from postnatal day 11 to 22 (P11 - P21) and in vivo recordings from barrel cortex at P12 and P21, we saw no gross abnormalities in barrel structure, with Layer 4 barrels maintaining their characteristic size and organization, consistent with wildtype (WT) mice (Stachniak et al., 2019, 2023). While we cannot fully exclude subtle developmental changes, prior studies indicate that Elfn1 primarily modulates synaptic function rather than cortical cytoarchitecture (Tomioka et al., 2014). Elfn1 KO mice show no gross morphological or connectivity differences and the pattern and abundance of Elfn1 expressing cells (assessed by LacZ knock in) appears normal (Dolan and Mitchell, 2013).

      We will add the following to the Discussion: “Although Elfn1 is constitutively knocked out, we find here and in previous studies that barrel structure is preserved (Stachniak et al., 2019, 2023). Further, the distribution of Elfn1 expressing interneurons is not different in KO mice, suggesting minimal developmental disruption (Dolan and Mitchell, 2013).

      Nonetheless, we acknowledge that subtle circuit changes cannot be ruled out without the use of a time-dependent conditional knockout of the gene.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) My biggest concern is regarding statistics. Did the authors repeatedly apply independent tests (Mann-Whitney) without any correction for multiple comparisons (Figures 1 and 4)? In that case, the chances of a spurious "significant" result rise dramatically. 

      In response to the reviewer’s comment, we now present new statistical results obtained with ANOVA and have incorporated them into the manuscript between lines 172 and 192 for the WT data and between lines 282 and 298 for the Elfn1 KO data. This new statistical approach shows the same differences as we had previously reported, thereby consolidating the statements made.
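
      The reviewer's concern about uncorrected repeated testing can be made concrete with a short calculation. The sketch below assumes m independent tests, each run at the same per-test threshold alpha; this is a simplification, since per-pulse comparisons on the same recordings are unlikely to be fully independent. The function names and example values are illustrative and are not taken from the manuscript.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests,
    each performed at significance threshold alpha."""
    if not 0.0 < alpha < 1.0 or n_tests < 1:
        raise ValueError("alpha must lie in (0, 1) and n_tests must be >= 1")
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be >= 1")
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical example: one test per pulse in a 20-pulse train.
    for m in (1, 5, 20):
        print(m, round(familywise_error_rate(0.05, m), 4), bonferroni_threshold(0.05, m))
```

      Under these assumptions, the chance of at least one spurious "significant" result grows quickly with the number of tests, which is why a single omnibus test or an explicit correction is usually preferred.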

      (2) The findings only hint at a mechanism involving SST+ neurons for how SWS and MWS are processed differently in the barrel vs septal domains. As a direct test of SST+ neuron involvement in the divergence of barrel and septal responses, the authors might consider SST-specific manipulations - for example, inhibitory chemo- or optogenetics during SWS and MWS stimulation.

      We thank the reviewer for this comment and agree that direct manipulation of SST+ neurons via inhibitory chemo- or optogenetics could provide further supporting evidence for the main claims of our study. We have opted not to perform these experiments for this manuscript, as we feel they can be part of a future study. At the same time, it is conceivable that such manipulations, depending on how they are performed, may lead to larger and non-specific effects on cortical activity, since SST neurons will likely be completely shut down. So even though we certainly appreciate and value the strengths of such approaches, our experiments have addressed a more nuanced hypothesis, namely that the synaptic dynamics onto SST+ neurons matter for the response divergence of septa versus barrels, which could not have been easily and concretely addressed by manipulating SST+ cell firing activity.

      (3) In general, it is hard to comprehend what microcircuit could lead to the observed divergence in the MWS/SWS ratio in the barrel vs septal domain. The preferential recruitment of SST+ neurons during MWS is not specific to a particular domain, and the higher density of SST+ neurons specifically in L4 septa cannot per se explain the diverging MWS/SWS ratio in L4 septal neurons since similar ratio divergence is observed across domains in L2/3 neurons without increased SST+ neuron density in L2/3. This view would also assume that SST+ inhibition remains contained to its own layer and domain. Is this the case? Is it that different microcircuits between barrels and septa differently shape the response to repeated MWS? This is partially discussed in the paper; can the authors develop on that? What would the proposed mechanism be? Can the short-term plasticity of the thalamic inputs (VPM vs POm) be part of the picture?

      We thank the reviewer for raising this important point. We propose that the divergence in MWS/SWS ratios across barrel and septal domains arises from dynamic microcircuit interactions rather than from static anatomical features such as the SST+ density that we describe, which can provide a hint. In L2/3, where SST+ density is uniform, divergence persists, suggesting that trans-laminar and trans-domain interactions are key. Barrel domains, primarily receiving VPM inputs, exhibit short-term depression onto excitatory cells and engage PV+ and SST+ neurons to stabilize the MWS/SWS ratio, with Elfn1-dependent facilitation of SST+ neurons gradually increasing inhibition during repetitive SWS. Septal domains, in contrast, are targeted by facilitating POm inputs, combined with higher L4 SST+ density and Elfn1-mediated facilitation, producing progressive inhibitory buildup that amplifies the MWS/SWS ratio. SST+ projections in septa may extend trans-laminarly and laterally, influencing L2/3 and neighboring barrels, thereby explaining L2/3 divergence despite uniform SST+ density in L2/3. In this regard, direct laminar-dependent manipulations will be required to confirm whether L2/3 divergence is inherited from L4 dynamics. In Elfn1 KO mice, the loss of facilitation in SST+ neurons likely flattens these dynamics, disrupting functional segregation. Future experiments using VPM/POm-specific optogenetic activation and SST+ silencing will be critical to directly test this model.

      We expanded the discussion accordingly.

      (4) Can the decoder generalize between SWS and MWS? In this condition, if the decoder accuracy is higher for barrels than septa, it would support the idea that septa are processing the two stimuli differently. 

      Our results show that septal decoding accuracy is generally higher than barrel accuracy when generalizing from multi-whisker stimulation (MWS) to single-whisker stimulation (SWS), indicating distinct information processing in septa compared to barrels.

      In wild-type (WT) mice, septal accuracy exceeds barrel accuracy across all time windows (1-50ms, 51-95ms, 1-95ms), with the largest difference in the 51-95ms window (0.9944 vs. 0.9214 at pulse 20, 10Hz stimulation). This septal advantage grows with successive pulses, reflecting robust, separable neural responses, likely driven by the posterior medial nucleus (POm)’s strong MWS integration contrasting with minimal SWS activation. Barrel responses, driven by consistent ventral posteromedial nucleus (VPM) input for both stimuli, are less distinguishable, leading to lower accuracy.

      In Elfn1 knockout (KO) mice, in which the excitatory drive to somatostatin-positive (SST+) interneurons is disrupted, barrel accuracy is higher initially in the 1-50ms window (0.8045 vs. 0.7500 at pulse 1), suggesting reduced early septal distinctiveness. However, septal accuracy surpasses barrel accuracy in later pulses and time windows (e.g., 0.9714 vs. 0.9227 in 51-95ms at pulse 20), indicating restored septal processing. This supports the role of SST+ interneurons in shaping distinct MWS responses in septa, particularly in late-phase responses (51-95ms), where inhibitory modulation is prominent, as confirmed by calcium imaging showing stronger SST+ activation during MWS.

      These findings demonstrate that septa process SWS and MWS differently, with higher decoding accuracy reflecting structured, POm- and SST+-driven response patterns. In Elfn1 KO mice, early deficits in septal processing highlight the importance of SST+ interneurons, with later recovery suggesting compensatory mechanisms. 

      We have added Supplementary Figure 4 and included this interpretation between lines 338-353.

      We thank the reviewer for suggesting this analysis.

      (5) It is not clear to me how the authors achieve SWS. How is it that the pipette tip "placed in contact with the principal whisker" does not detach from the principal whisker or stimulate other whiskers? Please clarify the methods. 

      Targeting the specific principal whisker is performed under the stereoscope.  

      Specifically, we have added this statement in line 628:

      “We trimmed the whiskers where necessary, to avoid them touching each other and to avoid stimulating other whiskers. By putting the pipette tip very close (almost touching) to the principal whisker, the movement of the tip (limited to 1mm) would reliably move the targeted whisker. The specificity of the stimulation of the selected principal whisker was observed under the stereoscope.”

      (6) The method for calculating decoder accuracy is not clearly described: how can accuracy exceed 1? The authors should clarify this metric and provide measures of variability (e.g., confidence intervals or standard deviations across runs) to assess the significance of their comparisons. Additionally, using a consistent scale across all plots would improve interpretability.

      We thank the reviewer for raising this point. We have now changed the way accuracies are calculated and adopted a common scale among different plots (see updated Figure 5). We have also changed the methods section accordingly.
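
      For reference, classification accuracy under its conventional definition is the proportion of correct predictions and therefore lies between 0 and 1. The sketch below is a generic illustration of that definition, together with the kind of across-run variability summary the reviewer requests. It is not the authors' revised method, which the response does not describe; the function names and toy labels are hypothetical.

```python
from statistics import mean, stdev


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels (always in [0, 1])."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def summarize_runs(accuracies):
    """Mean and sample standard deviation of accuracy across repeated runs.

    Requires at least two runs, because the sample SD is undefined for one.
    """
    return mean(accuracies), stdev(accuracies)


if __name__ == "__main__":
    # Toy stimulus labels for one decoding run (hypothetical data).
    truth = ["SWS", "MWS", "MWS", "SWS"]
    guess = ["SWS", "MWS", "SWS", "SWS"]
    print(accuracy(truth, guess))
    print(summarize_runs([0.5, 0.75, 1.0]))
```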

      (7) Figure 1: The sample size is not specified. It looks like the numbers match the description in the methods, but the sample size should be clearly stated here. 

      These are the numbers the reviewer is inquiring about. 

      WT: a 280 × 95 × 20 matrix for the stimulated barrel (14 barrels, 95 ms, 20 pulses), a 180 × 95 × 20 matrix for the septa (9 septa, 95 ms, 20 pulses), and a 360 × 95 × 20 matrix for the neighboring barrel (18 neighboring barrels, 95 ms, 20 pulses). N=4 mice.

      KO: 11 barrel columns, 7 septal columns, 11 unstimulated neighbors from N=4 mice.

      Panels D-F are missing axes and axis labels (firing rate, p-value). Panel D is mislabeled (left, middle, and right). I can't seem to find the yellow line. 

      Thank you for this observation. We made changes in the figures to make them easier to navigate based on the collective feedback from the reviewers.

      Why change the way the responses to repeated stimulation are compared between SWS and MWS?

      To assess temporal accumulation of information, we compared responses to repeated single-whisker stimulation (SWS) and multi-whisker stimulation (MWS) using an accumulative decoding approach rather than simple per-pulse firing rates. This method captures domain-specific integration dynamics over successive pulses.

      The use of the term "principal whisker" is confusing, as it could refer to the whisker that corresponds to the recorded barrel. 

      When we use the term principal whisker, the intention is indeed to refer to the whisker corresponding to the recorded barrel during single whisker stimulation. The term principal whisker is removed from Figure legend 1 and legend S1C where it may have led to ambiguity.

      Why the statement "after the start of active whisking"? Mice are under anesthesia here; it does not appear to be relevant for the figure. 

      “After the start of active whisking” refers to the state of the barrel cortex circuitry at the time of recordings. The particular reference we use comes from the habit of assessing sensory processing also from a developmental point of view. The reviewer is correct that it has nothing to do with the status of the experiment. Nevertheless, since the reviewer found that it may create confusion, we have now taken it out.

      (8) Figure 3: The y-axis label is missing for panel C. 

      This is now fixed. (dF/F).

      (9) Figure 4: Axis labels are missing.

      Added.

      Minor: 

      (10) Line 36: "progressive increase in septal spiking activity upon multi-whisker stimulation". There is no increase in septal spiking activity upon MWS; the ratio MWS/SWS increases.

      We have changed the sentence as follows: Genetic removal of Elfn1, which regulates the incoming excitatory synaptic dynamics onto SST+ interneurons, leads to the loss of the progressive increase in septal spiking ratio (MWS/SWS) upon stimulation.

      (11) Line 105: domain-specific, rather than column-specific, for consistency.

      We have changed it.

      (12) Lines 173-174: "a divergence between barrel and septa domain activity also occurred in Layer 4 from the 2nd pulse onward (Figure 1E)". The authors only show a restricted number of comparisons. Why not show the p-values as for SWS?

      The statistics are now presented in the current Figure 1E.

      (13) Lines 151-153: "Correspondingly, when a single whisker is stimulated repeatedly, the response to the first pulse is principally bottom-up thalamic-driven responses, while the later pulses in the train are expected to also gradually engage cortico-thalamo-cortical and cortico-cortical loops." Can the authors please provide a reference?

      We have now added the following references: (Kyriazi and Simons, 1993; Middleton et al., 2010; Russo et al., 2025).

      (14) Lines 184-186: "Our electrophysiological experiments show a significant divergence of responses over time upon both SWS and MWS in L4 between barrels (principal and neighboring) and adjacent septa, with minimal initial difference". The only difference between the neighboring barrel and septa is the responses to the initial pulse. Can the author clarify? 

      We have now changed the sentence as follows: Our electrophysiological experiments show a significant divergence of responses between domains upon both SWS and MWS in L4. (Line 198 now)

      (15) Line 214: "suggest these interneurons may play a role in diverging responses between barrels and septa upon SWS". Why SWS specifically?

      We have changed the sentence as follows: These results confirmed that SST+ and VIP+ interneurons have higher densities in septa compared to barrels in L4 and suggest these interneurons may play a role in diverging responses between barrels and septa. (Line 231 now).

      (16) Line 235: "This result suggests that differential activation of SST+ interneurons is more likely to be involved in the domain-specific temporal ratio differences between barrels and septa". Why? The results here are not domain-specific.

      We have now revised this statement to: This result suggested that temporal ratio differences specific to barrels and septa might involve differential activation of SST+ interneurons rather than VIP+ interneurons.

      (17) Lines 241-243: "SST+ interneurons in the cortex are known to show distinct short-term synaptic plasticity, particularly strong facilitation of excitatory inputs, which enables them to regulate the temporal dynamics of cortical circuits." Please provide a reference.

      We have now added the following references: (Grier et al., 2023; Liguz-Lecznar et al., 2016).

      (18) Lines 245-247: "A key regulator of this plasticity is the synaptic protein Elfn1, which mediates short-term synaptic facilitation of excitation on SST+ interneurons (Stachniak et al., 2021, 2019; Tomioka et al., 2014)". Is Stachniak et al., 2021 not about the role of Elfn1 in excitatory-to-VIP+ neuron synapses?

      The reviewer correctly spotted this discrepancy. This reference has now been removed from this statement.

      (19) Lines 271-272: "Building on our findings that Elfn1-dependent facilitation in SST+ interneurons is critical for maintaining barrel-septa response divergence". The authors did not show that.

      We have now changed the statement to: Building on our findings that Elfn1 is critical for maintaining barrel-septa response divergence.

      (20) Line 280: second firing peak, not "peal".

      Thank you, it is now fixed.

      (21) Lines 304-305: "These results highlight the critical role of Elfn1 in facilitating the temporal integration of sensory inputs through its effects on SST+ interneurons". This claim is also overstated.

      We have now changed the statement to: These results highlight the contribution of Elfn1 to the temporal integration of sensory inputs. (Line 362)

      (22) Line 329: Any reason why not cite Chen et al., Nature 2013?

      We have now added this reference, as also pointed out by reviewer 1.

      (23) Line 341-342: "wS1" and "wS2" instead of S1 and S2 for consistency.

      Thanks, we have now updated the terms.

      Reviewer #2 (Recommendations for the authors): 

      (1) Figure 3D - the SW conditions are labeled but not the MW conditions (two right graphs) - they should be labeled similarly (SSTMW, VIPMW). 

      The two right graphs in Figure 3D represent paired SW vs MW comparisons of the evoked responses for SST and VIP populations, respectively.

      (2) Figure 6D and E: I think it would be better if the Depth measurements were to be on the y-axis, which is more typical of these types of plots.

      We thank the reviewer for this comment. Although we appreciate this may be the case, we feel that the current presentation may be easier for the reader to navigate, and we have hence kept it. 

      (3) Having an operational definition of septa versus barrel would be useful. As the authors point out, this is a tough distinction in a mouse, and often you read papers that use Barrel Wall versus Barrel Hollow/Center - operationally defining how these areas were distinguished would be helpful. 

      We thank the reviewer for this comment and understand the point made.

      We have now updated the methods section in line 611: 

      DiI marks contained within the vGlut2 staining were defined as barrel recordings, while DiI marks outside vGlut2 staining were septal recordings.

      Reviewer #3 (Recommendations for the authors): 

      To support the manuscript's major claims, the authors should consider the following:

      (1) Validate the septal identity of the neurons studied, either anatomically or functionally at the single-cell level (e.g., via Ca²⁺ imaging with confirmed barrel/septa mapping). 

      We thank the reviewer for this suggestion, but we feel that these extensive experiments are beyond the scope of this study. 

      (2) Provide both anatomical and physiological evidence to assess the possibility of altered cortical development in Elfn1 KO mice, including potential changes in barrel structure or SST⁺ cell distribution. 

      To address the reviewer’s point, we have now added the following to the Discussion: “Although Elfn1 is constitutively knocked out, we find here and in previous studies that barrel structure is preserved (Stachniak et al., 2019, 2023). Further, the distribution of Elfn1 expressing interneurons is not different in KO mice, suggesting minimal developmental disruption (Dolan and Mitchell, 2013). Nonetheless, we acknowledge that subtle circuit changes cannot be ruled out without conditional knockouts.”

      (3) Examine the sensory responses of SST⁺ and VIP⁺ interneurons in deeper cortical layers, particularly layer 4, which is central to the study's main conclusions.

      We thank the reviewer for this suggestion and appreciate the value it would bring to the study. We nevertheless feel that these extensive experiments are beyond the scope of this study and hence opted out from performing them. 

      Minor Comments:

      (1)  The authors used a CLARITY-based passive clearing protocol, which is known to sometimes induce tissue swelling or distortion. This may affect anatomical precision, especially when assigning neurons to narrow domains such as septa versus barrels. Please clarify whether tissue expansion was measured, corrected, or otherwise accounted for during analysis.

      Yes, tissue expansion was accounted for during analysis for the laminar specification. We excluded the brains with severe distortion.

      (2) While the anatomical data are plotted as a function of "depth from the top of layer 4," the manuscript does not specify the precise depth ranges used to define individual cortical layers in the cleared tissue. Given the importance of laminar specificity in projection and cell type analyses, the criteria and boundaries used to delineate each layer should be explicitly stated.

      Thank you for pointing this out. We now include the criteria for delineating each layer in the manuscript. “Given that the depth of Layer 4 (L4) can be reliably measured due to its well-defined barrel boundaries, and that the relative widths of other layers have been previously characterized (El-Boustani et al., 2018), we estimated laminar boundaries proportionally. Specifically, Layer 2/3 was set to approximately 1.3–1.5 times the width of L4, Layer 5a to ~0.5 times, and Layer 5b to a similar width as L4. Assuming uniform tissue expansion across the cortical column, we extrapolated the remaining laminar thicknesses proportionally.”
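
      The proportional rule quoted above can be written out as a short calculation. This is a minimal sketch that uses only the ratios stated in the response (Layer 2/3 at roughly 1.3–1.5 times L4, Layer 5a at about 0.5 times, Layer 5b about equal to L4). The function name, the dictionary layout, and the example L4 thickness of 200 µm are assumptions for illustration and are not measurements from the study.

```python
def estimate_layer_thicknesses(l4_thickness_um: float) -> dict:
    """Estimate (min, max) laminar thicknesses in micrometers from the measured
    L4 thickness, using the approximate ratios stated in the response."""
    if l4_thickness_um <= 0:
        raise ValueError("L4 thickness must be positive")
    l4 = float(l4_thickness_um)
    return {
        "L2/3": (1.3 * l4, 1.5 * l4),  # approximately 1.3-1.5 x L4
        "L4": (l4, l4),                # measured directly from barrel boundaries
        "L5a": (0.5 * l4, 0.5 * l4),   # approximately 0.5 x L4
        "L5b": (l4, l4),               # similar width to L4
    }


if __name__ == "__main__":
    # Hypothetical L4 thickness measured in (uniformly expanded) cleared tissue.
    for layer, (low, high) in estimate_layer_thicknesses(200.0).items():
        print(f"{layer}: {low:.0f}-{high:.0f} um")
```

      Because every layer is scaled from the measured L4 thickness, a uniform expansion of the tissue rescales all estimates by the same factor, which is the assumption the response states.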

      (3)  In several key comparisons (e.g., SST⁺ vs. VIP⁺ interneurons, or S2-projecting vs. M1-projecting neurons), it is unclear whether the same barrel columns were analyzed across conditions. Given the anatomical and functional heterogeneity across wS1 columns, failing to control for this may introduce significant confounds. We recommend analyzing matched columns across groups or, if not feasible, clearly acknowledging this limitation in the manuscript.

      We thank the reviewer for raising this important point. For the comparison of SST⁺ versus VIP⁺ interneurons, it would in principle have been possible to analyze the same barrel columns across groups. However, because some of the cleared brains did not reach the optimal level of clarity, our choice of columns was limited, and we were not always able to obtain sufficiently clear data from the same columns in both groups. Similarly, for the analysis of S2- versus M1-projecting neurons, variability in the position and spread of retrograde virus injections made it difficult to ensure measurements from identical barrel columns. We have now added a statement in the Discussion to acknowledge this limitation.

      (4) Figure 1C: Clarify what each point in the t-SNE plot represents, e.g., a single trial, a recording channel, or an averaged response. Also, describe the input features used for dimensionality reduction, including time windows and preprocessing steps.

      In response to the reviewer’s comment, we have now added the following in the methods: In summary, each point in the t-SNE plots represents an averaged response across 20 trials for a specific domain (barrel, septa, or neighbor) and genotype (WT or KO), with approximately 14 points per domain derived from the 280 trials in each dataset. The input features are preprocessed by averaging blocks of 20 trials into 1900-dimensional vectors (95ms × 20), which are then reduced to 2D using t-SNE with the specified parameters. This approach effectively highlights the segregation and clustering patterns of neural responses across cortical domains in both WT and KO conditions.
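
      The block-averaging step described above can be sketched as follows, assuming each trial is stored as a 95 × 20 array (time bins × pulses) and that trials are grouped consecutively into blocks of 20. The flattening order, the function names, and the synthetic data are assumptions for illustration. The t-SNE step itself is omitted because its parameters are not given in the response.

```python
from statistics import fmean


def flatten_trial(trial):
    """Flatten one trial (95 time bins x 20 pulses) into a 1900-value vector.

    Time-major ordering is an assumption; the response does not state the order.
    """
    return [value for time_bin in trial for value in time_bin]


def block_average(trials, block_size=20):
    """Average consecutive blocks of trials element-wise.

    Returns one flattened vector per block, e.g. 280 trials -> 14 vectors.
    """
    if not trials or len(trials) % block_size != 0:
        raise ValueError("number of trials must be a positive multiple of block_size")
    vectors = []
    for start in range(0, len(trials), block_size):
        block = [flatten_trial(t) for t in trials[start:start + block_size]]
        vectors.append([fmean(column) for column in zip(*block)])
    return vectors


if __name__ == "__main__":
    # Synthetic stand-in for a 280 x 95 x 20 dataset (trial index as the value).
    trials = [[[float(i)] * 20 for _ in range(95)] for i in range(280)]
    vectors = block_average(trials)
    print(len(vectors), len(vectors[0]))
```

      With 280 trials and blocks of 20, this yields 14 vectors of 1900 features each, matching the approximately 14 points per domain described in the response.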

      (5) Figures 1D, E (left panels): The y-axes lack unit labeling and scale bars. Please indicate whether values are in spikes/sec, spikes/bin, or normalized units.

      We have now clarified this. 

      (6) Figures 1D, E (right panels): The color bars lack units. Specify whether the values represent raw firing rates, z-scores, or other normalized measures. Replace the vague term "Matrix representation" with a clearer label such as "Pulse-aligned firing heatmap."

      Thank you, we have now done it.

      (7) Figure 1E (bottom panel): There appears to be no legend referring to these panels. Please define labels such as "B" and "S." 

      Thank you, we have now done it.

      (8) Figure 1E legend: If it duplicates the legend from Figure 1D, this should be made explicit or integrated accordingly. 

      We have changed the structure of this figure.

      (9) Figure 1F: Define "AUC" and explain how it was computed (e.g., area under the firing rate curve over 0-50 ms). Indicate whether the plotted values represent percentages and, if so, label the y-axis accordingly. If normalization was applied, describe the procedure. Include sample sizes (n) and specify what each data point represents (e.g., animal, recording site). 

      The following paragraph has been added in the methods section:

      The Area Under the Curve (AUC) was computed as the integral of the smoothed firing rate (spikes per millisecond) over a 50ms window following each whisker stimulation pulse, using trapezoidal integration. Firing rate data for layer 4 barrel and septal regions in wild-type (WT) and knockout (KO) mice were smoothed with a 3-point moving average and averaged across blocks of 20 trials. Plotted values represent the percentage ratio of multi-whisker (MW) to single whisker (SW) AUC with error bars showing the standard error of the mean. Each data point reflects the mean AUC ratio for a stimulation pulse across approximately 11 blocks (220 trials total). The y-axis indicates percentages.
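
      The AUC computation described above can be sketched as follows. This assumes a firing-rate trace sampled at 1 ms resolution that has already been averaged across a block of trials, and a 3-point centered moving average whose endpoints use only the available neighbors. The edge handling, the sampling step, the function names, and the toy traces are assumptions not stated in the response.

```python
def moving_average_3(values):
    """3-point centered moving average; endpoints average the available neighbors."""
    n = len(values)
    smoothed = []
    for i in range(n):
        window = values[max(0, i - 1):min(n, i + 2)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def trapezoid_auc(rates, dt_ms=1.0):
    """Trapezoidal integral of a firing-rate trace (spikes/ms) sampled every dt_ms."""
    if len(rates) < 2:
        raise ValueError("need at least two samples to integrate")
    return sum((a + b) / 2.0 * dt_ms for a, b in zip(rates, rates[1:]))


def mw_sw_ratio_percent(mw_rates, sw_rates, dt_ms=1.0):
    """Multi-whisker AUC as a percentage of single-whisker AUC for one pulse."""
    mw_auc = trapezoid_auc(moving_average_3(mw_rates), dt_ms)
    sw_auc = trapezoid_auc(moving_average_3(sw_rates), dt_ms)
    if sw_auc == 0:
        raise ValueError("single-whisker AUC is zero; ratio is undefined")
    return 100.0 * mw_auc / sw_auc


if __name__ == "__main__":
    # Toy traces covering 0-50 ms after a pulse (51 samples at 1 ms steps).
    sw = [0.1] * 51
    mw = [0.2] * 51
    print(trapezoid_auc(moving_average_3(sw)), mw_sw_ratio_percent(mw, sw))
```

      A ratio above 100% means the multi-whisker response exceeds the single-whisker response for that pulse; averaging the per-block ratios and taking their standard error would give the plotted points and error bars.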

      (10) Figure 3C: Add units to the vertical axis.

      We have added them.

      (11) Figure 3D: Specify what each line represents (e.g., average of n cells, individual responses?). 

      Each line represents an average response of a neuron.  

      (12) Figure 4C legend: Same with what?". No legend refers to the bottom panels - please revise to clarify. 

      Thank you. We have now changed the figure structure and legends and fixed the missing information issue.

      (13) Supplementary Figure 1B: Indicate the physical length of the scale bar in micrometers. 

      This has been fixed. The scale bar is 250 µm.

      (14) Indicate the catalog number or product name of the 8×8 silicon probe used for recordings.

      We have added this information. It is the A8x8-Edge-5mm-100-200-177-A64.

      References

      (1) Beierlein, M., Gibson, J. R. & Connors, B. W. (2003). Two dynamically distinct inhibitory networks in layer 4 of the neocortex. J. Neurophysiol. 90, 2987–3000.

      (2) Burkhalter, A., D’Souza, R. D. & Ji, W. (2023). Integration of feedforward and feedback information streams in the modular architecture of mouse visual cortex. Annu. Rev. Neurosci. 46, 259–280.

      (3) Chen, J. L., Margolis, D. J., Stankov, A., Sumanovski, L. T., Schneider, B. L. & Helmchen, F. (2015). Pathway-specific reorganization of projection neurons in somatosensory cortex during learning. Nat. Neurosci. 18, 1101–1108.

      (4) Connor, J. R. & Peters, A. (1984). Vasoactive intestinal polypeptide-immunoreactive neurons in rat visual cortex. Neuroscience 12, 1027–1044.

      (5) Cruikshank, S. J., Lewis, T. J. & Connors, B. W. (2007). Synaptic basis for intense thalamocortical activation of feedforward inhibitory cells in neocortex. Nat. Neurosci. 10, 462–468.

      (6) Dolan, J. & Mitchell, K. J. (2013). Mutation of Elfn1 in mice causes seizures and hyperactivity. PLoS One 8, e80491.

      (7) Gibson, J. R., Beierlein, M. & Connors, B. W. (1999). Two networks of electrically coupled inhibitory neurons in neocortex. Nature 402, 75–79.

      (8) Ji, W., Gămănuţ, R., Bista, P., D’Souza, R. D., Wang, Q. & Burkhalter, A. (2015). Modularity in the organization of mouse primary visual cortex. Neuron 87, 632–643.

      (9) Martin-Cortecero, J. & Nuñez, A. (2014). Tactile response adaptation to whisker stimulation in the lemniscal somatosensory pathway of rats. Brain Res. 1591, 27–37.

      (10) Mégevand, P., Troncoso, E., Quairiaux, C., Muller, D., Michel, C. M. & Kiss, J. Z. (2009). Long-term plasticity in mouse sensorimotor circuits after rhythmic whisker stimulation. J. Neurosci. 29, 5326–5335.

      (11) Meier, A. M., Wang, Q., Ji, W., Ganachaud, J. & Burkhalter, A. (2021). Modular network between postrhinal visual cortex, amygdala, and entorhinal cortex. J. Neurosci. 41, 4809–4825.

      (12) Meier, A. M., D’Souza, R. D., Ji, W., Han, E. B. & Burkhalter, A. (2025). Interdigitating modules for visual processing during locomotion and rest in mouse V1. bioRxiv 2025.02.21.639505.

      (13) Scala, F., Kobak, D., Shan, S., Bernaerts, Y., Laturnus, S., Cadwell, C. R., Hartmanis, L., Froudarakis, E., Castro, J. R., Tan, Z. H., et al. (2019). Layer 4 of mouse neocortex differs in cell types and circuit organization between sensory areas. Nat. Commun. 10, 4174.

      (14) Stachniak, T. J., Sylwestrak, E. L., Scheiffele, P., Hall, B. J. & Ghosh, A. (2019). Elfn1-induced constitutive activation of mGluR7 determines frequency-dependent recruitment of somatostatin interneurons. J. Neurosci. 39, 4461–4475.

      (15) Stachniak, T. J., Kastli, R., Hanley, O., Argunsah, A. Ö., van der Valk, E. G. T., Kanatouris, G. & Karayannis, T. (2021). Postmitotic Prox1 expression controls the final specification of cortical VIP interneuron subtypes. J. Neurosci. 41, 8150–8166.

      (16) Stachniak, T. J., Argunsah, A. Ö., Yang, J. W., Cai, L. & Karayannis, T. (2023). Presynaptic kainate receptors onto somatostatin interneurons are recruited by activity throughout development and contribute to cortical sensory adaptation. J. Neurosci. 43, 7101–7118.

      (17) Sun, Q.-Q., Huguenard, J. R. & Prince, D. A. (2006). Barrel cortex microcircuits: Thalamocortical feedforward inhibition in spiny stellate cells is mediated by a small number of fast-spiking interneurons. J. Neurosci. 26, 1219–1230.

      (18) Sylwestrak, E. L. & Ghosh, A. (2012). Elfn1 regulates target-specific release probability at CA1-interneuron synapses. Science 338, 536–540.

      (19) Tan, Z., Hu, H., Huang, Z. J. & Agmon, A. (2008). Robust but delayed thalamocortical activation of dendritic-targeting inhibitory interneurons. Proc. Natl. Acad. Sci. USA 105, 2187–2192.

      (20) Tomioka, N. H., Yasuda, H., Miyamoto, H., Hatayama, M., Morimura, N., Matsumoto, Y., Suzuki, T., Odagawa, M., Odaka, Y. S., Iwayama, Y., et al. (2014). Elfn1 recruits presynaptic mGluR7 in trans and its loss results in seizures. Nat. Commun. 5, 4501.

      (21) Yamashita, T., Vavladeli, A., Pala, A., Galan, K., Crochet, S., Petersen, S. S. & Petersen, C. C. (2018). Diverse long-range axonal projections of excitatory layer 2/3 neurons in mouse barrel cortex. Front. Neuroanat. 12, 33.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This important manuscript provides insights into the competition between Splicing Factor 1 (SF1) and Quaking (QKI) for binding at the ACUAA branch point sequence in a model intron, regulating exon inclusion. The study employs rigorous transcriptomic, proteomic, and reporter assays, with both mammalian cell culture and yeast models. Nevertheless, while the data are convincing, broadening the analysis to additional exons and narrowing the manuscript's title to better align with the experimental scope would strengthen the work.

      Public Reviews:

      Reviewer #1 (Public review):

      In this manuscript, the authors aimed to show that SF1 and QKI compete for the intron branch point sequence ACUAA and provide evidence that QKI represses inclusion when bound to it.

      Major strengths of this manuscript include:

      (1) Identification of the ACUAA-like motif in exons regulated by QKI and SF1.

      (2) The use of the splicing reporter and mutant analysis to show that upstream and downstream ACUAAC elements in intron 10 of RAI are required for repressing splicing.

      (3) The use of proteomics to identify proteins in C2C12 nuclear extract that bind to the wild type and mutant sequence.

      (4) The yeast studies showing ectopic lethality when Qki5 expression was induced, due to increased mis-splicing of transcripts that contain the ACUAA element.

      The authors conclusively show that the ACUAA sequence is bound by QKI and provide strong evidence that this leads to differences in exon inclusion and exclusion. In animal cells, and especially in human, branchpoint sequences are degenerate but seem to be recognized by specific splicing factors. Although a subset of splicing factors shows tissue-specific expression patterns, most don't, suggesting that yet-to-be-identified mechanisms regulate splicing. This work suggests that an alternate mechanism could be related to the binding affinity of specific RNA binding factors for branchpoint sequences coupled with the level of these different splicing factors in a given cell.

      We thank the reviewer for the positive comments.

      Reviewer #2 (Public review):

      Summary:

      In this manuscript, Pereira de Castro and coworkers are studying potential competition between a more standard splicing factor SF1, and an alternative splicing factor called QK1. This is interesting because they bind to overlapping sequence motifs and could potentially have opposing effects on promoting the splicing reaction. To test this idea, the authors KD either SF1 or QK1 in mammalian cells and uncover several exons whose splicing regulation follows the predicted pattern of being promoted for splicing by SF1 and repressed by QK1. Importantly, these have introns enriched in SF1 and QK1 motifs. The authors then focus on one exon in particular with two tandem motifs to study the mechanism of this in greater detail and their results confirm the competition model. Mass spec analysis largely agrees with their proposal; however, it is complicated by the apparently quick transition of SF1-bound complexes to later splicing intermediates. An inspired experiment in yeast shows how QK1 competition could potentially have a detrimental impact on splicing in an orthogonal system. Overall, these results show how splicing regulation can be achieved by competition between a "core" and alternative splicing factor and provide additional insight into the complex process of branch site recognition. The manuscript is exceptionally clear and the figures and data are very logically presented. The work will be valuable to those in the splicing field who are interested in both mechanism and bioinformatics approaches to deconvolve any apparent "splicing code" being used by cells to regulate gene expression. Criticisms are minor and the most important of them stem from overemphasis on parts of the manuscript on the evolutionary angle when evolution itself wasn't analyzed per se.

      We thank the reviewer for the positive comments and very clear and fair critical points.

      Strengths:

      (1) The main discovery of the manuscript involving evidence for SF1/QK1 competition is quite interesting and important for this field. This evidence has been missing and may change how people think about branch site recognition.

      (2) The experiments and the rationale behind them are exceptionally clearly and logically presented. This was wonderful!

      Thank you so much. We felt the overall flow of the paper and data make for a nice “story” that conveys a relatively easy-to-understand explanation for a complex subject.

      (3) The experiments are carried out to a high standard and well-designed controls are included.

      (4) The extrapolation of the result to yeast in order to show the potentially devastating consequences of the QK1 competition was very exciting and creative.

      We agree this is a very exciting result and finding! Thanks.

      Weaknesses:

      Overall the weaknesses are relatively minor and involve cases where clarification is necessary, some additional analysis could bolster the arguments, and suggestions for focusing the manuscript on its strengths.

      (1) The title (Ancient...evolutionary outcomes), abstract, and some parts of the discussion focus heavily on the evolutionary implications of this work. However, evolutionary analysis was not performed in these studies (e.g., when did QK1 and SF1 proteins arise and/or diverge? How does this line up with branch site motifs and evolution of U2? Any insight from recent work from Scott Roy et al?). I think this aspect either needs to be bolstered with experimental work/data or this should be tamped down in the manuscript. I suggest highlighting the idea expressed in the sentence "A nuanced implication of this model is that loss-of-function...". To me, this is better supported by the data and potentially by some analysis of mutations associated with human disease.

      We have revised the title and dampened the evolutionary aspects of the previous version of the manuscript.

      (2) One paper that I didn't see cited was that by Tanackovic and Kramer (Mol Biol Cell 2005). This paper is relevant because they KD SF1 and found it nonessential for splicing in vivo. Do their results have implications for those here? How do the results of the KD compare? Could QK1 competition have influenced their findings (or does their work influence the "nuanced implication" model referenced above?)?

      This is an interesting point, and thank you for the suggestion. We have now included a brief description of this study in the Introduction of the revised manuscript and do note that the authors measured intron retention of a beta globin reporter and SF3A1, SF3A2, and SF3A3 during SF1 knockdown, but did not detect elevated unspliced RNA in these targets.

      (3) Can the authors please provide a citation for the statement "degeneracy is observed to a higher degree in organisms with more alternative splicing"? Does recent evolutionary analysis support this?

      We have removed the statement, as it did not add much to the content and I am not sure I can state the concept I was attempting to convey in a simple manner with few citations.

      (4) For the data in Figure 3, I was left wondering if NMD was confounding this analysis. Can the authors respond to this and address this concern directly?

      We have not measured whether the reporters used in Figure 3 produce protein(s). Presumably, though, all spliced reporter RNA would be degraded equally (the included/skipped isoforms’ “reading frames” are not altered from one another). This would not be the case for unspliced nuclear reporter RNA, however. Given this difference, and that our analysis cannot resolve the subcellular localization of the different reporter species, we have removed the measurement of and subsequent results describing unspliced reporter RNA from Figure 3.

      (5) To me, the idea that an engaged U2 snRNP was pulled down in Figure 4F would be stronger if the snRNA was detected. Was that able to be observed by northern or primer extension? Would SF1 be enriched if the U2 snRNA was degraded by RNaseH in the NE?

      We did not measure any co-associating RNAs in this experimental approach, but agree that this approach would strengthen the evidence for it.

      (6) I'm wondering how additive the effects of QK1 and SF1 are... In Figure 2, if QK1 and SF1 are both knocked down, is the splicing of exon 11 restored to "wt" levels?

      This is an interesting question that we were unfortunately unable to address experimentally here.

      (7) The first discussion section has two paragraphs that begin "How does competition between SF1..." and "Relatively little is known about how...". I found the discussion and speculation about localization, paraspeckles, and lncRNAs interesting but a bit detracting from the strengths of the manuscript. I would suggest shortening these two paragraphs into a single one.

      We have revised the Discussion.

      Reviewer #3 (Public review):

      Summary:

      In this manuscript, the authors were trying to establish whether competition between the RNA-binding proteins SF1 and QKI controlled splicing outcomes. These two proteins have similar binding sites and protein sequences, but SF1 lacks a dimerization motif and seems to bind a single version of the binding sequence. Importantly, these binding sequences correspond to branchpoint consensus sequences, with SF1 binding leading to productive splicing, but QKI binding leading instead to association with paraspeckle proteins. They show that in human cells SF1 generally activates exons and QKI represses, and a large group of the jointly regulated exons (43% of joint targets) are reciprocally controlled by SF1 and QKI. They focus on one of these exons RAI14 that shows this reciprocal pattern of regulation, and has 2 repeats of the binding site that make it a candidate for joint regulation, and confirm regulation within a minigene context. The authors used the assembly of proteins within nuclear extracts to explain the effect of QKI versus SF1 binding. Finally, the authors show that the expression of QKI is lethal in yeast, and causes splicing defects.

      How this fits in the field. This study is interesting and provides a conceptual advance by providing a general rule on how SF1 and QKI interact in relation to binding sites, and the relative molecular fates followed, so is very useful. Most of the analysis seems to focus on one example, although the molecular analysis and global work significantly add to the picture from the previously published paper about NUMB joint regulation by QKI and SF (Zong et al, cited in text as reference 50, that looked at SF1 and QKI binding in relation to a duplicated binding site/branchpoint sequence in NUMB).

      Thank you for the encouraging remarks.

      Strengths:

      The data presented are strong and clear. The ideas discussed in this paper are of wide interest, and present a simple model where two binding sites generate a potentially repressive QKI response, whereas exons that have a single upstream sequence are just regulated by SF1. The assembly of splicing complexes on RNAs derived from RAI14 in nuclear extracts, followed by mass spec gave interesting mechanistic insight into what was occurring as a result of QKI versus SF1 binding.

      Weaknesses:

      I did not think the title best summarises the take-home message and could be perhaps a bit more modest. Although the authors investigated splicing patterns in yeast and human cells, yeast do not have QKI so there is no ancient competition in that case, and the study did not really investigate physiological or evolutionary outcomes in splicing, although it provides interesting speculation on them. Also as I understood it, the important issue was less conserved branchpoints in higher eukaryotes enabling alternative splicing, rather than competition for the conserved branchpoint sequence. So despite the data being strong and properly analysed and discussed in the paper, could the authors think whether they fit best with the take-home message provided in the title? Just as a suggestion (I am sure the authors can do a better job), maybe "molecular competition between variant branchpoint sequences predict physiological and evolutionary outcomes in splicing"?

      Thank you for this point (Reviewer 2 had a similar comment) and the suggestion. We have revised the title.

      Although the authors do provide some global data, most of the detailed analysis is of RAI14. It would have been useful to examine members of the other quadrants in Figure 1C as well for potential binding sites to give a reason why these are not co-regulated in the same way as RAI14. How many of the RAI14 quadrants had single/double sites (the motif analysis seemed to pull out just one), and could one of the non-reciprocally regulated exons be moved into a different quadrant by addition or subtraction of a binding site or changing the branchpoint (using a minigene approach for example).

      This is an interesting point that we have considered. Our intent with the focus on RAI14 was to use a naturally occurring intron bps with evidence of strong QKI binding that did not require a high degree of sequence manipulation or engineering.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Most of my recommendations are really centered on the figures. In their current state, they detract from the data shown and could be improved. I recommend the authors use a uniform font. For example, Figure 1E and F have at least three different fonts of varying sizes, making it very messy. In Figure 1C, the authors could bold the Ral14 ex11 or simply indicate that the blue is this exon in the legend, thus removing the text from this very busy graph. In Figure 4F, I would recommend having all the labels the same size and putting those genes of interest like Sf3a1 in bold. This could also be done in Figure 4E.

      Thank you for the suggestion and we have edited these (FYI the font in Fig’s 1E and 1F were from the rMAPS default output, but I agree, it gives a sloppy appearance).

      (2) In Figures 4D and 4G, is there QKI binding to the downstream deletion mutant after 30 minutes? Also, in Figure 4G, are these all from the same blot? The band sizes seem to be very different between lanes. If these were not on the same blot, the original gels should be submitted.

      A small amount of Qki appears to be binding after 30 min. All lanes/blots are from the same gels/membranes; see new Supplemental Figure 4 for the original (uncropped) images of the blots.

      (3) The authors should indicate the source and concentration of the antibodies used for their WB. They should also indicate the primers used for RT-PCRs.

      We have revised the methods to include the antibody information and have uploaded a supplemental table 8 with all oligonucleotide sequences used (which I (Sam Fagg) neglected to do initially, so that’s my bad).

      Reviewer #2 (Recommendations for the authors):

      (1) This may come down to the author's preference but branch point and branch site are frequently two words, not a single compound word (branch point vs. branchpoint). In addition, the authors may want to use branchsite with the abbreviation BS more frequently since they often don't describe the specific point of branching, and bp and bps could be confused for the more frequent abbreviations for base pair(s).

      Good suggestion; we have edited the text accordingly.

      (2) In general the addition of page numbers and line numbers to the manuscript would greatly aid reviewers!

      Point taken…

      (3) Introduction; "...under normal growth conditions they are efficiently spliced". I would say MOST introns in yeast are efficiently spliced. This is definitely not universal.

      Text edited to indicate that most are efficiently spliced.

      (4) Introduction; " recognition of the bps by SF1 (mammals) (20)". The choice of reference 20 is an odd one here. I think the Robin Reed and Michael Rosbash paper was the first to show SF1 was the human homolog of BBP.

      Got it, thanks (added #14 here and kept #20 also since it shows the structure of SF1 in complex with a UACUAAC bps.)

      (5) Results; "QK1 and SF1 co-regulate.."; it may be useful for the reader if you could explain in more detail why exon inclusion and intron retention are expected outcomes for QK1 knockdown and vice versa for SF1. The exon inclusion here is more obvious than the intron retention phenotype. (In other words, if more exons are included shouldn't it follow that more introns are removed?)

      We explain the expected results for exon inclusion in the Introduction and this paragraph of the Results. Although we have observed more intron retention under QKI loss-of-function approaches before, I am uncertain where the reviewer sees that we indicate any expected result for intron retention from either QKI or SF1 knockdown. I believe the statement you refer to might be on line 162 and starts with: “Consistent with potentially opposing functions in splicing…” ?

      Also, I agree that if SF1 is a “splicing activator,” one might expect more IR in its absence (but this is not the case; there is, in fact, less), but nonetheless, the opposite outcome is observed with QKI knockdown (more IR). It is unclear why this is the case, and we did not investigate it.

      (6) Results; "QK1 and SF1 co-regulate.."; "Thus the most highly represented set.." To me, the most highly represented set is those which are not both QK1-repressed and SF1-activated. Does this indicate that other factors are involved at most sites than simple competition between these two?

      We have revised the sentence in question to include the text “by quadrant” in order to convey our meaning more precisely.

      (7) Throughout the manuscript, 5 apostrophes and 3 apostrophes are used instead of 5 prime symbols and 3 prime symbols.

      Thank you for pointing that out. We have fixed each instance of this.

      (8) Sometimes SF1 is written as Sf1. (also Tatsf1)

      This was a mouse/human gene/protein nomenclature error that we have fixed; thank you for pointing this out.

      (9) You may want to make sure that figures are labeled consistently with the manuscript text. In Figure 1B, it is RI rather than IR. In Figure 4 it is myoblast NE rather than C2C12 nuclear extract.

      We have fixed these, checked for other examples, and where relevant, edited those too.

      (10) I think Figure 1A could be improved by also including a depiction of the domain arrangements of SF1 and QK1.

      Done.

      (11) I was a bit confused with all the lines in Figure 1E and 1F. What is the difference between the log (pVal) and upregulated plots? Can these figures be simplified or explained more thoroughly?

      Based on this comment and one from Reviewer 1, we have slightly revised the wording (and font) on the output, which hopefully clarifies. These are motif enrichment plots generated by rMAPS (Refs 61 and 62) analysis of rMATS (Ref 60) data for exons more included (depicted by the red lines) or more skipped (depicted by the blue lines) compared to control versus a “background” set of exons that are detectable but unchanged. The -log<sub>10</sub> P-value (dotted line) indicates the significance of exons more included in shRNA treatment vs control shRNA (previously read “upregulated”) compared to background exons that are detectable but unchanged; the solid lines indicate the motif score; these are described in the references indicated.

      (12) Figure 1B, it is a bit hard to conclude that there is more AltEx or "RI/IR" in one sample vs. the other from these plots since the points overlay one another. Can you include numbers here?

      Added (and deleted Suppl Fig S1, which was simply a chart showing the numbers).

      (13) How was PSI calculated in Figure 2A?

      VAST-tools (we state this in the legend in the revised version).
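
      For readers unfamiliar with the metric, percent spliced in (PSI) is generically the share of reads supporting exon inclusion among all reads informative for that exon. The sketch below shows only this generic definition; the function names and the read counts in the usage note are hypothetical, and it does not reproduce the mapping or correction steps that VAST-tools applies internally.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """Generic PSI: percentage of informative reads that support exon inclusion.

    Hypothetical helper for illustration; not the VAST-tools implementation.
    Returns None when no informative reads are available.
    """
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return 100.0 * inclusion_reads / total


def delta_psi(psi_treatment, psi_control):
    """Change in PSI between a knockdown sample and its control."""
    return psi_treatment - psi_control
```

      With made-up counts of 30 inclusion and 10 exclusion reads, `percent_spliced_in(30, 10)` gives 75.0; a control at 50.0 would then correspond to a delta PSI of 25.0.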

      You may want to include rel protein (or the lower limit of detection) for Figure 2B to be consistent with 2C. Why is KD of SF1 so poor and variable between 2C and 2D?

      We have not investigated this, but these blots show an optimized result that we were able to obtain for the knockdown in each cell type. It may be that HEK293 cells (Fig 2B) have a stronger requirement for SF1 than C2C12 cells…? I would argue that it is not necessarily “poor” in Fig 2C, as we observe ~70% depletion of the protein.

      Why are two bands present in the gel?

      Two to three isoforms of SF1 are present in most cell types.

      A good (or bad, really) example of an SF1 western blot (and knockdown of ~35% in K562 or ~45% in HepG2) can also be seen on the ENCODE project website, for reference:

      https://www.encodeproject.org/documents/6001a414-b096-4073-94ff-3af165617eb5/@@download/attachment/SF1_BGKLV28-49.pdf

      By comparison, I think ours are much more cosmetically pleasing, and our knockdown (especially in C2C12) is much more efficient.

      (14) Figure 3, The asterisk refers to a cryptic product. Can the uaAcuuuCAG be used as a branch point? Presumably the natural 3' SS is now too close so this would result in activation of a downstream 3'SS?

      We did not pursue determining the identity of this minor and likely artefactual product, but we (and others) have observed a similar phenomenon when using splicing reporter-based mutational approaches.

      (15) For the methods. The "RNA extraction, RT -PCR,..." subheading needs to be on its own line. Please add (w/v) or (v/v) to percentages where appropriate. Please convert ug to the symbol for "micro".

      Thank you, we have made these changes.

      (16) In Figure 4B, the text here and legend are microscopic. Even with reading glasses, I couldn't make anything out!

      We have increased the font sizes for the text and scale bar…when referring to “legend” does the reviewer mean the scale bar?

      (17) As a potential discussion item, it is worth noting that SF1 could also repress splicing if it could either not engage with U2AF or be properly displaced by U2 snRNP so the snRNA could pair. I was wondering if QK1 could similarly be activating if it could engage with U2AF. I'm unsure if this could be tested by domain swaps (and is beyond the scope of this paper). It just may be worth speculating about.

      Good point and suggestion…we are looking into this.

      Reviewer #3 (Recommendations for the authors):

      (1) Is the reference in the text to Figure 5F correct for actin splicing (this is just before the discussion)?

      I see references several lines up from this, but I do not see a reference just before the discussion…?

      (2) I was not sure why the minigene experiments showed such high levels of intron retention that seemed to be impacted also by deletion of the branchpoint sequences, and suggest that the two branchpoints are not equal in strength.

      Neither were we, but Reviewer 2 has suggested that degradation of the spliced products could be rapid (NMD substrates) which could complicate the interpretation of what appears to be higher levels of intron retention. Given the possibility that this could be a non-physiological artefact, we have removed the measurement of unspliced reporter and now only show the spliced products (equally subject to degradation) and report their percent inclusion.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We thank the editors of eLife and the reviewers for their thorough evaluation of our study. As regards the final comments of reviewer 1, please note that all experimental replicates were first analyzed separately and were then pooled, since the observed changes were comparable between experiments. This means that statistical analyses were done on pooled biological replicates.


      The following is the authors’ response to the original reviews.

      General Statements

      We thank the reviewers for their thorough and constructive evaluation of our work. We have revised the manuscript carefully and addressed all the criticisms raised, in particular the issues mentioned by several of the reviewers (see point-by-point response below). We have also added a number of explanations in the text for the sake of clarity, while trying to keep the manuscript as concise as possible.

      In our view, the novelty of our research is two-fold. From a neurobiological point of view, we provide conclusive evidence for the existence of glycine receptors (GlyRs) at inhibitory synapses in various brain regions including the hippocampus, dentate gyrus and sub-regions of the striatum. This solves several open questions and has fundamental implications for our understanding of the organisation and function of inhibitory synapses in the telencephalon. Secondly, our study makes use of the unique sensitivity of single molecule localisation microscopy (SMLM) to identify low protein copy numbers. This is a new way to think about SMLM as it goes beyond a mere structural characterisation and towards a quantitative assessment of synaptic protein assemblies.

      Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity): 

      In this manuscript, the authors investigate the nanoscopic distribution of glycine receptor subunits in the hippocampus, dorsal striatum, and ventral striatum of the mouse brain using single-molecule localization microscopy (SMLM). They demonstrate that only a small number of glycine receptors are localized at hippocampal inhibitory synapses. Using dual-color SMLM, they further show that clusters of glycine receptors are predominantly localized within gephyrin-positive synapses. A comparison between the dorsal and ventral striatum reveals that the ventral striatum contains approximately eight times more glycine receptors and this finding is consistent with electrophysiological data on postsynaptic inhibitory currents. Finally, using cultured hippocampal neurons, they examine the differential synaptic localization of glycine receptor subunits (α1, α2, and β). This study is significant as it provides insights into the nanoscopic localization patterns of glycine receptors in brain regions where this protein is expressed at low levels. Additionally, the study demonstrates the different localization patterns of GlyR in distinct striatal regions and its physiological relevance using SMLM and electrophysiological experiments. However, several concerns should be addressed. 

      The following are specific comments: 

      (1) Colocalization analysis in Figure 1A. The colocalization between Sylite and mEos-GlyRβ appears to be quite low. It is essential to assess whether the observed colocalization is not due to random overlap. The authors should consider quantifying colocalization using statistical methods, such as a pixel shift analysis, to determine whether colocalization frequencies remain similar after artificially displacing one of the channels. 

      Following the suggestion of reviewer 1, we re-analysed CA3 images of Glrb<sup>eos/eos</sup> hippocampal slices by applying a pixel-shift type of control, in which the Sylite channel (in far red) was horizontally flipped relative to the mEos4b-GlyRβ channel (in green, see Methods). As expected, the number of mEos4b-GlyRβ detections per gephyrin cluster was markedly reduced compared to the original analysis (revised Fig. 1B), confirming that the synaptic mEos4b detections exceed chance levels (see page 5). 
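
      The logic of such a flip control can be sketched as follows: detections are counted inside each labelled gephyrin cluster, and the count is then repeated after mirroring the cluster mask horizontally so that any remaining overlap reflects chance. This is a minimal illustration under stated assumptions (detections given as x, y coordinates in pixel units, clusters given as an integer label mask with 0 as background); the function names are hypothetical and this is not the analysis code used in the study.

```python
import numpy as np


def detections_per_cluster(detections_xy, cluster_mask):
    """Count detections that fall inside each labelled cluster.

    detections_xy: (n, 2) array of x, y positions in pixel units (assumed).
    cluster_mask: 2D integer array, 0 = background, 1..K = cluster labels.
    Returns an array of length K with the detection count per cluster.
    """
    cols = np.floor(detections_xy[:, 0]).astype(int)
    rows = np.floor(detections_xy[:, 1]).astype(int)
    # Discard detections that lie outside the image.
    inside = (
        (rows >= 0) & (rows < cluster_mask.shape[0])
        & (cols >= 0) & (cols < cluster_mask.shape[1])
    )
    labels = cluster_mask[rows[inside], cols[inside]]
    counts = np.bincount(labels, minlength=int(cluster_mask.max()) + 1)
    return counts[1:]  # drop the background label


def flip_control(detections_xy, cluster_mask):
    """Return (observed, control) counts; the control uses a mirrored mask."""
    flipped_mask = cluster_mask[:, ::-1]  # horizontal flip of the cluster channel
    observed = detections_per_cluster(detections_xy, cluster_mask)
    control = detections_per_cluster(detections_xy, flipped_mask)
    return observed, control
```

      If the observed counts per cluster clearly exceed the control counts across many images, the overlap is unlikely to result from random coincidence of the two channels.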

      (2) Inconsistency between Figure 3A and 3B. While Figure 3B indicates an ~8-fold difference in the number of mEos4b-GlyRβ detections per synapse between the dorsal and ventral striatum, Figure 3A does not appear to show a pronounced difference in the localization of mEos4b-GlyRβ on Sylite puncta between these two regions. If the images presented in Figure 3A are not representative, the authors should consider replacing them with more representative examples or providing expanded images with multiple representative examples. Alternatively, if this inconsistency can be explained by differences in spot density within clusters, the authors should explain that. 

      The pointillist images in Fig. 3A are essentially binary (red-black). Therefore, the density of detections at synapses cannot be easily judged by eye. For clarity, the original images in Fig. 3A have been replaced with two other examples that better reflect the different detection numbers in the dorsal and ventral striatum. 

      (3) Quantification in Figure 5. It is recommended that the authors provide quantitative data on cluster formation and colocalization with Sylite puncta in Figure 5 to support their qualitative observations. 

      This is an important point that was also raised by the other reviewers. We have performed additional experiments to increase the data volume for analysis. For quantification, we used two approaches. First, we counted the percentage of infected cells in which synaptic localisation of the recombinant receptor subunit was observed (Fig. 5C). We found that mEos4b-GlyRa1 consistently localises at synapses, indicating that all cells express endogenous GlyRb. When neurons were infected with mEos4b-GlyRb, fewer cells had synaptic clusters, meaning that indeed, GlyR alpha subunits are the limiting factor for synaptic targeting. In cultures infected with mEos4b-GlyRa2, only very few neurons displayed synaptic localisation (as judged by epifluorescence imaging). We think this shows that GlyRa2 is less capable of forming heteromeric complexes than GlyRa1, in line with our previous interpretation (see pp. 9-10, 13). 

      Secondly, we quantified the total intensity of each subunit at gephyrin-positive domains, both in infected neurons as well as non-infected control cultures (Fig. 5D). We observed that mEos4b-GlyRa1 intensity at gephyrin puncta was higher than that of the other subunits, again pointing to efficient synaptic targeting of GlyRa1. Gephyrin cluster intensities (Sylite labelling) were not significantly different in GlyRb and GlyRa2 expressing neurons compared to the uninfected control, indicating that the lentiviral expression of recombinant subunits does not fundamentally alter the size of mixed inhibitory synapses in hippocampal neurons. Interestingly, gephyrin levels were slightly higher in hippocampal neurons expressing mEos4b-GlyRa1. In our view, this comes from an enhanced expression and synaptic targeting of mEos4b-GlyRa1 heteromers with endogenous GlyRb, pointing to a structural role of GlyRa1/b in hippocampal synapses (pp. 10, 13).

      The new data and analyses have been described and illustrated in the relevant sections of the manuscript.

      (4) Potential for pseudo-replication. It's not clear whether they're performing stats tests across biological replicates, images, or even synapses. They often quote mean +/- SEM with n = 1000s, and so does that mean they're doing tests on those 1000s? Need to clarify. 

      All experiments were repeated at least twice to ensure reproducibility (N independent experiments). Statistical tests were performed on pooled data across the biological replicates; n denotes the number of data points used for testing (e.g., number of synaptic clusters, detections, cells, as specified in each case). We have systematically given these numbers in the revised manuscript (n, N, and other experimental parameters such as the number of animals used, coverslips, images or cells). Data are generally given as mean +/- SEM or as mean +/- SD as indicated.

      (5) Does mEos affect expression levels or function of the protein? Can't see any experiments done to confirm this. Could suggest WB on homogenate, or mass spec? 

      The Glrb<sup>eos/eos</sup> knock-in mouse line has been characterised previously and does not display any ultrastructural or functional deficits at inhibitory synapses (Maynard et al. 2021 eLife). GlyRβ expression and glycine-evoked responses were not significantly different to those of the wildtype. The synaptic localisation of mEos4b-GlyRb in KI animals demonstrates correct assembly of heteromeric GlyRs and synaptic targeting. Accordingly, the animals do not display any obvious phenotype. We have clarified this in the manuscript (p. 4). In the case of cultured neurons, long-term expression of fluorescent receptor subunits with lentivirus has proven ideal to achieve efficient synaptic targeting. The low and continuous supply of recombinant receptors ensures assembly with endogenous subunits to form heteropentameric receptor complexes (e.g. [Patrizio et al. 2017 Sci Rep]). In the present study, lentivirus infection did not induce any obvious differences in the number or size of inhibitory synapses compared to control neurons, as judged by Sylite labelling of synaptic gephyrin puncta (new Fig. 5D).

      (6) Quantification of protein numbers is challenging with SMLM. Issues include i) some of FP not correctly folded/mature, and ii) dependence of localisation rate on instrument, excitation/illumination intensities, and also the thresholds used in analysis. Can the authors compare with another protein that has known expression levels- e.g. PSD95? This is quite an ask, but if they could show copy number of something known to compare with, it would be useful. 

      We agree that absolute quantification with SMLM is challenging, since the number of detections depends on fluorophore maturation, photophysics, imaging conditions, and analysis thresholds (discussed in Patrizio & Specht 2016, Neurophotonics). For this reason, only very few datasets provide reliable copy numbers, even for well-studied proteins such as PSD-95. One notable exception is the study by Maynard et al. (eLife 2021) that quantified endogenous GlyRβ-containing receptors in spinal cord synapses using SMLM combined with correlative electron microscopy. The strength of this work was the use of a KI mouse strain, which ensures that mEos4b-GlyRβ expression follows intrinsic regional and temporal profiles. The authors reported a stereotypic density of ~2,000 GlyRs/µm² at synapses, corresponding to ~120 receptors per synapse in the dorsal horn and ~240 in the ventral horn, taking into account various parameters including receptor stoichiometry and the functionality of the fluorophore. These values are very close to our own calculations of GlyR numbers at spinal cord synapses that were obtained slightly differently in terms of sample preparation, microscope setup, imaging conditions, and data analysis, lending support to our experimental approach. Nevertheless, the obtained GlyR copy numbers at hippocampal synapses clearly have to be taken as estimates rather than precise figures, because the number of detections from a single mEos4b fluorophore can vary substantially, meaning that the fluorophores are not represented equally in pointillist images. This can affect the copy number calculation for a specific synapse, in particular when the numbers are low (e.g. in hippocampus); however, it should not alter the average number of detections (Fig. 1B) or the (median) molecule numbers of the entire population of synapses (Fig. 1C). We have discussed the limitations of our approach (p. 11).
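
      As a quick consistency check on the figures quoted from Maynard et al., one can assume that the receptor number per synapse N is simply the stereotypic density ρ multiplied by the synaptic area A. The areas below are therefore implied by the quoted density and copy numbers under that assumption; they are not values taken from the manuscript.

```latex
\[
N = \rho \, A
\quad\Longrightarrow\quad
A_{\text{dorsal}} \approx \frac{120}{2000\ \mu\text{m}^{-2}} = 0.06\ \mu\text{m}^{2},
\qquad
A_{\text{ventral}} \approx \frac{240}{2000\ \mu\text{m}^{-2}} = 0.12\ \mu\text{m}^{2}
\]
```

      Under this reading, the two-fold difference in copy number between dorsal and ventral horn synapses would reflect a two-fold difference in synaptic area at constant receptor density.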

      (7) Rationale for doing nanobody dSTORM not clear at all. They don't explain the reason for doing the dSTORM experiments. Why not just rely on PALM for coincidence measurements, rather than tagging mEoS with a nanobody, and then doing dSTORM with that? Can they explain? Is it to get extra localisations- i.e. multiple per nanobody? If so, localising same FP multiple times wouldn't improve resolution. Also, no controls for nanobody dSTORM experiments- what about non-spec nb, or use on WT sections? 

      As discussed above (point 6), the detection of fluorophores with SMLM is influenced by many parameters, not least the noise produced by emitting molecules other than the fluorophore used for labelling. Our study is exceptional in that it attempts to identify extremely low molecule numbers (down to 1). To verify that the detections obtained with PALM correspond to mEos4b, we conducted robust control experiments (including pixel-shift as suggested by the reviewer, see point 1, revised Fig. 1B). The rationale for the nanobody-based dSTORM experiments was twofold: (1) to have an independent readout of the presence of low-copy GlyRs at inhibitory synapses and (2) to analyse the nanoscale organisation of GlyRs relative to the synaptic gephyrin scaffold using dual-colour dSTORM with spectral demixing (see p. 6). The organic fluorophores used in dSTORM (AF647, CF680) ensure high photon counts, essential for reliable co-localisation and distance analysis. PALM and dSTORM cannot be combined in dual-colour mode, as they require different buffers and imaging conditions. 

      The specificity of the anti-Eos nanobody was demonstrated by immunohistochemistry in spinal cord cultures expressing mEos4b-GlyRb and wildtype control tissue (Fig. S3). In response to the reviewer's remarks, we also performed a negative control experiment in Glrb<sup>eos/eos</sup> slices (dSTORM), in which the nanobody was omitted (new Fig. S4F,G). Under these conditions, spectral demixing produced a single peak corresponding to CF680 (gephyrin) without any AF647 contribution (Fig. S4F). The number of "false" AF647 detections at synapses was significantly lower than in the slices labelled with the nanobody. We conclude that the fluorescence signal observed in our dual-colour dSTORM experiments arises from the specific detection of mEos4b-GlyRb by the nanobody, rather than from background, cross-reactivity or wrong attribution of colour during spectral demixing. We have added these data and explanations in the results (p. 7) and in the figure legend of Fig. S4F,G.

      (8) What resolutions/precisions were obtained in SMLM experiments? Should perform Fourier Ring Correlation (FRC) on SR images to state resolutions obtained (particularly useful for when they're presenting distance histograms, as this will be dependent on resolution). Likewise for precision, what was mean precision? Can they show histograms of localisation precision. 

      This is an interesting question in the context of our experiments with low-copy GlyRs, since the spatial resolution of SMLM is also limited by the density of molecules, i.e. the sampling of the structure in question (Nyquist-Shannon criterion). Accordingly, the priority of the PALM experiments was to improve the sensitivity of SMLM for the identification of mEos4b-GlyRb subunits, rather than to maximize the spatial resolution. The mean localisation precision in PALM was 33 +/- 12 nm, as calculated from the fitting parameters of each detection (Zeiss, ZEN software), which ultimately result from their signal-to-noise ratio. This is a relatively low precision for SMLM, which can be explained by the low brightness of mEos4b compared to organic fluorophores together with the elevated fluorescence background in tissue slices.

      In the case of dSTORM, the aim was to study the relative distribution of GlyRs within the synaptic scaffold, for which a higher localisation precision was required (p. 6). Therefore, detections with a precision ≥ 25 nm were filtered out during analysis with NEO software (Abbelight). The retained detections had a mean localisation precision of 12 +/- 5 nm for CF680 (Sylite) and 11 +/- 4 nm for AF647 (nanobody). These values are given in the revised manuscript (pp. 18, 22).
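
      For readers who want to reproduce this kind of filtering outside the vendor software, the step can be sketched as follows. This is a minimal illustration and not the NEO implementation: the detection tuples, the function name `filter_by_precision` and all numerical values except the 25 nm cut-off are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical detections as (x_nm, y_nm, precision_nm); values are illustrative only.
detections = [
    (102.0, 250.0, 9.5),
    (110.0, 243.0, 14.0),
    (480.0, 90.0, 31.0),   # too imprecise, discarded
    (105.0, 247.0, 12.5),
    (300.0, 310.0, 25.0),  # exactly at the cut-off, also discarded (>= 25 nm)
]

PRECISION_CUTOFF_NM = 25.0  # cut-off quoted in the response


def filter_by_precision(dets, cutoff_nm=PRECISION_CUTOFF_NM):
    """Keep detections whose localisation precision is strictly below the cut-off."""
    return [d for d in dets if d[2] < cutoff_nm]


retained = filter_by_precision(detections)
precisions = [d[2] for d in retained]

print(f"retained {len(retained)} of {len(detections)} detections")
print(f"mean precision: {mean(precisions):.1f} +/- {stdev(precisions):.1f} nm")
```

      The mean and standard deviation are computed only on the retained detections, which mirrors how the values for the two channels are reported above.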

      (9) Why were DBSCAN parameters selected? How can they rule out multiple localisations per fluor? If low copy numbers (<10), then why bother with DBSCAN? Could just measure distance to each one. 

      Multiple detections of the same fluorophore are intrinsic to dSTORM imaging and have not been eliminated from the analysis. Small clusters of detections likely represent individual molecules (e.g. single receptors in the extrasynaptic regions, Fig. 2A). DBSCAN is a robust clustering method that is quite insensitive to minor changes in the choice of parameters. For dSTORM of synaptic gephyrin clusters (CF680), a relatively low length (80 nm radius) together with a high number of detections (≥ 50 neighbours) were chosen to reconstruct the postsynaptic domain with high spatial resolution (see point 8). In the case of the GlyR (nanobody-AF647), the clustering was done mostly for practical reasons, as it provided the coordinates of the centre of mass of the detections. The low stringency of this clustering (200 nm radius, ≥ 5 neighbours) effectively filters single detections that can result from background noise or incorrect demixing. An additional reference explaining the use of DBSCAN including the choice of parameters is given on p. 22 (see also R2 point 4).
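
      The effect of the low-stringency clustering can be illustrated with a small, self-contained DBSCAN sketch. It is not the NEO implementation. We assume here that a detection counts as a core point when at least `min_neighbours` other detections lie within `radius`; the software's own convention may differ, and the coordinates are hypothetical.

```python
import math


def region_query(points, i, radius):
    """Indices of all other points within `radius` of point i."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if j != i and math.hypot(xi - xj, yi - yj) <= radius]


def dbscan(points, radius, min_neighbours):
    """Minimal DBSCAN: returns one label per point (-1 = noise, 0, 1, ... = cluster id)."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, radius)
        if len(neighbours) < min_neighbours:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins the cluster, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, radius)
            if len(j_neighbours) >= min_neighbours:
                queue.extend(j_neighbours)  # core point: keep growing the cluster
    return labels


def centre_of_mass(points, labels, cluster):
    """Mean x/y position of all detections assigned to `cluster`."""
    members = [p for p, lab in zip(points, labels) if lab == cluster]
    n = len(members)
    return (sum(x for x, _ in members) / n, sum(y for _, y in members) / n)


# Hypothetical AF647 detections (nm): six from one fluorophore, one isolated background hit.
af647 = [(20, 10), (-20, -10), (10, -20), (-10, 20), (5, 5), (-5, -5), (1000, 1000)]

labels = dbscan(af647, radius=200, min_neighbours=5)  # GlyR parameters from the response
print(labels)
print(centre_of_mass(af647, labels, cluster=0))
```

      With these parameters the isolated detection is labelled as noise, while the group of six detections yields a single centre of mass. The gephyrin channel would use the stricter settings quoted above (80 nm, ≥ 50 neighbours).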

      (10) For microscopy experiment methods, state power densities, not % or "nominal power". 

      Done. We now report the irradiance (laser power density) instead of nominal power (pp. 18, 21). 

      (11) In general, not much data presented. Any SI file with extra images etc.? 

      The original submission included four supplementary figures with additional data and representative images that should have been available to the reviewer (Figs. S1-S4). The SI file has been updated during revision (new Fig. S4E-G). 

      (12) Clarification of the discussion on GlyR expression and synaptic localization: The discussion on GlyR expression, complex formation, and synaptic localization is sometimes unclear, and needs terminological distinctions between "expression level", "complex formation" and "synaptic localization". For example, the authors state:"What then is the reason for the low protein expression of GlyRβ? One possibility is that the assembly of mature heteropentameric GlyR complexes depends critically on the expression of endogenous GlyR α subunits." Does this mean that GlyRβ proteins that fail to form complexes with GlyRα subunits are unstable and subject to rapid degradation? If so, the authors should clarify this point. The statement "This raises the interesting possibility that synaptic GlyRs may depend specifically on the concomitant expression of both α1 and β transcripts." suggests a dependency on α1 and β transcripts. However, is the authors' focus on synaptic localization or overall protein expression levels? If this means synaptic localization, it would be beneficial to state this explicitly to avoid confusion. To improve clarity, the authors should carefully distinguish between these different aspects of GlyR biology throughout the discussion. Additionally, a schematic diagram illustrating these processes would be highly beneficial for readers. 

      We thank the reviewer for pointing this out. We are dealing with several processes: protein expression, which determines subunit availability and the assembly of pentameric GlyR complexes, as well as surface expression, membrane diffusion and accumulation of GlyRb-containing receptor complexes at inhibitory synapses. We have edited the manuscript, particularly the discussion, and have tried to be as clear as possible in our wording.

      We chose not to add a schematic illustration for the time being, because any graphical representation is necessarily a simplification. Instead, we preferred to summarise the main numbers in tabular form (Table 1). We are of course open to any other suggestions.

      (13) Interpretation of GlyR localization in the context of nanodomains. The distribution of GlyR molecules on inhibitory synapses appears to be non-homogeneous, instead forming nanoclusters or nanodomains, similar to many other synaptic proteins. It is important to interpret GlyR localization in the context of nanodomain organization. 

      The dSTORM images in Fig. 2 are pointillist representations that show individual detections rather than molecules. Small clusters of detections are likely to originate from a single AF647 fluorophore (in the case of nanobody labelling) and therefore represent single GlyRb subunits. Since GlyR copy numbers are so low at hippocampal synapses (≤ 5), the notion of nanodomain is not directly applicable. Our analysis therefore focused on the integration of GlyRs within the postsynaptic scaffold, rather than attempting to define nanodomain structures (see also response to point 8 of R1). A clarification has been added in the revised manuscript (p. 6).

      Reviewer #1 (Significance): 

      The paper presents biological and technical advances. The biological insights revolve mostly on the documentation of Glycine receptors in particular synapses in forebrain, where they are typically expressed at very low levels. The authors provide compelling data indicating that the expression is of physiological significance. The authors have done a nice job of combining genetically-tagged mice with advanced microscopy methods to tackle the question of distributions of synaptic proteins. Overall these advances are more incremental than groundbreaking. 

      We thank the reviewer for acknowledging both the technical and biological advances of our study. While we recognize that our work builds upon established models, we consider that it also addresses important unresolved questions, namely that GlyRs are present and specifically anchored at inhibitory synapses in telencephalic regions, such as the hippocampus and striatum. From a methodological point of view, our study demonstrates that SMLM can be applied not only for structural analysis of highly abundant proteins, but also to reliably detect proteins present at very low copy numbers. This ability to identify and quantify sparse molecule populations adds a new dimension to SMLM applications, which we believe increases the overall impact of our study beyond the field of synaptic neuroscience.

      Reviewer #2 (Evidence, reproducibility and clarity): 

      In their manuscript "Single molecule counting detects low-copy glycine receptors in hippocampal and striatal synapses" Camuso and colleagues apply single molecule localization microscopy (SMLM) methods to visualize low copy numbers of GlyRs at inhibitory synapses in the hippocampal formation and the striatum. SMLM analysis revealed higher copy numbers in striatum compared to hippocampal inhibitory synapses. They further provide evidence that these low copy numbers are tightly linked to post-synaptic scaffolding protein gephyrin at inhibitory synapses. Their approach profits from the high sensitivity and resolution of SMLM and challenges the controversial view on the presence of GlyRs in these formations although there are reports (electrophysiology) on the presence of GlyRs in these particular brain regions. These new datasets in the current manuscript may certainly assist in understanding the complexity of fundamental building blocks of inhibitory synapses. 

      However I have some minor points that the authors may address for clarification: 

      (1) In Figure 1 the authors apply PALM imaging of mEos4b-GlyRß (knockin) and here the corresponding Sylite label seems to be recorded in widefield, it is not clearly stated in the figure legend if it is widefield or super-resolved. In Fig 1 A - is the scale bar 5 µm? Some Sylite spots appear to be sized around 1 µm, especially the brighter spots, but maybe this is due to the lower resolution of widefield imaging? Regarding the statistical comparison: what method was chosen to test for normality distribution, I think this point is missing in the methods section. 

      This is correct; the apparent size of the Sylite spots does not reflect the real size of the synaptic gephyrin domain due to the limited resolution of widefield imaging, including the detection of out-of-focus light. We have clarified in the legend of Fig. 1A that Sylite labelling was imaged with classic epifluorescence microscopy. The scale bar in Fig. 1A corresponds to 5 µm. Since the data were not normally distributed, nonparametric tests (Kruskal-Wallis one-way ANOVA with Dunn’s multiple comparison test or Mann-Whitney U-test for pairwise comparisons) were used (p. 23). 

      Moreover I would appreciate a clarification and/or citation that the knockin model results in no structural and physiological changes at inhibitory synapses, I believe this model has been applied in previous studies and corresponding clarification can be provided. 

      The Glrb<sup>eos/eos</sup> mouse model has been described previously and does not exhibit any structural or physiological phenotypes (Maynard et al. 2021 eLife). The issue was also raised by reviewer R1 (point 5) and has been clarified in the revised manuscript (p. 4).

      (2) In the next set of experiments the authors switch to demixing dSTORM experiments - an explanation why this is performed is missing in the text - I guess better resolution to perform more detailed distance measurements? For these experiments: which region of the hippocampus did the authors select, I cannot find this information in legend or main text. 

      Yes, the dSTORM experiments enable dual-colour structural analysis at high spatial resolution (see response to R1 point 7). An explanation has been added (p. 6).

      (3) Regarding parameters of demixing experiments: the number of frames (10,000) seems quite low and the exposure time higher than expected for Alexa 647. Can the authors explain the reason for choosing these particular parameters (low expression profile of the target - so better separation?, less fluorophores on label and shorter collection time?) or is there a reference that can be cited? The laser power is given in the methods in percentage of maximal output power, but for better comparison and reproducibility I recommend to provide the values of a power meter (kW/cm²) as lasers may change their maximum output power during their lifetime. 

      Acquisition parameters (laser power, exposure time) for dSTORM were chosen to obtain a good localisation precision (~12 nm; see R1 point 8). The number of frames is adequate to obtain well sampled gephyrin scaffolds in the CF680 channel. In the case of the GlyR (nanobody-AF647), the concept of spatial resolution does not really apply due to the low number of targets (see R1, point 13). Power density (irradiance) values have now been given (pp. 18, 21).

      (4) For analysis of subsynaptic distribution: how did the authors decide to choose the parameters in the NEO software for DBSCAN clustering - was a series of parameters tested to find optimal conditions and did the analysis start with an initial test if data is indeed clustered (K-ripley) or is there a reference in literature that can be provided? 

      DBSCAN parameters were optimised manually, by testing different values. Identification of dense and well-delimited gephyrin clusters (CF680) was achieved with a small radius and a high number of detections (80 nm, ≥ 50 neighbours), whereas filtering of low-density background in the AF647 channel (GlyRs) required less stringent parameters (200 nm, ≥ 5) due to the low number of target molecules. Similar parameters were used in a previous publication (Khayenko et al. 2022, Angewandte Chemie). The reference has been provided on p. 22 (see also R1 point 9).

      (5) A conclusion/discussion of the results presented in Figure 5 is missing in the text/discussion. 

      This part of the manuscript has been completely overhauled. It includes new experimental data, quantification of the data (new Fig. 5), as well as the discussion and interpretation of our findings (see also R1, point 3). In agreement with our earlier interpretation, the data confirm that low availability of GlyRa1 subunits limits the expression and synaptic targeting of GlyRa1/b heteropentamers. The observation that GlyRa1 overexpression with lentivirus increases the size of the postsynaptic gephyrin domain further points to a structural role, whereby GlyRs can enhance the stability (and size) of inhibitory synapses in hippocampal neurons, even at low copy numbers (pp. 13-14). 

      (6) In line 552 "suspension" is misleading, better use "solution" 

      Done.

      Reviewer #2 (Significance): 

      Significance: The manuscript provides new insights to presence of low-copy numbers by visualizing them via SMLM. This is the first report that visualizes GlyR optically in the brain applying the knock-in model of mEOS4b tagged GlyRß and quantifies their copy number comparing distribution and amount of GlyRs from hippocampus and striatum. Imaging data correspond well to electrophysiological measurements in the manuscript. 

      Field of expertise: Super-Resolution Imaging and corresponding analysis 

      Reviewer #4 (Evidence, reproducibility and clarity): 

      In this study, Camuso et al., make use of a knock-in mouse model expressing endogenously mEos4b-tagged GlyRβ to detect endogenous glycine receptors using single-molecule localization microscopy. The main conclusion from this study is that in the hippocampus GlyRβ molecules are barely detected, while inhibitory synapses in the ventral striatum seem to express functionally relevant GlyR numbers. 

      I have a few points that I hope help to improve the strength of this study. 

      - In the hippocampus, this study finds that the numbers of detections are very low. The authors perform adequate controls to indicate that these localizations are above noise level. Nevertheless, it remains questionable that these reflect proper GlyRs. The suggestion that in hippocampal synapses the low numbers of GlyRβ molecules "are important in assembly or maintenance of inhibitory synaptic structures in the brain" is on itself interesting, but is not at all supported. It is also difficult to envision how such low numbers could support the structure of a synapse. A functional experiment showing that knockdown of GlyRs affects inhibitory synapse structure in hippocampal neurons would be a minimal test of this. 

      It is not clear what the reviewer means by “it remains questionable that these reflect proper GlyRs”. The PALM experiments include a series of stringent controls (see R1, point 1) demonstrating the existence of low-copy GlyRs at inhibitory synapses in the hippocampus (Fig. 1) and in the striatum (Fig. 3), and are backed up by dSTORM experiments (Fig. 2). We have no reason to doubt that these receptors are fully functional (as demonstrated for the ventral striatum, Fig. 4). However, due to their low number, a role in inhibitory synaptic transmission is clearly limited, at least in the hippocampus and dorsal striatum. 

      We therefore propose a structural role, where the GlyRs could be required to stabilise the postsynaptic gephyrin domain in hippocampal neurons. This is based on the idea that the GlyR-gephyrin affinity is much higher than that of the GABAAR-gephyrin interaction (reviewed in Kasaragod & Schindelin 2018 Front Mol Neurosci). Accordingly, there is a close relationship between GlyRs and gephyrin numbers, sub-synaptic distribution, and dynamics in spinal cord synapses that are mostly glycinergic (Specht et al. 2013 Neuron; Maynard et al. 2021 eLife; Chapdelaine et al. 2021 Biophys J). It is reasonable to assume that low-copy GlyRs could play a similar structural role at hippocampal synapses. A knockdown experiment targeting these few receptors is technically very challenging and beyond the scope of this study. However, in response to the reviewer's question we have conducted new experiments in cultured hippocampal neurons (new Fig. 5). They demonstrate that overexpression of GlyRa1/b heteropentamers increases the size of the postsynaptic domain in these neurons, supporting our interpretation of a structural role of low-copy GlyRs (p. 14).

      - The endogenous tagging strategy is a very strong aspect of this study and provides confidence in the labeling of GlyRβ molecules. One caveat however, is that this labeling strategy does not discriminate whether GlyRβ molecules are on the cell membrane or in internal compartments. Can the authors provide an estimate of the ratio of surface to internal GlyRβ molecules? 

      Gephyrin is known to form a two-dimensional scaffold below the synaptic membrane to which inhibitory GlyRs and GABAARs attach (reviewed in Alvarez 2017 Brain Res). The majority of the synaptic receptors are therefore thought to be located in the synaptic membrane, which is supported by the close relationship between the sub-synaptic distribution of GlyRs and gephyrin in spinal cord neurons (e.g. Maynard et al. 2021 eLife). To demonstrate the surface expression of GlyRs at hippocampal synapses we labelled cultured hippocampal neurons expressing mEos4b-GlyRa1 with anti-Eos nanobody in non-permeabilised neurons (see Author response image 1). The close correspondence between the nanobody (AF647) and the mEos4b signal confirms that the majority of the GlyRs are indeed located in the synaptic membrane.

      Author response image 1.

      Left: Lentivirus expression of mEos4b-GlyRa1 in fixed and non-permeabilised hippocampal neurons (mEos4b signal). Right: Surface labelling of the recombinant subunit with anti-Eos nanobody (AF647). 

      - “We also estimated the absolute number of GlyRs per synapse in the hippocampus. The number of mEos4b detections was converted into copy numbers by dividing the detections at synapses by the average number of detections of individual mEos4b-GlyRβ containing receptor complexes”. In essence this is a correct method to estimate copy numbers, and the authors discuss some of the pitfalls associated with this approach (i.e., maturation of fluorophore and detection limit). Nevertheless, the authors did not subtract the number of background localizations determined in the two negative control groups. This is critical, particularly at these low-number estimations. 

      We fully agree that background subtraction can be useful with low detection numbers. In the revised manuscript, copy numbers are now reported as background-corrected values. Specifically, the mean number of detections measured in wildtype slices was used to calculate an equivalent receptor number, which was then subtracted from the copy number estimates across hippocampus, spinal cord and striatum. This procedure is described in the methods (p. 20) and results (p. 5, 8), and mentioned in the figure legends of Fig. 1C, 3C. The background corrected values are given in the text and Table 1.
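
      The background correction described above amounts to a simple calculation, sketched here under stated assumptions. The function names and the example numbers are hypothetical and do not correspond to values in the manuscript; the only elements taken from the text are the conversion (detections divided by the average number of detections per receptor complex) and the subtraction of the wildtype-equivalent receptor number.

```python
def copy_number(detections, detections_per_receptor):
    """Convert a number of mEos4b detections into an equivalent number of receptors."""
    return detections / detections_per_receptor


def background_corrected_copy_number(synaptic_detections,
                                     wildtype_mean_detections,
                                     detections_per_receptor):
    """Subtract the receptor-equivalent background measured in wildtype slices."""
    raw = copy_number(synaptic_detections, detections_per_receptor)
    background = copy_number(wildtype_mean_detections, detections_per_receptor)
    return raw - background


# Hypothetical example values (not taken from the manuscript).
corrected = background_corrected_copy_number(
    synaptic_detections=30,       # detections at one synapse in a knock-in slice
    wildtype_mean_detections=6,   # mean detections per synapse in wildtype slices
    detections_per_receptor=12,   # average detections per single receptor complex
)
print(corrected)
```

      Because the same conversion factor is applied to signal and background, the correction is equivalent to subtracting the mean wildtype detections before dividing. Individual synapses can yield slightly negative values after subtraction, which is why population medians are the more meaningful readout at low copy numbers.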

      - Furthermore, the authors state that "The advantage of this estimation is that it is independent of the stoichiometry of heteropentameric GlyRs". However, if the stoichiometry is unknown, the number of counted GlyRβ subunits cannot simply be reported as the number of GlyRs. This should be discussed in more detail, and more carefully reported throughout the manuscript. 

      The reviewer is right to point this out. There is still some debate about the stoichiometry of heteropentameric GlyRs. Configurations with 2a:3b, 3a:2b and 4a:1b subunits have been advanced (e.g. Grudzinska et al. 2005 Neuron; Durisic et al. 2012 J Neurosci; Patrizio et al. 2017 Sci Rep; Zhu & Gouaux 2021 Nature). We have therefore chosen a quantification that is independent of the underlying stoichiometry. Since our quantification is based on very sparse clusters of mEos4b detections that likely originate from a single receptor complex (irrespective of its stoichiometry), the reported values actually reflect the number of GlyRs (and not GlyRb subunits). We have clarified this in the results (p. 5) and throughout the manuscript (Table 1). 

      - The dual-color imaging provides insights in the subsynaptic distribution of GlyRβ molecules in hippocampal synapses. Why are similar studies not performed on synapses in the ventral striatum where functionally relevant numbers of GlyRβ molecules are found? Here insights in the subsynaptic receptor distribution would be of much more interest as it can be tied to function. 

      This is an interesting suggestion. However, the primary aim of our study was to identify the existence of GlyRs in hippocampal regions. At low copy numbers, the concept of sub-synaptic domains (SSDs, e.g. Yang et al. 2021 EMBO Rep) becomes irrelevant (see R1 point 13). It should be pointed out that the dSTORM pointillist images (Fig. 2A) represent individual GlyR detections rather than clusters of molecules. In the striatum, our specific purpose was to solve an open question about the presence of GlyRs in different subregions (putamen, nucleus accumbens).

      - It is unclear how the experiments in Figure 5 add to this study. These results are valid, but do not seem to directly test the hypothesis that "the expression of α subunits may be limiting factor controlling the number of synaptic GlyRs". These experiments simply test if overexpressed α subunits can be detected. If the α subunits are limiting, measuring the effect of α subunit overexpression on GlyRβ surface expression would be a more direct test. 

      Both R1 and R2 have also commented on the data in Fig. 5 and their interpretation. We have substantially revised this section as described before (see R1 point 3) including additional experiments and quantification of the data (new Fig. 5). The findings lend support to our earlier hypothesis that GlyR alpha subunits (in particular GlyRa1) are the limiting factor for the expression of heteropentameric GlyRa/b in hippocampal neurons (pp. 13-14). Since the GlyRa1 subunit itself does not bind to gephyrin (Patrizio et al. 2017 Sci Rep), the synaptic localisation of the recombinant mEos4b-GlyRa1 subunits is proof that they have formed heteropentamers with endogenous GlyRb subunits and driven their membrane trafficking, which the GlyRb subunits are incapable of doing on their own.

      Reviewer #4 (Significance): 

      These results are based on carefully performed single-molecule localization experiments, and are well-presented and described. The knockin mouse with endogenously tagged GlyRβ molecules is a very strong aspect of this study and provides confidence in the labeling, the combination with single-molecule localization microscopy is very strong as it provides high sensitivity and spatial resolution. 

      The conceptual innovation however seems relatively modest, these results confirm previous studies but do not seem to add novel insights. This study is entirely descriptive and does not bring new mechanistic insights. 

      This study could be of interest to a specialized audience interested in glycine receptor biology, inhibitory synapse biology and super-resolution microscopy. 

      My expertise is in super-resolution microscopy, synaptic transmission and plasticity 

      As we have stated before, the novelty of our study lies in the use of SMLM for the identification of very small numbers of molecules, which requires careful control experiments. This is something that has not been done before and that can be of interest to a wider readership, as it opens up SMLM for ultrasensitive detection of rare molecular events. Using this approach, we solve two open scientific questions: (1) the demonstration that low-copy GlyRs are present at inhibitory synapses in the hippocampus, (2) the sub-region specific expression and functional role of GlyRs in the ventral versus dorsal striatum.

      The following review was provided later under the name “Reviewer #4”. To avoid confusion with the last reviewer from above we will refer to this review as R4-2.

      Reviewer #4-2 (Evidence, reproducibility and clarity):  

      Summary:

      Provide a short summary of the findings and key conclusions (including methodology and model system(s) where appropriate).

      The authors investigate the presence of synaptic glycine receptors in the telencephalon, whose presence and function is poorly understood. 

      Using a transgenically labeled glycine receptor beta subunit (Glrb-mEos4b) mouse model together with super-resolution microscopy (SMLM, dSTORM), they demonstrate the presence of a low but detectable amount of synaptically localized GLRB in the hippocampus. While they do not perform a functional analysis of these receptors, they do demonstrate that these subunits are integrated into the inhibitory postsynaptic density (iPSD) as labeled by the scaffold protein gephyrin. These findings demonstrate that a low level of synaptically localized glycine receptor subunits exist in the hippocampal formation, although whether or not they have a functional relevance remains unknown.

      They then proceed to quantify synaptic glycine receptors in the striatum, demonstrating that the ventral striatum has a significantly higher amount of GLRB co-localized with gephyrin than the dorsal striatum or the hippocampus. They then recorded pharmacologically isolated glycinergic miniature inhibitory postsynaptic currents (mIPSCs) from striatal neurons. In line with their structural observations, these recordings confirmed the presence of synaptic glycinergic signaling in the ventral striatum, and an almost complete absence in the dorsal striatum. Together, these findings demonstrate that synaptic glycine receptors in the ventral striatum are present and functional, while an important contribution to dorsal striatal activity is less likely.

      Lastly, the authors use existing mRNA and protein datasets to show that the expression level of GLRA1 across the brain positively correlates with the presence of synaptic GLRB.

      The authors use lentiviral expression of mEos4b-tagged glycine receptor alpha1, alpha2, and beta subunits (GLRA1, GLRA2, GLRB) in cultured hippocampal neurons to investigate the ability of these subunits to cause the synaptic localization of glycine receptors. They suggest that the alpha1 subunit has a higher propensity to localize at the inhibitory postsynapse (labeled via gephyrin) than the alpha2 or beta subunits, and may therefore contribute to the distribution of functional synaptic glycine receptors across the brain.

      Major comments:

      - Are the key conclusions convincing?

      The authors are generally precise in the formulation of their conclusions.

      (1) They demonstrate a very low, but detectable, amount of a synaptically localized glycine receptor subunit in a transgenic (GlrB-mEos4b) mouse model. They demonstrate that the GLRB-mEos4b fusion protein is integrated into the iPSD as determined by gephyrin labelling. The authors do not perform functional tests of these receptors and do not state any such conclusions.

      (2) The authors show that GLRB-mEos4b is clearly detectable in the striatum and integrated into gephyrin clusters at a significantly higher rate in the ventral striatum compared to the dorsal striatum, which is in line with previous studies.

      (3) Adding to their quantification of GLRB-mEos4b in the striatum, the authors demonstrate the presence of glycinergic miniature IPSCs in the ventral striatum, and an almost complete absence of mIPSCs in the dorsal striatum. These currents support the observation that GLRB-mEos4b is more synaptically integrated in the ventral striatum compared to the dorsal striatum.

      (4) The authors show that lentiviral expression of GLRA1-mEos4b leads to a visually higher number of GLR clusters in cultured hippocampal neurons, and a co-localization of some clusters with gephyrin. The authors claim that this supports the idea that GLRA1 may be an important driver of synaptic glycine receptor localization. However, no quantification or statistical analysis of the number of puncta or their colocalization with gephyrin is provided for any of the expressed subunits. Such a claim should be supported by quantification and statistics 

      A thorough analysis and quantification of the data in Fig. 5 has been carried out as requested by all the other reviewers (e.g. R1, point 3). The new data and results have been described in the revised manuscript (pp. 9-10, 13-14).

      - Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?

      One unaddressed caveat is the fact that a GLRB-mEos4b fusion protein may behave differently in terms of localization and synaptic integration than wild-type GLRB. While unlikely, it is possible that mEos4b interacts either with itself or synaptic proteins in a way that changes the fused GLRB subunit’s localization. Such an effect would be unlikely to affect synaptic function in a measurable way, but might be detected at a structural level by highly sensitive methods such as SMLM and STORM in regions with very low molecule numbers (such as the hippocampus). Since reliable antibodies against GLRB in brain tissue sections are not available, this would be difficult to test. Considering that no functional measures of the hippocampal detections exist, we would suggest that this possible caveat be mentioned for this particular experiment.

      This question has also been raised before (R1, point 5). According to an earlier study the mEos4b-GlyRb knock-in does not cause any obvious phenotypes, with the possible exception of minor loss of glycine potency (Maynard et al. 2021 eLife). The fact that the synaptic levels in the spinal cord in heterozygous animals are precisely half of those of homozygous animals argues against differences in receptor expression, heteropentameric assembly, forward trafficking to the plasma membrane and integration into the synaptic membrane as confirmed using quantitative super-resolution CLEM (Maynard et al. 2021 eLife). Accordingly, we did not observe any behavioural deficits in these animals, making it a powerful experimental model. We have added this information in the revised manuscript (p. 4). 

      In addition, without any quantification or statistical analysis, the authors’ claims regarding the necessity of GLRA1 expression for the synaptic localization of glycine receptors in cultured hippocampal neurons should probably be described as preliminary (Fig. 5).

      As mentioned before, we have substantially revised this part (R1, point 3). The quantification and analysis in the new Fig. 5 support our earlier interpretation.

      - Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      The authors show that there is colocalization of gephyrin with the mEos4b-GlyRβ subunit using the Dual-colour SMLM. This is a powerful approach that allows for a claim to be made on the synaptic location of the glycine receptors. The images presented in Figure 1, together with the distance analysis in Figure 2, display the co-localization of the fluorophores. The co-localization images in all the selected regions, hippocampus and striatum, also show detections outside of the gephyrin clusters, which the authors refer to as extrasynaptic. These punctated small clusters seem to have the same size as the ones detected and assigned as part of the synapse. It would be informative if the authors analysed the distribution, density and size of these nonsynaptic clusters and presented the data in the manuscript and also compared it against the synaptic ones. Validating this extrasynaptic signal by staining for a dendritic marker, such as MAP-2 or maybe a somatic marker and assessing the co-localization with the non-synaptic clusters would also add even more credibility to them being extrasynaptic. 

      The existence of extrasynaptic GlyRs is well attested in spinal cord neurons (e.g. Specht et al. 2013 Neuron; this study see Fig. S2). The fact that these appear as small clusters of detections in SMLM recordings results from the fact that a single fluorophore can be detected several times in consecutive image frames and because of blinking. Therefore, small clusters of detections likely represent single GlyRs (that can be counted), and not assemblies of several receptor complexes. Due to their diffusion in the neuronal membrane, they are seen as diffuse signals throughout the somatodendritic compartment in epifluorescence images (e.g. Fig. 5A). SMLM recordings of the same cells resolve this diffuse signal into discrete nanoclusters representing individual receptors (Fig. 5B). It is not clear what information co-localisation experiments with specific markers could provide, especially in hippocampal neurons, in which the copy numbers (and density) of GlyRs are next to zero.
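
      The point that repeated detections of one blinking fluorophore appear as a small cluster can be illustrated with a minimal sketch. This is not the authors' analysis pipeline: the function name `merge_detections`, the distance threshold `max_dist_nm`, and the dark-time tolerance `max_dark_frames` are hypothetical choices made only for illustration. Detections that fall close together in space and time are greedily grouped, and each group is treated as one putative molecule.

```python
import math


def merge_detections(detections, max_dist_nm=50.0, max_dark_frames=5):
    """Greedily group (frame, x_nm, y_nm) detections into putative molecules.

    A detection joins the nearest existing group whose centroid lies within
    max_dist_nm and whose last detection is at most max_dark_frames old;
    otherwise it starts a new group. Thresholds are illustrative assumptions.
    """
    groups = []
    for frame, x, y in sorted(detections):
        best, best_d = None, max_dist_nm
        for g in groups:
            if frame - g["last_frame"] > max_dark_frames:
                continue  # group has been dark for too long
            cx, cy = g["sum_x"] / g["n"], g["sum_y"] / g["n"]
            d = math.hypot(x - cx, y - cy)
            if d <= best_d:
                best, best_d = g, d
        if best is None:
            groups.append({"last_frame": frame, "sum_x": x, "sum_y": y, "n": 1})
        else:
            best["last_frame"] = frame
            best["sum_x"] += x
            best["sum_y"] += y
            best["n"] += 1
    return groups


# Toy data: two blinking emitters plus one late reappearance at the first site.
toy = [(0, 100, 100), (1, 105, 98), (4, 102, 103),
       (0, 500, 500), (2, 498, 503), (50, 100, 100)]
groups = merge_detections(toy)
print(len(groups), sorted(g["n"] for g in groups))
```

      In this toy example the late reappearance is counted as a separate molecule because it exceeds the assumed dark-time tolerance, which shows how the choice of thresholds affects the resulting counts.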

      In addition we would encourage the authors to quantify the clustering and co-localization of virally expressed GLRA1, GLRA2, and GLRB with gephyrin in order to support the associated claims (Fig. 5). Preferably, the density of GLR and gephyrin clusters (at least on the somatic surface, the proximal dendrites, or both) as well as their co-localization probability should be quantified if a causal claim about subunit-specific requirements for synaptic localization is to be made.

      Quantification of the data has been carried out (new Fig.5C,D). The results have been described before (R1, point 3) and support our earlier interpretation of the data (pp. 13-14).
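
      One simple way to express the requested co-localization probability is the fraction of receptor clusters lying within a fixed distance of any gephyrin cluster. The sketch below is only an assumption about how such a measure could be computed; it is not the method used for the new Fig. 5C,D, and the function name `colocalized_fraction` and the threshold `max_dist_nm` are hypothetical.

```python
import math


def colocalized_fraction(receptor_xy, gephyrin_xy, max_dist_nm=200.0):
    """Fraction of receptor cluster centroids within max_dist_nm of any
    gephyrin cluster centroid. The threshold is an illustrative assumption."""
    if not receptor_xy:
        return 0.0
    hits = 0
    for rx, ry in receptor_xy:
        if any(math.hypot(rx - gx, ry - gy) <= max_dist_nm
               for gx, gy in gephyrin_xy):
            hits += 1
    return hits / len(receptor_xy)


# Toy centroids in nm: three receptor clusters near gephyrin, one far away.
receptors = [(0, 0), (1000, 0), (150, 0), (5000, 5000)]
gephyrin = [(100, 0), (1100, 50)]
fraction = colocalized_fraction(receptors, gephyrin)
print(fraction)
```

      Computing this fraction per cell for each expressed subunit would give values that can be compared statistically across GLRA1, GLRA2, and GLRB.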

      Lastly, even though it may be outside the scope of such a study, analysing other parts of the hippocampal area could provide additional important information. If one looks at the Allen Institute’s ISH of the beta subunit, the strongest signal comes from the stratum oriens in CA1, for example, suggesting that interneurons residing there are more likely to have a higher expression of glycine receptors. This could also be assessed by looking more carefully at the single-cell transcriptomics, to see which cell types in the hippocampus show the highest mRNA levels. If the authors think that this is too much additional work, then perhaps a mention of this in the discussion would be good.

      We have added the requested information from the ISH database of the Allen Institute in the discussion as suggested by the reviewer (p. 12). However, in combination with the transcriptomic data (Fig. S1) our findings strongly suggest that the expression of synaptic GlyRs depends on the availability of alpha subunits rather than on the presence of the GlyRb transcript. This is obvious when one compares the mRNA levels in the hippocampus with those in the basal ganglia (striatum) and medulla. While the transcript concentrations of GlyRb are elevated in all three regions and essentially the same, our data show that the GlyRb copy numbers at synapses differ over more than 2 orders of magnitude (Fig. 1B, Table 1).

      - Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments.

      Since the labeling and some imaging has been performed already, the requested experiment would be a matter of deploying a method of quantification. In principle, it should not require any additional wet-lab experiments, although it may require additional imaging of existing samples.

      - Are the data and the methods presented in such a way that they can be reproduced?

      Yes, for the most part.

      - Are the experiments adequately replicated and statistical analysis adequate?

      Yes

      Minor comments:

      - Specific experimental issues that are easily addressable.

      N/A

      - Are prior studies referenced appropriately?

      Yes

      - Are the text and figures clear and accurate?

      Yes, although quantification in figure 5 is currently not present.

      A quantification has been added (see R1, point 3).

      - Do you have suggestions that would help the authors improve the presentation of their data and conclusions?

      This paper presents a method that could be used to localize receptors and perhaps other proteins that are in low abundance or for which a detailed quantification is necessary. I would therefore suggest that Figure S4 is included into Figure 2 as the first panel, showcasing the demixing, followed by the results. 

      We agree in principle with this suggestion. However, the revised Fig. S4 is more complex and we think that it would distract from the data shown in Fig. 2. Given that Fig. S4 is mostly methodological and not essential to understand the text, we have kept it in the supplement for the time being. We leave the final decision on this point to the editor.

      Reviewer #4-2 (Significance): 

      [This review was supplied later]

      - Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field.

      Using a novel and high-resolution method, the authors have provided strong evidence for the presence of glycine receptors in the murine hippocampus and in the dorsal striatum. The number of receptors calculated is small compared to the numbers found in the ventral striatum. This is the first study to quantify receptor numbers in these regions. In addition, it lays out a roadmap for future studies addressing similar questions.

      - Place the work in the context of the existing literature (provide references, where appropriate).

      This is done well by the authors in the curation of the literature. As stated above, the authors have filled a gap in knowledge about the presence of glycine receptors in different brain regions, a subject of importance in understanding the role they play in brain activity and function.

      - State what audience might be interested in and influenced by the reported findings.

      Neuroscientists working at the synaptic level, on inhibitory neurotransmission and on fundamental mechanisms of expression of genes at low levels and their relationship to the presence of the protein would be interested. Furthermore, researchers in neuroscience and cell biology may benefit from and be inspired by the approach used in this manuscript, to potentially apply it to address their own aims. 

      We thank the reviewer for the positive assessment of the technical and biological implications of our work, as well as the interest of our findings to a wide readership of neuroscientists and cell biologists. 

      - Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.

      Synaptic transmission, inhibitory cells and GABAergic synapses functionally and structurally, cortex and cortical circuits. No strong expertise in super-resolution imaging methods.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review): 

      Summary: 

      This very thorough anatomical study addresses the innervation of the Drosophila male reproductive tract. Two distinct glutamatergic neuron types were classified: serotonergic (SGNs) and octopaminergic (OGNs). By expansion microscopy, it was established that glutamate and serotonin/octopamine are co-released. The expression of different receptors for 5-HT and OA in muscles and epithelial cells of the innervation target organs was characterized. The pattern of neurotransmitter receptor expression in the target organs suggests that seminal fluid and sperm transport and emission are subjected to complex regulation. While silencing of abdominal SGNs leads to male infertility and prevents sperm from entering the ejaculatory duct, silencing of OGNs does not render males infertile.

      Strengths: 

      The studied neurons were analysed with different transgenes and methods, as well as antibodies against neurotransmitter synthesis enzymes, building a consistent picture of their neurotransmitter identity. The careful anatomical description of innervation patterns together with receptor expression patterns of the target organs provides a solid basis for advancing the understanding of how seminal fluid and sperm transport and emission are subjected to complex regulation. The functional data showing that SGNs are required for male fertility and for the release of sperm from the seminal vesicle into the ejaculatory duct are convincing.

      Weaknesses: 

      The functional analysis of the characterized neurons is not as comprehensive as the anatomical description, and phenotypic characterization was limited to simple fertility assays. It is understandable that a full functional dissection is beyond the scope of the present work. The paper contains experiments showing neuron-independent peristaltic waves in the reproductive tract muscles, which are thematically not very well integrated into the paper. Although very interesting, one wonders if these experiments would not fit better into a future work that also explores these peristaltic waves and their interrelation with neuromodulation mechanistically. 

      Reviewer #2 (Public review): 

      Summary: 

      Cheverra et al. present a comprehensive anatomical and functional analysis of the motor neurons innervating the male reproductive tract in Drosophila melanogaster, addressing a gap in our understanding of the peripheral circuits underlying ejaculation and male fertility. They identify two classes of multi-transmitter motor neurons, OGNs (octopamine/glutamate) and SGNs (serotonin/glutamate), with distinct innervation patterns across reproductive organs. The authors further characterize the differential expression of glutamate, octopamine, and serotonin receptors in both epithelial and muscular tissues of these organs. Behavioral assays reveal that SGNs are essential for male fertility, whereas OGNs and glutamatergic transmission are dispensable. This work provides a high-resolution map linking neuromodulatory identity to organ-specific motor control, offering a valuable framework to explore the neural basis of male reproductive function.

      Strengths: 

      Through the use of an extensive set of GAL4 drivers and antibodies, this work successfully and precisely defines the neurons that innervate the male reproductive tract, identifying the specific organs they target and the nature of the neurotransmitters they release. It also characterizes the expression patterns and localization of the corresponding neurotransmitter receptors across different tissues. The authors describe two distinct groups of dual-identity neurons innervating the male reproductive tract: OGNs, which co-express octopamine and glutamate, and SGNs, which co-express serotonin and glutamate. They further demonstrate that the various organs within the male reproductive system differentially express receptors for these neurotransmitters. Based on these findings, the authors propose that a single neuron capable of co-releasing a fast-acting neurotransmitter alongside a slower-acting one may more effectively synchronize and stagger events that require precise timing. This, together with the differential expression of ionotropic glutamate receptors and metabotropic aminergic receptors in postsynaptic muscle tissue, adds an additional layer of complexity to the coordinated regulation of fluid secretion, organ contractility, and directional sperm movement, all contributing to the optimization of male fertility.

      Weaknesses: 

      The main weakness of the manuscript is the lack of detail in the presentation of the results. Specifically, all microscopy image figures are missing information about the number of samples (N), and in the case of colocalization experiments, quantitative analyses are not provided. Additionally, in the first behavioral section, it would be beneficial to complement the data table with figures similar to those presented later in the manuscript for consistency and clarity. 

      Wider context: 

      This study delivers the first detailed anatomical map connecting multi-transmitter motor neurons with specific male reproductive structures. It highlights a previously unrecognized functional specialization between serotonergic and octopaminergic pathways and lays the groundwork for exploring fundamental neural mechanisms that regulate ejaculation and fertility in males. The principles uncovered here may help explain how males of Drosophila and other organisms adjust reproductive behaviors in response to environmental changes. Furthermore, by shedding light on how multi-transmitter systems operate in reproductive control, this model could provide insights into therapeutic targets for conditions such as male infertility and prostate cancer, where similar neuronal populations are involved in humans. Ultimately, this genetically accessible system serves as a powerful tool for uncovering how multi-transmitter neurons orchestrate coordinated physiological actions necessary for the functioning of complex organs. 

      Reviewer #3 (Public review): 

      Summary: 

      This work provides an overview of the motor neuron landscape in the male reproductive system. Some work had been done to elucidate the circuits of ejaculation in the spine, as well as the cord, but this work fills a gap in knowledge at the level of the reproductive organs. Using complementary approaches, the authors show that there are two types of motor neurons that are mutually exclusive: neurons that co-express octopamine and glutamate and neurons that co-express serotonin and glutamate. They also show evidence that both types of neurons express large dense core vesicles, indicating that neuropeptides play a role in male fertility. This paper provides a thorough characterization of the expression of the different glutamate, octopamine, and serotonin receptors in the different organs and tissues of the male reproductive system. The differential expression in different tissues and organs allows building initial theories on the control of emission and expulsion. Additionally, the authors characterize the expression of synaptic proteins and the neuromuscular junction sites. On a mechanistic level, the authors show that neither octopamine/glutamate neuron transmission nor glutamate transmission in serotonin/glutamate neurons is required for male fertility. This final result is quite surprising and opens up many questions on how ejaculation is coordinated. 

      Strengths: 

      This work fills an important gap in the characterization of innervation of the male reproductive system by providing an extensive characterization of the motor neurons and the potential receptors of motor neuron release. The authors show convincing evidence of glutamate/monoamine co-release and of mutual exclusivity of serotonin/glutamate and octopamine/glutamate neurons. 

      Weaknesses: 

      (1) Often, it is mentioned that the expression is higher or lower or regional without quantification or an indication of the number of samples analysed. 

      (2) The experiment aimed at tracking sperm in the male reproductive system is difficult to interpret when it is not assessed whether ejaculation has occurred. 

      (3) The experiment looking at peristaltic waves in the male organs is missing labeling of the different regions and quantification of the observed waves. 

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors): 

      (1) While the peripheral innervations are very carefully described, it is not clear to which SGNs and OGNs (i.e., cell bodies in the central nervous system) these innervations belong. Are SV, AG, and ED innervated by branches of one neuron or by separate neurons? Multi-color flip-out experiments could provide an answer to this. 

      We agree this is important and are planning these experiments for a follow-up study.

      (2) In contrast, for the analysis of the VT19028 split line (Figure 9), only vnc and cell body images are shown. How do the arborisations of these split combinations look in the periphery? Are the same reproductive organs innervated as shown in Figure 2?

      Figure 9S3 was inadvertently omitted from the initial submission.  That figure is now included and shows that the VT019028 split broadly innervates the SV, AG, and ED.

      (3) In the discussion, I think it would be helpful to offer some potential explanations for the role of octopaminergic and glutamatergic signaling. If not required for basic fertility, they probably have some other role.

      Thank you, we have included speculation in the Discussion section "Potential for adaptation to environment".

      (4) Line 543: Figure 8S4 E, (not 8E). 

      Correction made.

      Reviewer #2 (Recommendations for the authors): 

      (1) Line 213-217 

      Comment:

      The use of "significantly less expression" may be misleading, as no quantification or statistical analysis is provided to support this comparison. 

      Suggestion:

      Consider using a more neutral term, such as "markedly less" or "noticeably less," unless quantitative data and statistical analysis are included to substantiate the claim.

      Good recommendation. This suggestion has been incorporated.

      (2) Line 264-267 

      Comment:

      The observation regarding the distinct morphology of SGNs and OGNs is interesting and could strengthen the argument regarding functional differences. 

      Suggestion: 

      Consider including a quantification of morphological complexity (e.g., branching) to support the claim. A method such as Sholl analysis (Sholl, 1953), as adapted in Fernández et al., 2008, could be applied. 

      This is a good suggestion, and we will consider it as part of a follow-up study.
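
      For readers unfamiliar with Sholl analysis, the idea is to count how many neurite segments cross concentric circles of increasing radius around a reference point. The sketch below is a simplified illustration, not the procedure of Sholl (1953) or Fernández et al. (2008) as implemented in any particular software. The function name `sholl_counts` and the representation of an arbor as straight 2D line segments are assumptions, and a segment is counted once when its endpoints straddle a given radius (so a chord crossing the same circle twice is missed).

```python
import math


def sholl_counts(segments, center, radii):
    """Count, for each radius, the segments whose endpoints straddle the
    circle of that radius around center. Segments are ((x1, y1), (x2, y2))."""
    cx, cy = center
    counts = []
    for r in radii:
        n = 0
        for (x1, y1), (x2, y2) in segments:
            d1 = math.hypot(x1 - cx, y1 - cy)
            d2 = math.hypot(x2 - cx, y2 - cy)
            if min(d1, d2) < r <= max(d1, d2):
                n += 1
        counts.append(n)
    return counts


# Toy Y-shaped arbor: one trunk that splits into two branches.
arbor = [((0, 0), (0, 10)), ((0, 10), (-5, 20)), ((0, 10), (5, 20))]
counts = sholl_counts(arbor, center=(0, 0), radii=[5, 15, 25])
print(counts)
```

      Comparing such intersection profiles between SGN and OGN arbors would give a quantitative measure of the difference in branching complexity that the comment refers to.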

      (3) Line 269-271 

      Comment:

      The anatomical context of the observation is not explicitly stated. 

      Suggestion:

      Add "in the ED" for clarity: "With the TRH-GAL4 experiment in the ED, vGlut-40XMYC (Figure 5S1, A and E) and 6XV5-vMAT (Figure 5S1, B and F) were both present with a highly overlapping distribution (Figure 5S1, I)." 

      Suggestion has been incorporated.

      (4) Line 275-276 

      Comment:

      The claim about the reduced ability to distinguish SGNs and OGNs in the ED would benefit from quantitative support. 

      Suggestion:

      Include a morphological comparison or quantification between SGNs and OGNs in the ED and SV to reinforce this point.

      Certain information on morphological comparison can be inferred within the images themselves, and we will include quantitation in a follow-up study.

      (5) Line 277-279 

      Comment:

      As with line 269, the anatomical site could be specified more clearly. 

      Suggestion: 

      Rephrase as: "With the Tdc2-GAL4 experiment in the ED, vGlut-40XMYC (Figure 5S1, M and Q) and 6XV5-vMAT (Figure 5S1, N and R) were both observed in a highly overlapping distribution (Figure 5S1, U)." 

      Suggestion has been incorporated.

      (6) Line 348-350 

      Comment:

      The phrase "significantly higher density" implies a statistical comparison that is not shown. 

      Suggestion:

      If no quantification is provided, replace with a qualitative term such as "visibly higher" or "notably more dense." Alternatively, add a quantitative analysis with statistical testing to justify the use of "significantly." 

      Suggestion has been incorporated.

      (7) Lines 415-458 (Section comment) 

      Comment:

      There appears to be differential localization of neurotransmitter receptor expression (glutamate in muscle vs. 5-HT in epithelium or neurons), which could have functional implications. 

      Suggestion:

      Expand this section to briefly discuss the differential localization patterns of these receptors and potential implications for signal transduction in male reproductive tissues. 

      (8) Lines 638-682 (Section comment) 

      Comment:

      The table summarizing fertility phenotypes would be more informative with additional detail on experimental outcomes. 

      Suggestion:

      Add a column showing the number of fertile males over the total tested (e.g., "n fertile / n total"). Also, clarify whether the fertility assays are identical to those reported in Figure 10S2, and whether similar analyses were conducted for females. Consider including a figure summarizing fertility results for all genotypes listed in the table, similar to Figure 10S2. 

      The fertility tests reported in Table 1 were separate from those reported in Figure 10S2.  For these tests, the results were clear-cut with 100% of males and females reported as infertile exhibiting the infertile phenotype.  For the males and females reported as fertile, it was also clear-cut with nearly 100% showing fertility at a high level.  In subsequent figures we attempted to assess degrees of fertility.

      (9) Line 724-727 

      Comment:

      There seems to be a mistake in the identification of the driver lines used to silence OA neurons. Also, figure references might be incorrect. 

      Suggestion:

      The OA neuron driver line should be corrected to "Tdc2-GAL4-DBD ∩ AbdB-AD" instead of TRH-GAL4. Additionally, the figure references should be verified; specifically, the letter "B" (in "Figure 10B, D" and "10B, E") appears to be unnecessary or misplaced.

      Thanks for catching this, the corrections have been made.

      (10) Line 872-877 

      Comment:

      The discussion on the co-release of fast-acting glutamate and slower aminergic neurotransmitters is interesting and well-articulated. However, it remains somewhat disconnected from the behavioral findings. 

      Suggestion:

      Consider linking this proposed mechanism to the results observed in the mating duration assays. For instance, the sequential action of neurotransmitters described here could potentially underlie the prolonged mating observed when specific neuromodulators are active, helping to functionally integrate molecular and behavioral data. 

      (11) Line 926-928 

      Comment:

      The interpretation of 5-HT7 receptor expression in the sphincter is compelling, suggesting a role in regulating its function. However, this anatomical observation could be further contextualized with the functional data. 

      Suggestion:

      It may strengthen the interpretation to explicitly connect this finding with the fertility assays, where SGNs - presumably acting via serotonergic signaling - are shown to be necessary for male fertility. This would support a functional role for 5-HT7 in reproductive success via sphincter regulation.

      This has been added. 

      (12) Figure 1 

      Comment:

      The figure legend is generally clear, but could benefit from more consistency and precision in the color-coded labeling. Additionally, the naming of some structures could be more explicit. 

      Suggestion: 

      Revise the figure and the legend as follows:

      Figure 1. The Drosophila male reproductive system. A) Schematic diagram showing paired testes (colour), SVs (green), AGs (purple), Sph (red), ED (gray), and EB (colour). B) Actual male reproductive system. Te - testes, SV - seminal vesicle, AG - accessory gland, Sph - singular sphincter, ED - ejaculatory duct, EB - ejaculatory bulb. Scale bar: 200 µm.

      This suggestion has been incorporated.

      (13) Figure 3S2 

      Comment:

      There appears to be a typographical error in the description of the genotypes, which may lead to confusion. 

      Suggestion:

      Correct the legend to reflect the appropriate genotypes:

      Figure 3S2. Expression of vGlut-LexA and Tdc2-GAL4 in the Drosophila male reproductive system. A, D, G, J, M, P) vGlut-LexA, LexAop-6XmCherry; B, E, H, K, N, Q) Tdc2-GAL4, UAS-6XGFP; C, F, I, L, O, R) Overlay. Scale bars: O - 50 µm; R - 10 µm.

      The corrections have been made.

      (14) Figure 3S3

      Comment:

      The genotypes for panels D and E appear to be incomplete; the DBD component of the split-GAL4 drivers is missing. 

      Suggestion:

      Update the figure legend to: 

      Figure 3S3. Fruitless and Doublesex expression in the Drosophila male reproductive system. A) fru-GAL4, UAS-6XGFP; B) vGlut-LexA, LexAop-6XmCherry; C) Overlay; D) Tdc2-AD ∩ dsx-GAL4-DBD; E) TRH-AD ∩ dsx-GAL4-DBD. Scale bar: 200 µm.

      The corrections have been made.

      (15) Figure 4S4 

      Comment: 

      There is a repeated segment in the figure legend, which makes it unclear and redundant. 

      Suggestion:

      Edit the legend to remove the duplicated lines: 

      Figure 4S4. Expression of vGlut, TβH-GFP, and 5-HT at the junction of the SV and AGs with the ED of the Drosophila male reproductive system. A) vGlut-40XV5; B) TβH-GFP; C) 5-HT; D) vGlut-40XV5, TβH-GFP overlay; E) vGlut-40XV5, 5-HT overlay; F) TβH-GFP, 5-HT overlay. Scale bar: 50 µm.

      The correction has been made.

      (16) Figure 6S5 

      Comment:

      Within this figure, the orientation and/or scale of the tissue varies noticeably between individual panels, making it difficult to directly compare the different experimental conditions. 

      Suggestion:

      For improved clarity and interpretability, consider standardizing the orientation and size of the tissue shown across all panels within the figure. Consistent presentation will facilitate direct comparisons between treatments or genotypes. 

      There is often variation in the size of the male reproductive organs. They were all acquired at the same magnification. The only point of this figure is that there is no vGAT or vAChT at these NMJs, and the result is unambiguously negative.

      (17) Figure 10 

      Comment:

      Panel A appears redundant, as it shows the same information as the other panels but without indicating statistical significance. 

      Suggestion:

      Consider removing panel A and keeping only the remaining four graphs, which include relevant statistical comparisons and clearly show significant differences.

      We realize there is some redundancy of panel A with the other panels, but we feel there is value in having all the genotypes in a single panel for comparison.

      Reviewer #3 (Recommendations for the authors): 

      Here are some suggestions to improve the manuscript: 

      (1) Prot B GFP experiment: the authors should explain better the time chosen to look at the sperm content of the male reproductive system. At 10 minutes, it is expected that the male has already ejaculated, and therefore, a failure to ejaculate would result in more sperm in the reproductive system, not less. Since we are not certain when the male ejaculates, it would be important to do the analysis at different time points.

      In the Prot-GFP experiments, the 10-minute time point was chosen because we nearly always observe sperm in the ejaculatory duct of control males. In the experimental males, we never observed sperm in the ejaculatory duct at this time point. Also, no Prot-GFP sperm were observed in the reproductive tract of females mated to experimental males even when mating was allowed to go to completion, while abundant sperm were found in females mated to Prot-GFP controls. Figure 10S1 has been updated to include images of these female reproductive systems. The absence of Prot-GFP sperm in the reproductive tract of females mated to experimental males indicates that sperm transfer in these males is not occurring earlier during copulation than in control males, and that we did not miss it by examining only the ejaculatory duct.

      (2) Discuss what may be the role of the octopamine/glutamate neurons and glutamate transmission in serotonin/glutamate neurons in the male reproductive system, given that they are not required for fertility (at least under the context in which it was tested). It is quite a striking result that deserves some attention. 

      We agree it is a surprising result and have included speculation on the role of glutamate and octopamine in male reproduction in the Discussion section "Potential for adaptation to environment".

      (3) Very important: 

      (a) Figure 3 is present in the Word document but not the PDF. 

      (b) Figure 9S3 is not present 

      (c) In Figure 5 X), the legend does not correspond to the panel.

      All of these corrections have been made. 

      (4) Other suggestions:

      (a) A summary schematic (or several) of the findings would make it an easier read.

      (b) Explain why the ejaculatory bulb was left out of the analysis.

      (c) Explain in the main text some of the tools, such as, BONT-C and the conditional vGlut mutation.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review): 

      Summary: 

      In this paper, the authors developed a chemical labeling reagent for P2X7 receptors, called X7-uP. This labeling reagent selectively labels endogenous P2X7 receptors with biotin based on ligand-directed NASA chemistry (Ref. 41). After labeling the endogenous P2X7 receptor with biotin, the receptor can be fluorescently labeled with streptavidin-AlexaFluor647. The authors carefully examined the binding properties and labeling selectivity of X7-uP to P2X7, characterized the labeling site of P2X7 receptors, and demonstrated fluorescence imaging of P2X7 receptors. The data obtained by SDS-PAGE, Western blot, and fluorescence microscopy clearly show that X7-uP labels the P2X7 receptor. Finally, the authors fluorescently labeled the endogenous P2X7 in BV2 cells, which are a murine microglia model, and used dSTORM to reveal a nanoscale P2X7 redistribution mechanism under inflammatory conditions at high resolution. 

      Strengths: 

      X7-uP selectively labels endogenous P2X7 receptors with biotin. Streptavidin-AlexaFluor647 binds to the biotin labeled to the P2X7 receptor, allowing visualization of endogenous P2X7 receptors. 

      We thank the reviewer for their positive comment.

      Weaknesses: 

      Weaknesses & Comments 

      (1) The P2X7 receptor exists in a trimeric form. If it is not a monomer under the conditions of the pull-down assay in Figure 2C, the quantitative values may not be accurate. 

      We thank the reviewer for this comment. As shown in Figure 2C, the band observed on the denaturing SDS-PAGE corresponds to the monomeric form of the P2X7 receptor. While we cannot exclude the presence of non-monomeric species under native conditions, no such higher-order forms are visible in the gel. This observation supports the conclusion that the quantitative values presented are based on the monomeric form and are therefore reliable.

      (2) In Figure 3, GFP fluorescence was observed in the cell. Are all types of P2X receptors really expressed on the cell surface?

      We thank the reviewer for this excellent comment, which was also raised by reviewer 2. To address this concern, we performed a commercial cell-surface protein biotinylation assay to assess whether GFP-tagged P2X receptors reach the plasma membrane. As expected, all P2X subtypes except P2X6 were detected at the cell surface in HEK293T cells, thereby validating our confocal fluorescence microscopy assay. These new data are now included in Figure 3 — figure supplement 1.

      (3) The reviewer was not convinced of the advantages of the approach taken in this paper, because the endogenous receptor labeling in this study could also be done using conventional antibody-based labeling methods. 

      We thank the reviewer for raising this important point and would like to highlight several advantages of our approach compared to conventional antibody-based labeling.

      First, commercially available P2X7 antibodies often suffer from poor specificity and are generally not suitable for reliably detecting endogenous P2X7 receptors, as documented in previous studies (e.g., PMID: 16564580 and PMID: 15254086). While recent advances have been made using nanobodies with improved specificity for P2X7 (e.g., PMID: 30074479 and PMID: 38953020), our strategy is distinct and complementary to nanobody-based approaches.

      Second, antibodies rely on non-covalent interactions with the receptor, which can result in dissociation over time. In contrast, our X7-uP probe covalently biotinylates lysine residues on the P2X7 receptor through stable amide bond formation. This covalent labeling ensures that the biotin moiety remains permanently attached, an advantage not afforded by reversible binding strategies.

      Third, by selectively biotinylating P2X7 receptors, our method provides a versatile platform for the chemical attachment of a wide range of probes or functional moieties. Although we did not demonstrate this application in the current study, we believe this modularity represents an additional advantage of our approach.

      We have now revised the discussion to highlight these key advantages, allowing the reader to form their own opinion. We hope this addresses the reviewer’s concerns and clarifies the benefits of our approach.

      (4) Although P2X7 was successfully labeled in this paper, it is not new as a chemistry. There is a need for more attractive functional evaluation such as live trafficking analysis of endogenous P2X7. 

      We agree with the reviewer that the underlying chemistry is not novel per se. However, to our knowledge, it has not previously been applied to the P2X7 receptor, and thus constitutes a novel application with specific relevance for studying native P2X7 biology.

      We also appreciate the reviewer’s suggestion regarding live trafficking analysis of endogenous P2X7. While this is indeed a valuable and interesting direction, we believe it lies beyond the scope of the present study, as it would first require demonstrating that the labeling itself does not affect P2X7 function (see below). This important step would necessitate additional experiments, which we consider more appropriate for a follow-up investigation.

      (5) The reviewer has concerns that the use of the large-size streptavidin to label the P2X7 receptor may perturb the dynamics of the receptor.

      We thank the reviewer for raising this important point. Although we did not directly measure receptor dynamics, it is indeed possible that tetrameric streptavidin (tStrept-A 647) could promote P2X7 clustering by cross-linking nearby receptors due to its tetravalency (see also point 7 raised by the reviewer). To address this concern, we performed additional dSTORM experiments using a monomeric form of streptavidin-Alexa 647 (mSA) (see PMID: 26979420). Owing to its reduced size and lack of tetravalency, mSA has been shown to minimize artificial crosslinking of synaptic receptors (PMID: 26979420). A drawback of using mSA, however, is that the monomeric form carries only two fluorophores (estimated degree of labeling, DOL ≈ 2, PMID: 26979420), whereas the tetrameric form, according to the manufacturer’s certificate of analysis (Invitrogen S21374), has an average DOL of three fluorophores per monomer, resulting in a total of ~12 fluorophores per streptavidin.

      We tested three conditions with mSA incubation: (i) control BV2 cells (without X7-uP), (ii) untreated X7-uP-labeled BV2 cells, and (iii) X7-uP-labeled BV2 cells treated with LPS and ATP (using the same concentrations and incubation times described in the manuscript). As shown in Author response image 1, only LPS+ATP treatment induced a clear increase in the mean cluster density compared to quiescent (untreated) BV2 cells. This effect closely matches the results obtained with tStrept-A 647, supporting the conclusion that tetrameric streptavidin does not artificially promote P2X7 clustering. It is also possible that the cellular environment of BV2 microglia differs from the confined architecture of synapses, which may further explain why cross-linking effects are less pronounced in our system.

      As expected, the overall fluorescence signal with mSA was about tenfold lower than with tStrept-A 647, consistent with the expected fluorophore stoichiometry. This lower signal may explain why the values for the untreated condition appeared slightly higher than for the control, although the difference was not statistically significant (P = 0.1455).

      We hope these additional experiments adequately address the reviewer’s concerns.

      Author response image 1.

      BV2 labeling with monomeric streptavidin–Alexa 647 (mSA). (A) Bright-field and dSTORM images of BV2 cells labeled with mSA in the presence (untreated and LPS+ATP) or absence (control) of 1 µM X7-uP. Treatment: LPS (1 µg/mL for 24 hours) and ATP (1 mM for 30 minutes). Scale bars, 10 µm. Insets: Magnified dSTORM images. Scale bars, 1 µm. (B) Quantification of the number of localizations (n = 2 independent experiments). Bars represent mean ± s.e.m. One-way ANOVA with Tukey’s multiple comparisons (P values are indicated above the graph).

      (6) It is better to directly label Alexa647 to the P2X7 receptor to avoid functional perturbation of P2X7. 

      Directly labeling the P2X7 receptor with Alexa647 would require the design and synthesis of a novel probe, which is currently not available. Implementing such a strategy would involve substantial new experimental work that lies beyond the scope of the present study.

      (7) In all imaging experiments, the addition of streptavidin, which acts as a cross-linking agent, may induce P2X7 receptor clustering. This concern would be dispelled if the receptors were labeled with a fluorescent dye instead of biotin and observed. 

      We refer the reviewer to our response in point 5, where we addressed this concern by comparing tetrameric and monomeric streptavidin conjugates. As noted above (see also point 6), directly labeling the receptor with a fluorescent dye would require the development of a new probe, which is outside the scope of the present study.

      (8) There are several mentions of microglia in this paper, even though they are not used. This can lead to misunderstanding for the reader. The author conducted functional analysis of the P2X7 receptor in BV-2 cells, which are a model cell line but not microglia themselves. The text should be reviewed again and corrected to remove the misleading parts that could lead to misunderstanding. e.g. P8. lines 361-364

      First, it combines N-cyanomethyl NASA chemistry with the high-affinity AZ10606120 ligand, enabling rapid labeling in microglia (within 10 min)

      P8. lines 372-373 

      Our results not only confirm P2X7 expression in microglia, as previously reported (6, 26-33), but also reveal its nanoscale localization at the cell surface using dSTORM. 

      We agree with the reviewer’s comment. We have now modified the text, including the title.

      Reviewer #2 (Public review): 

      Summary: 

      In this manuscript, Arnould et al. develop an unbiased, affinity-guided reagent to label P2X7 receptor and use super-resolution imaging to monitor P2X7 redistribution in response to inflammatory signaling.

      Strengths: 

      I think the X7-uP probe that they developed is very useful for visualizing localization of P2X7 receptor. They convincingly show that under inflammatory conditions, there is a reorganization of P2X7 localization into receptor clusters. Moreover, I think they have shown a very clever way to specifically label any receptor of interest. This has broad appeal.

      We thank the reviewer for their positive comment.

      Weaknesses: 

      Overall, the manuscript is novel and interesting. However, I do have some suggestions for improvement. 

      (1) While the authors state that chemical modification of AZ10606120 to produce the X7-uP reagent has "minimal impact" on the inhibition of P2X7, we can see from Figure 2A and 2B that it does not antagonize P2X7 as effectively as the original antagonist. For the sake of completeness and quantitation, I think it would be great if the authors could determine the IC50 for X7-uP and compare it to the IC50 of AZ10606120.

      We thank the reviewer for this insightful comment. Unfortunately, due to the limited availability of X7-uP, we were not able to establish a complete concentration–response curve to determine its IC<sub>50</sub>, which would require testing at concentrations >1 µM. Nevertheless, to estimate the effect of the modification, we assessed current inhibition at 300 µM X7-uP and compared it with the reported IC<sub>50</sub> of AZ10606120 (10 nM). Under these conditions, both compounds produced a similar level of inhibition, indicating that while the chemical modification reduces potency relative to AZ10606120, X7-uP still functions as an effective probe for P2X7. We have now included these data in Figure 2 and revised the text accordingly.

      (2) Do the authors know whether modification of the lysines with biotin affects the receptor's affinity for ATP (or ability to be activated by ATP)? What about P2X7 that has been modified with biotin and then labeled with Alexa 647? For the sake of completeness and quantitation, I think it would be great if the authors could determine the EC50 of biotinylated P2X7 for ATP as well as biotinylated and then Alexa 647 labeled P2X7 for ATP and compare these values to the affinity of unmodified WT P2X7 for ATP.

      We thank the reviewer for raising this important point. At present, we have not determined whether modification of lysine residues with biotin, or subsequent labeling with Alexa647, affects the ATP sensitivity or functional properties of P2X7. However, we believe this does not impact the conclusions of the current study, as all functional assays were conducted prior to X7-uP labeling. The labeling is used here as a terminal "snapshot" to visualize the endogenous receptor without interfering with the functional characterization.

      We fully agree that assessing the functional integrity of P2X7 following biotinylation and fluorophore labeling—such as by determining the EC<sub>50</sub> for ATP—would be essential for studies involving dynamic or post-labeling functional analyses, such as live trafficking. However, as noted earlier in our response to Reviewer 1 (point 4), these experiments lie beyond the scope of the current study.

      (3) It is a little misleading to color the fluorescence signal from mScarlet green (for example, in Figure 3 and Figure 4). The fluorescence is not at the same wavelength as GFP. In fact, the wavelength (570 nm - 610 nm) for emission is closer to orange/red than to green. I think this color should be changed to differentiate the signal of mScarlet from the GFP signal used for each of the other P2X receptor subtypes. 

      As suggested, we changed the mScarlet color to orange for all relevant figures.

      (4) It is my understanding that P2X6 does not form homotrimers. Thus, I was a little surprised to see that the density and distribution of P2X6-GFP in Figure 3 looks very similar to the density and distribution of the other P2X subtypes. Do the authors have an explanation for this? Are they looking at P2X6 protomers inserted into the plasma membrane? Does the cell line have endogenous P2X receptor subtypes? Is Figure 3 showing heterotrimers with P2X6 receptor? A little explanation might be helpful.

      We thank the reviewer for raising this important point. Indeed, it is well established that P2X6 does not form functional channels, which supports the conclusion that it does not form homotrimeric complexes. Although previous studies have shown that P2X6–GFP expression is generally lower, more diffuse, and not efficiently targeted to the cell surface compared with other P2X subtypes (see PMID: 12077178), the similar fluorescence distribution and density observed in our Figure 3 do not imply that P2X6 forms homotrimers.

      We did not directly assess the presence of endogenous P2X6 in our HEK293T cells; however, according to the Human Protein Atlas, there is no detectable P2X6 RNA expression in HEK293 cells (nTPM = 0), indicating that endogenous P2X6 is not expressed in this cell line. To further investigate surface expression (see also point 2 of reviewer 1), we performed a commercial cell-surface protein biotinylation assay to assess whether GFP-tagged P2X6 reaches the plasma membrane. As expected, P2X6 was not detected at the cell surface in HEK293T cells, whereas GFP-tagged P2X1 to P2X5 were readily detected. These results further support the conclusion that P2X6 does not insert into the plasma membrane as a homotrimer, thereby validating our confocal fluorescence microscopy assay. These new data are now included in Figure 3 — figure supplement 1.

      (5) It is easy to overlook the fact that the antagonist leaves the binding pocket once the biotin has been attached to the lysines. It might be helpful if the authors made this a little more apparent in Figure 1 or in the text describing the NASA chemistry reaction.

      We thank the reviewer for this insightful suggestion. To address this, we have modified Figure 1A and updated the legend.

      Reviewer #3 (Public review): 

      Summary: 

      This manuscript describes the development of a covalent labeling probe (X7-uP) that selectively targets and tags native P2X7 receptors at the plasma membrane of BV2 microglial cells. Using super-resolution imaging (dSTORM), the authors demonstrate that P2X7 receptors form nanoscale clusters upon microglial activation by lipopolysaccharide (LPS) and ATP, correlating with synergistic IL-1β release. These findings advance understanding of P2X7 reorganization during inflammation and provide a generalizable labeling strategy for monitoring endogenous P2X7 in immune cells. 

      Strengths: 

      (1) The authors designed X7-uP by coupling a high-affinity, P2X7-specific antagonist (AZ10606120) with N-cyanomethyl NASA chemistry to achieve site-directed biotinylation. This approach offers high specificity, minimal off-target reactivity, and a straightforward pull-down/imaging readout. 

      (2) The results connect P2X7's nanoscale clustering directly with IL-1β secretion in microglia, reinforcing the role of P2X7 in inflammation. By localizing endogenous P2X7 at single-molecule resolution, the authors reveal how LPS priming and ATP stimulation synergistically reorganize the receptor. 

      (3) The authors systematically validate their method in recombinant systems (HEK293 cells) and in BV2 cells, showing selective inhibition, mutational confirmation of the binding site, and Western blot pulldown experiments.

      We thank the reviewer for their positive comment.

      Weaknesses: 

      (1) While the data strongly indicate that P2X7 clustering contributes to IL-1β release, the manuscript would benefit from additional experiments (if feasible) or discussion on how receptor clustering interfaces with downstream inflammasome assembly. Clarification of whether the P2X7 clusters physically colocalize with known inflammasome proteins would solidify the mechanism. 

      We thank the reviewer for this valuable suggestion. Determining the physical colocalization of P2X7 clusters with known inflammasome components would provide important insight into the molecular partners involved in inflammasome activation. However, we believe that such an investigation would constitute a substantial study on its own and therefore lies beyond the scope of the present work.

      Nevertheless, in response to the reviewer’s suggestion, we have added a short paragraph at the end of the Discussion section addressing potential mechanisms by which P2X7 clustering may contribute to downstream inflammasome activation. We also revised the text to tone down the hypothesis of physical colocalization.

      (2) The authors might expand on the scope of X7-uP in other native cells that endogenously express P2X7 (e.g., macrophages, dendritic cells). Although they mention the possibility, demonstrating the probe's applicability in at least one other primary immune cell type would strengthen its general utility. 

      We thank the reviewer for this valuable suggestion. Again, we believe that such an investigation would constitute a substantial study on its own and therefore lies beyond the scope of the present work.

      (3) The authors do include appropriate negative controls, yet providing additional details (e.g., average single-molecule on-time or blinking characteristics) in supplementary materials could help readers assess cluster calculations. 

      As suggested, we have included additional data showing single-molecule blinking events in untreated and LPS+ATP-treated BV2 cells, along with the corresponding movies. The data are now presented in Figure 5—figure supplement 3A and B and Figure 5—Videos 1 and 2.

      Recommendations for the authors:

      Reviewer #2 (Recommendations for the authors): 

      (1) On line 96, the authors refer to the "ballast" domain of P2X7 receptor but do not cite the original article from which this nomenclature originated (McCarthy et al., 2019, Cell). This article should be cited to give appropriate credit. 

      Done.

      (2) On line 602, the authors state that they use models from PDB 1MK5 and 6U9W to generate the cartoons in Figure 6. The manuscripts from which these PDB files were generated need to be appropriately cited. 

      Done.

      (3) On line 319, the authors say "300 mM BzATP" but I think they mean 300 uM.

      Done. Thank you for catching the typo.

      Reviewer #3 (Recommendations for the authors): 

      Overall, excellent data quality. The paper would benefit from a discussion of the physiological implications of clustering. It would also be helpful to elaborate on the potential mechanisms for clustering: diffusion and/or insertion. Finally, the authors should comment on work by the MacKinnon (PMID: 39739811) and Santana (PMID: 31371391) labs on two distinct models for clustering of proteins.

      As suggested by the reviewer, we have revised the discussion to incorporate their comments. First, we have added the following text:

      “Upon BV2 activation, we observed significant nanoscale reorganization of P2X7. Both LPS and ATP (or BzATP) trigger P2X7 upregulation and clustering, increasing the overall number of surface receptors and the number of receptors per cluster, from one to three (Figure 6). By labeling BV2 cells with X7-uP shortly after IL-1β release, we were able to correlate the nanoscale distribution of P2X7 with the functional state of BV2 cells, consistent with the two-signal, synergistic model for IL-1β secretion observed in microglia and other cell types (Ferrari et al, 1996; Perregaux et al, 2000; Ferrari et al, 2006; Di Virgilio et al, 2017; He et al, 2017; Swanson et al, 2019). In this model, LPS priming leads to intracellular accumulation of pro-IL-1β, while ATP stimulation activates P2X7, triggering NLRP3 inflammasome activation and the subsequent release of mature IL-1β.

      What is the mechanism underlying P2X7 upregulation that leads to an overall increase in surface receptors—does it result from the lateral diffusion of previously masked receptors already present at the plasma membrane, or from the insertion of newly synthesized receptors from intracellular pools in response to LPS and ATP? Although our current data do not distinguish between these possibilities, a recent study suggests that the α1 subunit of the Na<sup>+</sup>/K<sup>+</sup>-ATPase (NKAα1) forms a complex with P2X7 in microglia, including BV2 cells, and that LPS+ATP induces NKAα1 internalization (Huang et al, 2024). This internalization appears to release P2X7 from NKAα1, allowing P2X7 to exist in its free form. We speculate that the internalization of NKAα1 induced by both LPS and ATP exposes previously masked P2X7 sites, including the allosteric AZ10606120 sites, thus making them accessible for X7-uP labeling.”

      Second, we have added a short paragraph at the end of the Discussion section addressing potential mechanisms by which P2X7 clustering may contribute to downstream inflammasome activation:

      “What mechanisms underlie P2X7 clustering in response to inflammatory signals? Several models have been proposed to explain membrane protein clustering, including recruitment to structural scaffolds (Feng & Zhang, 2009), partitioning into membrane domains enriched in specific chemical components such as lipid rafts (Simons & Ikonen, 1997), and self-assembly mechanisms (Sieber et al, 2007). These self-assembly mechanisms include an irreversible stochastic model (Sato et al, 2019) and a more recent reversible self-oligomerization model which gives rise to higher-order transient structures (HOTS) (Zhang et al, 2025). Supported by cryogenic optical localization microscopy with very high resolution (~5 nm), the HOTS model has been observed in various membrane proteins, including ion channels and receptors (Zhang et al, 2025). Furthermore, HOTS are suggested to be dynamically modulated and to play a functional role in cell signaling, potentially influencing both physiological and pathological processes (Zhang & MacKinnon, 2025). While this hypothesis is compelling, our current dSTORM data lack sufficient spatial resolution to confirm whether P2X7 trimers form HOTS via self-oligomerization. Further biophysical and ultra-high-resolution imaging studies are required to test this model in the context of P2X7 clustering.”

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Reviewer #1 (Public review):

      Summary:

      This manuscript by Pournejati et al. investigates how BK (big potassium) channels and Ca<sub>V</sub>1.3 (a subtype of voltage-gated calcium channels) become functionally coupled by exploring whether their ensembles form early, during synthesis and intracellular trafficking, rather than only after insertion into the plasma membrane. To this end, the authors use the PLA technique to assess the formation of ion channel associations in the different compartments (ER, Golgi or PM), single-molecule RNA in situ hybridization (RNAscope), and super-resolution microscopy.

      Strengths:

      The manuscript is well written and addresses an interesting question, combining a range of imaging techniques. The findings are generally well-presented and offer important insights into the spatial organization of ion channel complexes, both in heterologous and endogenous systems.

      Weaknesses:

      The authors have improved their manuscript after revisions, and some previous concerns have been addressed.

      Still, the main concern about this work is that the current experiments do not quantitatively or mechanistically link the ensembles observed intracellularly (in the endoplasmic reticulum (ER) or Golgi) to those found at the plasma membrane (PM). As a result, it is difficult to fully integrate the findings into a coherent model of trafficking. Specifically, the manuscript does not address what proportion of ensembles detected at the PM originated in the ER. Without data on the turnover or half-life of these ensembles at the PM, it remains unclear how many persist through trafficking versus forming de novo at the membrane. The authors report the percentage of PLA-positive ensembles localized to various compartments, but this only reflects the distribution of pre-formed ensembles. What remains unknown is the proportion of total BK and Ca<sub>V</sub>1.3 channels (not just those in ensembles) that are engaged in these complexes within each compartment. Without this, it is difficult to determine whether ensembles form in the ER and are then trafficked to the PM, or if independent ensemble formation also occurs at the membrane. To support the model of intracellular assembly followed by coordinated trafficking, it would be important to quantify the fraction of the total channel population that exists as ensembles in each compartment. A comparable ensemble-to-total ratio across ER and PM would strengthen the argument for directed trafficking of pre-assembled channel complexes.

      We appreciate the reviewer’s thoughtful comment and agree that quantitatively linking intracellular hetero-clusters to those at the plasma membrane is an important and unresolved question. Our current study does not determine what proportion of ensembles at the plasma membrane originated during trafficking. It also does not quantify the fraction of total BK and Ca<sub>V</sub>1.3 channels engaged in these complexes within each compartment. Addressing this requires simultaneous measurement of multiple parameters—total BK channels, total Ca<sub>V</sub>1.3 channels, hetero-cluster formation (via PLA), and compartment identity—in the same cell. This is technically challenging. The antibodies used for channel detection are also required for the proximity ligation assay, which makes these measurements incompatible within a single experiment.

      To overcome these limitations, we are developing new genetically encoded tools to enable real-time tracking of BK and Ca<sub>V</sub>1.3 dynamics in live cells. These approaches will enable us to monitor channel trafficking and the formation of hetero-clusters, as detected by colocalization. These experiments will provide insight into their origin and turnover. While these experiments are beyond the scope of the current study, the findings in our current manuscript provide the first direct evidence that BK and Ca<sub>V</sub> channels can form hetero-clusters intracellularly prior to reaching the plasma membrane. This mechanistic insight reveals a previously unrecognized step in channel organization and lays the foundation for future work aimed at quantifying ensemble-to-total ratios and determining whether coordinated trafficking of pre-assembled complexes occurs.

      This limitation is acknowledged in the discussion section, page 23. It reads: “Our findings highlight the intracellular assembly of BK-Ca<sub>V</sub>1.3 hetero-clusters, though limitations in resolution and organelle-specific analysis prevent precise quantification of the proportion of intracellular complexes that ultimately persist on the cell surface.”

      Reviewer #2 (Public review):

      Summary:

      The co-localization of large conductance calcium- and voltage activated potassium (BK) channels with voltage-gated calcium channels (CaV) at the plasma membrane is important for the functional role of these channels in controlling cell excitability and physiology in a variety of systems.

      An important question in the field is where and how BK and CaV channels assemble as 'ensembles' to allow this coordinated regulation: is this through preassembly early in the biosynthetic pathway, during trafficking to the cell surface, or once channels are integrated into the plasma membrane? These questions also have broader implications for assembly of other ion channel complexes.

      Using an imaging-based approach, this paper addresses the spatial distribution of BK-CaV ensembles using both overexpression strategies in tsa201 and INS-1 cells and analysis of endogenous channels in INS-1 cells using proximity ligation and super-resolution approaches. In addition, the authors analyse the spatial distribution of mRNAs encoding BK and CaV1.3.

      The key conclusion of the paper that BK and Ca<sub>V</sub>1.3 are co-localised as ensembles intracellularly in the ER and Golgi is well supported by the evidence. However, whether they are preferentially co-translated at the ER requires further work. Moreover, whether intracellular pre-assembly of BK-Ca<sub>V</sub>1.3 complexes is the major mechanism for functional complexes at the plasma membrane in these models requires more definitive evidence including both refinement of analysis of current data as well as potentially additional experiments.

      The reviewer raises the question of whether BK and Ca<sub>V</sub>1.3 channels are preferentially co-translated. In fact, we would like to propose that co-translation has not yet been clearly defined for this type of interaction between ion channels. In our current work, we 1) observed the colocalization between BK and Ca<sub>V</sub>1.3 mRNAs and 2) determined that 70% of BK mRNA in active translation also colocalizes with Ca<sub>V</sub>1.3 mRNA. We think these results favor the idea of translational complexes that can underlie the process of co-translation. However, and in total agreement with the Reviewer, the conclusion that the mRNAs for the two ion channels are co-translated would require further experimentation. For instance, mRNA coregulation is one aspect that could help to define co-translation.

      To avoid overinterpretation, we have revised the manuscript to remove references to “co-translation” in the Results section and included the word “potential” when referring to co-translation in the Discussion section. We also clarified the limitations of our evidence in the Discussion that can be found on page 25: “It is important to note that while our data suggest mRNA coordination, additional experiments are required to directly assess co-translation.”

      Strengths & Weaknesses

      (1) Using proximity ligation assays of overexpressed BK and CaV1.3 in tsa201 and INS1 cells the authors provide strong evidence that BK and CaV can exist as ensembles (ie channels within 40 nm) at both the plasma membrane and intracellular membranes, including ER and Golgi. They also provide evidence for endogenous ensemble assembly at the Golgi in INS-1 cells and it would have been useful to determine if endogenous complexes are also observed in the ER of INS-1 cells. There are some useful controls but the specificity of ensemble formation would be better determined using other transmembrane proteins rather than peripheral proteins (eg Golgi 58K).

      We thank the reviewer for their thoughtful feedback and for recognizing the strength of our proximity ligation assay data supporting BK–Ca<sub>V</sub>1.3 hetero-cluster formation at both the plasma membrane and intracellular compartments. As for specificity controls, we appreciate the suggestion to use transmembrane markers. To strengthen our conclusion, we have performed an additional experiment comparing the number of PLA puncta formed by the interaction of Ca<sub>V</sub>1.3 and BK channels with the number of PLA puncta formed by the interaction of Ca<sub>V</sub>1.3 channels and ryanodine receptors in INS-1 cells. As shown in the figure below, the number of interactions between Ca<sub>V</sub>1.3 and BK channels is significantly higher than that between Ca<sub>V</sub>1.3 and RyR<sub>2</sub>. Of note, RyR<sub>2</sub> is a protein resident of the ER. These results provide additional evidence of the existence of endogenous complex formation in INS-1 cells. We have added this figure as a supplement.

      (2) Ensemble assembly was also analysed using super-resolution (dSTORM) imaging in INS-1 cells. In these cells only 7.5% of BK and CaV particles (endogenous?) co-localise that was only marginally above chance based on scrambled images. More detailed quantification and validation of potential 'ensembles' needs to be made for example by exploring nearest neighbour characteristics (but see point 4 below) to define proportion of ensembles versus clusters of BK or Cav1.3 channels alone etc. For example, it is mentioned that a distribution of distances between BK and Cav is seen but data are not shown.

      We thank the reviewer for this comment. To address the request for more detailed quantification and validation of ensembles, we performed additional analyses:

      Proportion of ensembles vs isolated clusters: We quantified clusters within 200 nm and found that 37 ± 3% of BK clusters are near one or more CaV1.3 clusters, whereas 15 ± 2% of CaV1.3 clusters are near BK clusters (Figure 8–Supplementary 1A).

      Distance distribution: As shown in Figure 8–Supplementary 1B, the nearest-neighbor distance distribution for BK-to-CaV1.3 in INS-1 cells (magenta) is shifted toward shorter distances compared to randomized controls (gray), supporting preferential localization of BK–CaV1.3 hetero-clusters.

      Together, these analyses confirm that BK–CaV1.3 ensembles occur more frequently than expected by chance and exhibit an asymmetric organization favoring BK proximity to CaV1.3 in INS-1 cells. We have included these data and figures in the revised manuscript, as well as description in the Results section. 
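
      The two quantities described above (the fraction of clusters with a partner within 200 nm, and the comparison against randomized positions) can be illustrated with a minimal sketch. This is not the authors' analysis pipeline: the function names, the uniform randomization over a square field, and the toy coordinates are all assumptions made for illustration.

```python
import math
import random

def nearest_distances(sources, targets):
    """For each source point, return the distance to its closest target point."""
    return [min(math.dist(s, t) for t in targets) for s in sources]

def fraction_within(sources, targets, radius):
    """Fraction of source points with at least one target within `radius`."""
    distances = nearest_distances(sources, targets)
    return sum(d <= radius for d in distances) / len(distances)

def randomized_fraction(sources, n_targets, radius, size, rng, n_iter=200):
    """Mean fraction expected when targets are scattered uniformly
    over a size x size field (an assumed null model)."""
    total = 0.0
    for _ in range(n_iter):
        scattered = [(rng.uniform(0, size), rng.uniform(0, size))
                     for _ in range(n_targets)]
        total += fraction_within(sources, scattered, radius)
    return total / n_iter

# Toy centroid coordinates in nm (hypothetical, not data from the study).
rng = random.Random(0)
cav = [(rng.uniform(0, 5000), rng.uniform(0, 5000)) for _ in range(30)]
# Half of the BK centroids sit 50 nm from a CaV centroid; the rest are random.
bk = [(x + 50.0, y) for x, y in cav[:15]]
bk += [(rng.uniform(0, 5000), rng.uniform(0, 5000)) for _ in range(15)]

observed = fraction_within(bk, cav, radius=200.0)
expected = randomized_fraction(bk, len(cav), 200.0, 5000.0, rng)
print(f"observed: {observed:.2f}, randomized: {expected:.2f}")
```

      Swapping the arguments, as in `fraction_within(cav, bk, 200.0)`, gives the measurement in the opposite direction, which is how an asymmetry such as the 37% versus 15% reported above would show up.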

      (3) The evidence that the intracellular ensemble formation is in large part driven by cotranslation, based on co-localisation of mRNAs using RNAscope, requires additional critical controls and analysis. The authors now include data of co-localised BK protein that is suggestive but does not show co-translation. Secondly, while they have improved the description of some controls mRNA co-localisation needs to be measured in both directions (eg BK - SCN9A as well as SCN9A to BK) especially if the mRNAs are expressed at very different levels. The relative expression levels need to be clearly defined in the paper. Authors also use a randomized image of BK mRNA to show specificity of co-localisation with Cav1.3 mRNA, however the mRNA distribution would not be expected to be random across the cell but constrained by ER morphology if cotranslated so using ER labelling as a mask would be useful?

      We thank the reviewer for these constructive suggestions. We measured mRNA colocalization in both directions as recommended. As shown in the figure below, colocalization between KCNMA1 and SCN9A transcripts was comparable in both directions, with no statistically significant difference, supporting the specificity of the observed associations. We decided not to add this to the original figure to keep the figure simple. 

      We agree that co-localization of BK protein with BK mRNA is not conclusive evidence of co-translation, and we do not intend to mislead readers in our conclusion. Consequently, we were careful to avoid the term co-translation in the Results section and added the word “potential” when referring to co-translation in the Discussion section. We added a sentence in the Discussion to caution our interpretation: “It is important to note that while our data suggest mRNA coordination, additional experiments are required to directly assess co-translation.”

      Author response image 1.

      (4) The authors attempt to define if plasma membrane assemblies of BK and CaV occur soon after synthesis. However, because the expression of BK and CaV occur at different times after transient transfection of plasmids more definitive experiments are required. For example, using inducible constructs to allow precise and synchronised timing of transcription. This would also provide critical evidence that co-assembly occurs very early in synthesis pathways - ie detecting complexes at ER before any complexes 

      We appreciate the reviewer’s insightful suggestion regarding the use of inducible constructs to synchronize transcription timing. This is an excellent approach and would allow direct testing of whether co-assembly occurs early in the synthesis pathway, including detection of complexes at the ER prior to plasma membrane localization. These experiments are beyond the scope of the present work but represent an important direction for future studies.

      We have added the following sentence to the Discussion section (page 24) to highlight this idea. “Future experiments using inducible constructs to precisely control transcription timing will enable more precise quantification of heterocluster formation in the ER compartment prior to plasma membrane insertion and reduce the variability introduced by differences in expression timing after plasmid transfection.” 

      (5) While the authors have improved the definition of hetero-clusters etc it is still not clear in super-resolution analysis, how they separate a BK tetramer from a cluster of BK tetramers with the monoclonal antibody employed ie each BK channel will have 4 binding sites (4 subunits in tetramer) whereas Cav1.3 has one binding site per channel. Thus, how do authors discriminate between a single BK tetramer (molecular cluster) with potential 4 antibodies bound compared to a cluster of 4 independent BK channels.

      We appreciate the reviewer’s thoughtful comment regarding the interpretation of super-resolution data. We agree that distinguishing a single BK tetramer from a cluster of multiple BK channels is challenging when using an antibody that can bind up to four sites per channel. To clarify, our analysis does not attempt to resolve individual subunits within a tetramer; rather, it focuses on the nanoscale spatial proximity of BK and Ca<sub>V</sub>1.3 signals.

      We want to note that this limitation applies only to the super-resolution maps in Figures 8C and 9D and does not affect Airyscan-based analyses or measurements of BK–Ca<sub>V</sub>1.3 proximity.

      To address how we might distinguish between a single BK tetramer and a cluster of multiple BK channels, we considered two contrasting scenarios. In the first case, we assume that all four α-subunits within a tetramer are labeled. Based on cryoEM structures, a BK tetramer measures approximately 13 nm × 13 nm (≈169 nm²). Adding two antibody layers (primary and secondary) would increase the footprint by ~14 nm in each direction, resulting in an estimated area of ~41 nm × 41 nm (≈1681 nm²). Under this assumption, particles smaller than ~1681 nm² would likely represent individual tetramers, whereas larger particles would correspond to clusters of multiple tetramers. 
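
      The size threshold in the first scenario follows from simple arithmetic, sketched below. The sketch assumes that the ~14 nm antibody layer is added on both sides of the channel along each axis, which is the reading consistent with the 41 nm figure; the function name is illustrative.

```python
def labeled_footprint(channel_side_nm, antibody_layer_nm):
    """Side length and area of a square footprint after adding an
    antibody layer on both sides of the channel (assumed geometry)."""
    side = channel_side_nm + 2 * antibody_layer_nm
    return side, side ** 2

bare_side, bare_area = labeled_footprint(13, 0)   # unlabeled tetramer
side, area = labeled_footprint(13, 14)            # primary + secondary layers
print(bare_area, side, area)
```

      With the values quoted above this gives 169 nm² for the bare tetramer and 41 nm × 41 nm = 1681 nm² for the fully labeled one.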

      In the second scenario, we propose that steric constraints at the S9–S10 segment, where the antibody binds, limit labeling to a single antibody per tetramer. If true, the localization precision would approximate 14 nm × 14 nm—the combined size of the antibody complex and the channel—close to the resolution limit of the microscope. To test this, we performed a control experiment using two antibodies targeting the BK C-terminal domain, raised in different species and labeled with distinct fluorophores. Super-resolution imaging revealed that only ~12% of particles were colocalized, suggesting that most channels bind a single antibody.

      If multiple antibodies could bind each tetramer, we would expect much greater colocalization.

      Although these data are not included in the manuscript, we have added the following clarification to the Results section (page 19): “It is important to note that this technique does not allow us to distinguish between labeling of four BK α-subunits within a tetramer and labeling of multiple BK channel clusters. Hence, particles smaller than ~1680 nm² may represent either a single tetramer or a cluster. This limitation applies to Figures 8C and 9D and does not affect measurements of BK–Ca<sub>V</sub>1.3 proximity.”

      Author response image 2.

      (6) The post-hoc tests used for one way ANOVA and ANOVA statistics need to be defined throughout

      We thank the reviewer for highlighting the need for clarity regarding our statistical analyses. We have now specified the post-hoc tests used for all one-way ANOVA and ANOVA comparisons throughout the manuscript, and updated figure legends.

      Reviewer #3 (Public review):

      Summary:

      The authors present a clearly written and beautifully presented piece of work demonstrating clear evidence to support the idea that BK channels and Cav1.3 channels can co-assemble prior to their insertion in the plasma membrane.

      Strengths:

      The experimental records shown back up their hypotheses and the authors are to be congratulated for the large number of control experiments shown in the ms.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      The authors have sufficiently addressed the specific points previously raised and the manuscript has improved clarity in those aspects. My main concern, which still remains, is stated in the public review.

      Reviewer #3 (Recommendations for the authors):

      I am content that the authors have attempted to fully address my previous criticisms.

      I have only three suggestions

      (1) I think the word Homo-clusters at the bottom right of Figure 1 is erroneously included.

      We thank the reviewer for bringing this to our attention. The figure has been corrected accordingly.

      (2) The authors should, for completeness, refer to the beta, gamma and LINGO subunit families in the Introduction and include appropriate references:

      Knaus, H. G., Folander, K., Garcia-Calvo, M., Garcia, M. L., Kaczorowski, G. J., Smith, M., & Swanson, R. (1994). Primary sequence and immunological characterization of beta-subunit of high conductance Ca2+-activated K+ channel from smooth muscle. The Journal of Biological Chemistry, 269(25), 17274-17278.

      Brenner, R., Jegla, T. J., Wickenden, A., Liu, Y., & Aldrich, R. W. (2000a). Cloning and functional characterization of novel large conductance calcium-activated potassium channel beta subunits, hKCNMB3 and hKCNMB4. The Journal of Biological Chemistry, 275(9), 6453-6461.

      Yan, J & R.W. Aldrich. (2010) LRRC26 auxiliary protein allows BK channel activation at resting voltage without calcium. Nature. 466(7305):513-516

      Yan, J & R.W. Aldrich. (2012) BK potassium channel modulation by leucine-rich repeat-containing proteins. Proceedings of the National Academy of Sciences 109(20):7917-22

      Dudem, S, Large RJ, Kulkarni S, McClafferty H, Tikhonova IG, Sergeant, GP, Thornbury, KD, Shipston, MJ, Perrino BA & Hollywood MA (2020). LINGO1 is a novel regulatory subunit of large conductance, Ca2+-activated potassium channels. Proceedings of the National Academy of Sciences 117 (4) 2194-2200

      Dudem, S., Boon, P. X., Mullins, N., McClafferty, H., Shipston, M. J., Wilkinson, R. D. A., Lobb, I., Sergeant, G. P., Thornbury, K. D., Tikhonova, I. G., & Hollywood, M. A. (2023). Oxidation modulates LINGO2-induced inactivation of large conductance, Ca2+-activated potassium channels. The Journal of Biological Chemistry, 299 (3) 102975.

      We agree with the reviewer’s suggestion and have revised the Introduction to include references to the beta, gamma, and LINGO subunit families. Appropriate citations have been added to ensure completeness and contextual relevance.

      Additionally, BK channels are modulated by auxiliary subunits, which fine-tune BK channel gating properties to adapt to different physiological conditions. The β, γ, and LINGO1 subunits each contribute distinct structural and regulatory features: β-subunits modulate Ca²⁺ sensitivity and can induce inactivation; γ-subunits shift voltage-dependent activation to more negative potentials; and LINGO1 reduces surface expression and promotes rapid inactivation (18-24). These interactions ensure precise control over channel activity, allowing BK channels to integrate voltage and calcium signals dynamically in various cell types.

      (3) I think it may be more appropriate to include the sentence "The probes against the mRNAs of interest and tested in this work were designed by Advanced Cell Diagnostics." (P16, right hand column, L12-14) in the appropriate section of the Methods, rather than in Results.

      We thank the reviewer for this helpful suggestion. In response, we have relocated the sentence to the appropriate section of the Methods, where it now appears with relevant context.

    1. Note: This response was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity):

      Summary:

      The manuscript titled "Unravelling the Progression of the Zebrafish Primary Body Axis with Reconstructed Spatiotemporal Transcriptomics" presents a comprehensive analysis of the development of the primary body axis in zebrafish by integrating bulk RNA-seq, 3D images, and Stereo-Seq. The authors first clearly demonstrate the application of Palette for integrating RNA-seq and Stereo-Seq using published spatial transcriptomics data of Drosophila embryos. Subsequently, they produced serial bulk RNA-seq data for certain developmental stages of Danio rerio embryos and utilized published Stereo-Seq data. Through robust validation, the authors observe the molecular network involved in AP axis formation. While the authors show that integrating bulk RNA-seq data with Stereo-Seq improves spatial resolution, additional proof is required to demonstrate the extent of this improvement.

      Response: We thank the reviewer for the positive feedback on our Palette pipeline, zSTEP construction and analysis of primary body axis development. We appreciate the constructive suggestions provided, which we can implement to improve our manuscript. As pointed out by the reviewer, some analysis procedures were not described in sufficient detail. To address this, we have added more explanatory text and additional schematic diagrams to make the methods clearer and more understandable. We also thank the reviewer for the meticulous reading and for reminding us to include parameters, references and essential text, which significantly improve the manuscript quality and make the manuscript more rigorous. Furthermore, as suggested by the reviewer, the extent of the improvement in spatial resolution was not clearly demonstrated in the manuscript. Therefore, we have provided an additional figure to show the original expression on the stacked Stereo-seq slices and 3D live image compared to the expression from zSTEP, and the results indicate that zSTEP provides better, more continuous expression patterns. We still have two remaining tasks that are expected to be completed within the next month. We hope our responses have addressed the concerns raised by the reviewer, and we are pleased to provide any additional proof as needed.

      Major Comments:

      1. Lines 66-68: Discuss the limitations of existing tools and explicitly state the advantages of using Palette.

      Response: We thank the reviewer for the valuable suggestion. We have added the following new texts after line 68 to emphasize the features and advantages of Palette.

      "Newly developed tools are committed to integrating bulk and/or scRNA-seq data with ST data to enhance spatial resolution, focusing on expression at the spot level. However, gene expression patterns are closely correlated to the biological functions and are more critical for understanding biological processes. Therefore, a tool focusing on inferring spatial gene expression patterns would be desirable."

      1. Body Pattern Genes Analysis: For both Drosophila and Danio rerio, it would be valuable to examine body pattern genes in Stereo-Seq and apply Palette to determine if the resolution of the segments improves or merges. The resolution of the A-P axis is convincing, but further evidence for other segments would be beneficial.

      Response: We thank the reviewer for the suggestions. For the Drosophila data, we only used two adjacent slices for Palette performance assessment, and thus were only able to evaluate the expression patterns within the slice.

      For the zebrafish data, although we have constructed zSTEP as a 3D transcriptomic atlas, we have to admit that the left-right (LR) and dorsal-ventral (DV) patterning is not fully satisfactory. Here we show a section from the dorsal part of 16 hpf zSTEP that displays a relatively well-defined left-right pattern (Fig. 2). Along the left-right axis, the notochord cells are centrally located, flanked by somite cells on either side, with the outermost cells being pronephros.

      One reason for the limited LR and DV patterning is that the original annotation of the ST data does not clearly distinguish all the cell types. Another reason is likely due to the disordered cell positions when stacking ST slices. Thus, our zSTEP is most suitable for investigating the AP patterns, while the performances on LR and DV patterns may not achieve the same level of accuracy.

      See response letter for the figure.

      1. Figure 2d: Include the A-P line for which the intensity profile was plotted in the main figure, rather than just in the supplementary material. Additionally, consider simplifying the plot by not combining three lines into one, as it complicates the interpretation of observations.

      Response: We thank the reviewer for the helpful suggestions. We have updated Figure 2d and Figure S1b by adding an A-P line on each subfigure (Fig. 3). Additionally, as the reviewer suggested, we have separated the intensity plots so that each subfigure now includes a dedicated intensity plot along the A-P axis.

      See response letter for the figure.

      1. Drosophila Data Analysis: While the alignment and validation of Danio rerio sections are clearly explained, the analysis and validation of Drosophila data are insufficiently detailed. Provide a more thorough explanation of how the intensity profiles between BDGP in situ data and Stereo-Seq data are adjusted.

      Response: We thank the reviewer for raising this issue. To make the analysis procedure clearer, we have updated Figure 2a (Fig. 4) and added explanatory texts in the figure legends to describe the processing procedure for the Drosophila ST data.

      See response letter for the figure.

      Additionally, the following sentences have been added into the Methods section to describe the generation of the intensity profiles.

      "The intensity plot profiles along AP axis were generated through the following steps: The expression pattern plot images or in situ hybridization images were imported into ImageJ and converted to grayscale. The colour was then inverted, and a line of a certain width (here set as 10) was drawn across from the anterior part to the posterior part (Fig. S1a). The signal intensities along the width of the line were measured and imported into R for generating intensity plots."
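
      The quoted procedure can be sketched in a few lines. This is an illustration rather than the authors' ImageJ/R workflow: it assumes the image is already grayscale (0 to 255), that the line runs horizontally with anterior on the left, and that the signal is averaged across the line width; the function name and toy image are hypothetical.

```python
def ap_intensity_profile(gray, row_center, line_width=10):
    """Mean inverted intensity at each column within a horizontal band
    of `line_width` rows centred on `row_center`."""
    half = line_width // 2
    top = max(0, row_center - half)
    bottom = min(len(gray), top + line_width)
    band = gray[top:bottom]
    profile = []
    for col in range(len(gray[0])):
        # Invert so that dark staining on a light background gives a high signal.
        inverted = [255 - row[col] for row in band]
        profile.append(sum(inverted) / len(inverted))
    return profile

# Toy image: white background with a dark stripe at columns 2 and 3.
image = [[255, 255, 55, 55, 255, 255] for _ in range(20)]
profile = ap_intensity_profile(image, row_center=10, line_width=10)
print(profile)
```

      The resulting list can be plotted against column position to obtain an intensity plot along the AP axis.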

      1. Figure 3d: Present a plot with the expected expression profiles of the three genes if the embryo is aligned as anticipated.

      Response: We thank the reviewer for this helpful suggestion, which improves the clarity of our manuscript. We have added the following subfigure in as Figure 3d (Fig. 5) to show the expected expression profiles of the three midline genes along left-right axis.

      See response letter for the figure.

      1. Analysis Without Palette: Between lines 277-438, the outcome of using Palette with bulk RNA-seq and Stereo-Seq is convincing. However, consider the following:

      o What would be the observations if the analysis were conducted solely with Stereo-Seq data, without incorporating bulk RNA-seq data and employing Palette?

      Response: We thank the reviewer for raising this important question. Here we show the comparison of ST expression on stacked Stereo-seq slices, ST expression projected on 3D live images, and the Palette-inferred expression (Fig. 6). The stacked ST slices do not fully reflect the zebrafish morphology, and the gene expression appears sparse, making it look massive (the first row). After projecting ST expression onto the live image, the expression patterns can be observed on zebrafish morphology, but the expression is still sparsely distributed in spots (the second row). In contrast, the expression patterns captured by Palette in zSTEP are more continuous (the third row) and more similar to the observations in in situ hybridization images (the fourth row). We are considering putting these analyses into a supplementary figure.

      See response letter for the figure.

      o This study uses only Stereo-Seq as the spatial transcriptomics reference. It would strengthen the argument to use at least one other spatial transcriptomics method, such as Visium or MERFISH, in conjunction with bulk RNA-seq and Palette, to demonstrate whether Palette consistently improves gene expression resolution.

      Response: We thank the reviewer for raising this insightful question. To demonstrate the broad applicability of Palette, it would be necessary to test Palette performance using different types of ST references. We plan to perform extra analyses to evaluate Palette performance using Visium and MERFISH data as ST references, respectively. Additionally, our Palette pipeline only takes the overlapped genes for inference. As only hundreds of genes can be detected by MERFISH, Palette can only infer the expression patterns of these genes. As mentioned in the work of Liu et al. (2023), MERFISH can independently resolve distinct cell types and spatial structures, and thus we believe Palette will also perform well when using MERFISH as the ST reference. We have already started the analyses and expect to complete them within the next month. We will add the analyses to the GitHub repository as separate tutorials.

      Reference:

      Liu, J. et al. Concordance of MERFISH spatial transcriptomics with bulk and single-cell RNA sequencing. Life Sci Alliance 6 (2023).

      1. PDAC Data Analysis: Provide a more detailed explanation of the PDAC data analysis and use appropriate colors in the tissue images to clearly distinguish cell types.

      Response: We thank the reviewer for the suggestions. We have updated the colours used in the tissue images to be consistent with the colours in the tissue clustering analysis. Additionally, we have added a subfigure to the supplementary figure (Fig. 7), with more explanatory text in the figure legends, to provide a more thorough explanation of the analysis.

      See response letter for the figure.

      1. Comparison with Other Methods: State the limitations of not using STitch3D and Spateo for alignment and explain why these methods were not employed.

      Response: We thank the reviewer for raising this constructive comment. We fully agree that using published alignment algorithms would be helpful in our analysis. Currently, the slice alignment is adjusted manually, and thus the main limitation of not using these tools is that manual operation may introduce bias compared to the alignment generated by a computational algorithm. STitch3D and Spateo are not included in this study for two reasons. First, these two tools were posted only recently, and our analyses were largely completed before that. Therefore, we only mentioned these tools in the Discussion section. Second, we do not want to embed too many external tools into our analysis, which may make the pipeline harder for researchers to operate. Specifically, STitch3D and Spateo are configured to run in a Python environment, while Palette is based on R packages. Moreover, without these tools, our current manual alignment also achieves the desired performance. However, we value this suggestion by the reviewer and therefore plan to further compare the performance of manual alignment against the two alignment tools mentioned. At present, we have a preliminary comparison scheme and have collected relevant datasets. We hope to complete this analysis within the next 1 to 2 weeks.

      Minor Comments:

      1. References: Add references to the statements in lines 51-53.

      Response: We thank the reviewer for reminding us of the missing references. We have added the works of Junker et al. (2014), Liu et al. (2022), Chen et al. (2022), Wang et al. (2022), Shi et al. (2023) and Satija et al. (2015) as references in line 53 as follows.

      "Thus, great efforts are ongoing to construct gene expression maps of these models with higher resolution, depth, and comprehensiveness1-6."

      References:

      1. Junker, J.P. et al. Genome-wide RNA Tomography in the zebrafish embryo. Cell 159, 662-675 (2014).
      2. Liu, C. et al. Spatiotemporal mapping of gene expression landscapes and developmental trajectories during zebrafish embryogenesis. Dev Cell 57, 1284-1298 e1285 (2022).
      3. Chen, A. et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185, 1777-1792 e1721 (2022).
      4. Wang, M. et al. High-resolution 3D spatiotemporal transcriptomic maps of developing Drosophila embryos and larvae. Dev Cell 57, 1271-1283 e1274 (2022).
      5. Shi, H. et al. Spatial atlas of the mouse central nervous system at molecular resolution. Nature 622, 552-561 (2023).
      6. Satija, R. et al. Spatial reconstruction of single-cell gene expression data. Nature biotechnology 33, 495-502 (2015).

      1. Scientific Name Consistency: Ensure consistency in using either "Danio rerio" or "zebrafish" throughout the manuscript.

      Response: We thank the reviewer for this suggestion. We have changed "Danio rerio" to "zebrafish" to make "zebrafish" consistent throughout the manuscript.

      1. Related References: Include the following relevant references:

      o https://academic.oup.com/bib/article/25/4/bbae316/7705532

      o https://www.life-science-alliance.org/content/6/1/e202201701

      Response: We thank the reviewer for bringing these two relevant works to us. Baul et al. (2024) presented STGAT leveraging Graph Attention Networks for integrating spatial transcriptomics and bulk RNA-seq, and Liu et al. (2023) demonstrated the concordance of MERFISH ST with bulk and single-cell RNA-seq. Both are excellent works and relevant to our work. We have added these two references in line 61 and line 68, respectively.

      References:

      Baul, S. et al. Integrating spatial transcriptomics and bulk RNA-seq: predicting gene expression with enhanced resolution through graph attention networks. Brief Bioinform 25 (2024).

      Liu, J. et al. Concordance of MERFISH spatial transcriptomics with bulk and single-cell RNA sequencing. Life Sci Alliance 6 (2023).

      1. Figure 1a: In the Venn diagram, include the number of genes in the bulk and Stereo-Seq datasets, as well as the number of overlapping genes.

      Response: We thank the reviewer for reminding us to include these important numbers. We have added the following sentences to the Methods section to provide the gene numbers (Fig. 8). Because the Venn diagram in Figure 1a serves as a schematic representation, we did not include the gene numbers there, as these may vary depending on the actual data.

      "Palette was performed on the aligned slices using the overlapped genes. For the 10 hpf embryo, there were 24,658 genes in the bulk data, 18,698 genes in the Stereo-seq data, and 16,601 overlapped genes. For the 12 hpf embryo, there were 23,018 genes in the bulk data, 18,948 genes in the Stereo-seq data, and 16,401 overlapped genes. For the 16 hpf embryo, there were 24,357 genes in the bulk data, 23,110 genes in the Stereo-seq data, and 19,539 overlapped genes."
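
      The gene counts quoted above correspond to a set intersection between the two gene lists. The sketch below illustrates that step only; it is not Palette's code (Palette is R-based), and the function name and the short toy gene lists are assumptions.

```python
def overlapped_genes(bulk_genes, st_genes):
    """Return per-dataset gene counts and the sorted list of shared genes."""
    bulk, st = set(bulk_genes), set(st_genes)
    shared = sorted(bulk & st)
    counts = {"bulk": len(bulk), "st": len(st), "overlap": len(shared)}
    return counts, shared

# Toy gene lists (illustrative only, not the study's data).
bulk_genes = ["tbxta", "sox2", "myod1", "shha", "pax2a"]
st_genes = ["sox2", "shha", "pax2a", "egr2b"]

counts, shared = overlapped_genes(bulk_genes, st_genes)
print(counts, shared)
```

      Only the genes in `shared` would be carried forward for inference, which is why the overlap (for example 16,601 genes at 10 hpf) is smaller than either input.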

      See response letter for the figure.

      1. Figure 1 Improvement: Enlarge Figure 1 and reduce repetitive elements, such as parts of the deconvolution and Figure 1b.

      Response: We thank the reviewer for the helpful suggestion. We agree with the reviewer that the deconvolution sections appear repetitive. We have updated Figure 1 (Fig. 9) by replacing these repetitive elements with a clearer and simpler diagram.

      See response letter for the figure.

      1. Figure 3f: Explain the black discontinuous line in the plot.

      Response: We thank the reviewer for the reminder and apologize for the missing explanation. We have added an explanation of the black discontinuous line to the legend of Figure 3 (Fig. 10).

      See response letter for the figure.

      1. Line 610: State the percentage of unpaired imaging spots.

      Response: We thank the reviewer for the reminder and apologize for not including the numbers of paired and unpaired spots. We have added the number of paired spots, with their percentage of the total spots, to the Methods section as follows.

      "The numbers of mapped spots for the 10 hpf, 12 hpf and 16 hpf embryos are 15,379 (69.4% of the total spots), 14,697 (70.5% of the total spots) and 21,605 (77.2% of the total spots), respectively."

      1. Lines 616-618: Specify the unit for the spot diameter.

      Response: We thank the reviewer for the reminder. Again, we apologize for not including the spot diameter information in the previous version of the manuscript. We have added the spot diameter to the Methods section as follows.

      "In the Stereo-seq data, each spot contained 15 × 15 DNA nanoball (DNB) spots (The diameter of each spot is near 10 μm)."

      Reviewer #1 (Significance):

      This algorithm will be useful not only for the field of developmental biology but also for wider applications in spatial omics. Although I have expertise in spatial omics technology development, my understanding of computational biology is limited, which restricts my ability to fully evaluate the Palette algorithm presented in this paper.

      Response: We thank the reviewer for recognizing our work and greatly appreciate the constructive suggestions. Although the reviewer acknowledged limited expertise in computational biology, the comments are highly professional and valuable. Following these suggestions, we have not only included more explanatory text and figures to make the analysis procedures clearer, but also supplied the important parameters that were missing from our previous manuscript. We have also provided an extra figure to demonstrate the improvements of zSTEP on gene expression patterns. We believe that our work is now more rigorous and easier to follow, and we will continue working to resolve the remaining issues as planned. We thank the reviewer again.

      Reviewer #2 (Evidence, reproducibility and clarity):

      The authors of the study introduce the Palette method, a novel approach designed to infer spatial gene expression patterns from bulk RNA-sequencing (RNA-seq) data. This method is complemented by the development of the DreSTEP 3D spatial gene expression atlas of zebrafish embryos, establishing a comprehensive resource for visualizing gene expression and investigating spatial cell-cell interactions in developmental biology.

      Response: We sincerely appreciate the reviewer's positive feedback on our Palette pipeline and the zSTEP 3D spatial expression atlas of zebrafish embryos. We also thank the reviewer for the professional comments and constructive suggestions. The reviewer raised the concerns from the aspect of algorithm design and computational biology, which we did not address well in our previous manuscript. We agree with the reviewer that we did not clarify the selection criteria of the parameters in detail, and we are now working on the additional analyses to address this issue.

      We also agree with the reviewer that we did not provide enough discussion of the strategies used in the pipeline, the features of Palette, and the application scenarios of Palette and zSTEP. Stating these aspects clearly is important if our tools are to be widely used. In this revised version, we have added paragraphs to the Discussion section to address this issue. Additionally, we acknowledge that we did not adequately demonstrate the computational efficiency and computational requirements, which are important for researchers. We are working on additional analyses to address this issue.

      Finally, we thank the reviewer again for the professional and constructive suggestions. These suggestions are addressable, and by following them, we believe our manuscript will see a significant improvement, especially in the Palette pipeline part, making the pipeline more rigorous and easier to access. We are confident that we can complete the planned additional tasks within the next 1-2 months.

      1. The efficacy of the Palette method may be compromised by its dependency on the quality of the reference spatial transcriptomics data. As highlighted in the study, variations in data quality can lead to significant challenges in reconstructing accurate spatial expression patterns from bulk data. This underscores the necessity of evaluating quality parameters, such as the number of gene detections and spatial resolution, to ensure reliable outcomes. Additional studies should rigorously assess how these quality factors influence the accuracy and efficiency of the algorithm in various data contexts, particularly under diverse conditions of gene detection.

      Response: We thank the reviewer for this valuable suggestion. We agree that the quality of the reference ST data may greatly influence the performance and efficacy of Palette, and we have added paragraphs to the Discussion section on the impact of ST data quality on Palette performance. As mentioned by the reviewer, gene detection and spatial resolution are two important parameters that can influence Palette performance. Low gene detection may impair the clustering process, so that the cell types of spots are not well distinguished. To evaluate the performance of Palette when the ST data show low gene detection, we plan to apply Palette using MERFISH data as the ST reference, which captures only hundreds of genes. Moreover, we will investigate the impact of spatial resolution by merging ST spots to simulate lower-resolution scenarios, and the impact of gene detection by randomly reducing the detected genes. By comparing the expression patterns inferred with ST data of different spatial resolutions or different numbers of detected genes, we can better assess the performance of Palette and provide guidance to researchers on the ST data requirements for optimal performance. Because of the limited response time, these analyses will take another month to complete after this round of revision.
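
      As a rough illustration of the two planned degradation tests, the sketch below merges neighbouring spots and randomly removes genes. It is a minimal example under stated assumptions, not the Palette implementation: the ST data are assumed to be a mapping from integer grid coordinates to per-gene counts, and `merge_spots` and `drop_genes` are hypothetical helper names introduced only for this example.

```python
import random
from collections import defaultdict


def merge_spots(spots, factor):
    """Simulate lower spatial resolution by summing counts over factor x factor bins.

    `spots` maps (x, y) integer grid coordinates to {gene: count} dicts
    (an assumed layout, not Palette's own data structure).
    """
    merged = defaultdict(lambda: defaultdict(int))
    for (x, y), counts in spots.items():
        bin_xy = (x // factor, y // factor)
        for gene, count in counts.items():
            merged[bin_xy][gene] += count
    return {xy: dict(counts) for xy, counts in merged.items()}


def drop_genes(spots, keep_fraction, seed=0):
    """Simulate lower gene detection by keeping a random subset of genes."""
    genes = sorted({g for counts in spots.values() for g in counts})
    rng = random.Random(seed)
    n_keep = max(1, round(len(genes) * keep_fraction))
    kept = set(rng.sample(genes, n_keep))
    return {
        xy: {g: c for g, c in counts.items() if g in kept}
        for xy, counts in spots.items()
    }


# Toy 2 x 2 grid with three genes; all counts are made up for illustration.
toy = {
    (0, 0): {"geneA": 1, "geneB": 2},
    (0, 1): {"geneA": 3},
    (1, 0): {"geneC": 4},
    (1, 1): {"geneB": 5, "geneC": 6},
}
coarse = merge_spots(toy, factor=2)            # a single merged spot
sparse = drop_genes(toy, keep_fraction=2 / 3)  # two of the three genes survive
```

      Running the inference on `coarse` or `sparse` instead of the original reference, and comparing the resulting patterns, would correspond to the comparison described above.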

      1. The methodology raises pertinent questions regarding how the clustering results from different algorithms may affect the reconstructions by the Palette method. The authors would better provide a detailed discussion/comparison of clustering processes that optimize the reconstruction of spatial patterns, ensuring precision in the downstream analyses.

      Response: We thank the reviewer for the constructive comments. We agree that differences in clustering results would affect Palette's inference. In our Palette pipeline, rather than developing a new clustering methodology, we employ BayesSpace for spot clustering, which considers both the transcriptional similarity of spots and their neighbouring structure. Researchers may therefore adjust the parameters in the BayesSpace package to achieve optimal clustering results. In most cases, spot identities are assigned through UMAP analysis, which considers only transcriptional differences and not spatial information. This kind of clustering strategy can lead to an intricate arrangement of spots belonging to different clusters and may result in sparse gene expression in the Palette outcome, unlike the patterns in bona fide tissues. A suitable clustering strategy will therefore help capture local patterns.

      Moreover, our Palette pipeline can also use clustering results derived from tissue histomorphology. Histomorphology-based clustering is a good choice, as it is closer to the real tissue organization. The following figure (Fig. 11) displays the Palette performance on PDAC datasets using both spatial clustering and histomorphology clustering strategies. The result using histomorphology clustering captures the weak pattern (indicated by the red circle) that was missed when using spatial clustering (Fig. 11d).

      See response letter for the figure.

      1. The choice to utilize only highly expressed genes in the initial stages of the Palette algorithm also warrants further exploration. Addressing the criteria for determining which genes qualify as "highly expressed" and outlining robust cutoff will enhance the algorithm's rigor and applicability. Similarly, in the iterative estimation of gene expression across spatial spots, establishing optimal iteration conditions is crucial. Implementing a loss function may offer a systematic method for concluding iterations, thus refining computational efficiency.

      Response: We thank the reviewer for the professional suggestions. As pointed out by the reviewer, the selection of highly expressed genes and the number of iterations are two important parameters in our pipeline. The definition and the number of highly expressed genes are important for achieving satisfactory clustering performance. We tested the impact of different numbers of highly expressed genes on clustering performance in our preliminary analyses, but we did not summarize these tests or specify the parameters. Therefore, we plan to include a supplementary figure showing the clustering performance under different definitions and different numbers of highly expressed genes. For the iteration conditions, we have tested different iteration numbers to find one at which the expression in each spot becomes stable. The following figure (Fig. 1) shows the results of running Palette with different numbers of iterations. We randomly selected 20 cells and compared their expression across the tests. The results indicate that for an ST dataset with 819 spots, the expression in each spot becomes nearly stable after 5,000 iterations. We previously did not consider computational efficiency, and the reviewer makes a valuable suggestion to implement a loss function to determine the optimal number of iterations. We greatly appreciate this suggestion and plan to apply a loss function to determine the optimal number of iterations for ST datasets of different sizes. This will provide guidance for researchers in selecting the number of iterations and will improve computational efficiency.

      See response letter for the figure.
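
      One way a loss-based stopping rule of the kind suggested by the reviewer could look is sketched below. This is only an illustration under assumptions: the loss is taken to be the mean absolute change in per-spot expression between successive iterations, and `smooth` is a placeholder update (neighbour averaging on a line of spots), not the update rule used in Palette. The names `mean_abs_change` and `iterate_until_stable` are hypothetical.

```python
def mean_abs_change(previous, current):
    """Loss: mean absolute change in per-spot expression between iterations."""
    return sum(abs(c - p) for p, c in zip(previous, current)) / len(current)


def iterate_until_stable(expression, update, tol=1e-6, max_iter=5000):
    """Apply `update` repeatedly and stop once the loss drops below `tol`.

    Returns the final estimate and the number of iterations actually run.
    """
    for iteration in range(1, max_iter + 1):
        updated = update(expression)
        loss = mean_abs_change(expression, updated)
        expression = updated
        if loss < tol:
            return expression, iteration
    return expression, max_iter


def smooth(values):
    """Placeholder update (NOT Palette's rule): average each spot with its neighbours."""
    n = len(values)
    return [
        (values[max(i - 1, 0)] + values[i] + values[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]


# Five spots on a line with made-up starting expression.
start = [0.0, 0.0, 9.0, 0.0, 0.0]
final, n_iter = iterate_until_stable(start, smooth)
print(f"stopped after {n_iter} iterations")
```

      With a rule of this kind, the number of iterations adapts to the dataset instead of being fixed in advance; the tolerance `tol` then becomes the parameter to report.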

      1. Performance metrics relating to processing speed and computational demands remain inadequately addressed in the current framework. Understanding how the Palette method scales across varying gene counts and bulk RNA-seq datasets will be essential for potential applications in larger biological contexts. Notably, the quantitative demands of analyzing 20,000 genes when processing 10, 100, or 1,000 bulk RNA profiles must be articulated to guide researchers in planning accordingly.

      Response: We thank the reviewer for this valuable and professional suggestion. In our previous analyses, we did not consider computational efficiency, processing speed, or computational demands, which are important information for potential users. To address this issue, we will first list our computer configuration. Under this configuration, we plan to run Palette on datasets with different numbers of overlapped genes or ST references with varying spot numbers, and then summarize the running times in a table. This will help researchers estimate the running time for their datasets and plan their analyses. We will begin soon and expect to complete the analyses within the next 1 to 2 months.
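
      A timing table of this kind could be collected with a small harness such as the sketch below. It is an assumption-laden illustration: `placeholder_workload` stands in for an actual Palette run and would need to be replaced by the real call, and `time_runs` is a hypothetical helper name.

```python
import time


def time_runs(run, sizes, repeats=3):
    """Return {size: best wall-clock seconds} for calling run(size)."""
    timings = {}
    for size in sizes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(size)
            best = min(best, time.perf_counter() - start)
        timings[size] = best
    return timings


def placeholder_workload(n_spots):
    """Stand-in for a Palette run on n_spots spots; replace with the real call."""
    return sum(i * i for i in range(n_spots * 100))


table = time_runs(placeholder_workload, sizes=[100, 1000, 10000])
for n_spots, seconds in table.items():
    print(f"{n_spots:>6} spots  {seconds:.4f} s")
```

      Reporting the best of several repeats reduces noise from other processes; the same loop can be run over gene numbers instead of spot numbers.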

      Minor opinions:

      1. Despite the promising advances offered by the zebrafish 3D reconstruction, there is a lack of details regarding numbers of the spatial transcriptomics (ST) data utilized, and the number of bulk RNA-seq data employed in the analyses. These parameters need to be clarified.

      Response: We thank the reviewer for reminding us of these parameters. We are sorry for not including these parameters in our previous manuscript. We have now included the numbers of bulk, ST and overlap genes in the Methods section as follows (Fig. 12).

      "Palette was performed on the aligned slices using the overlapped genes. For the 10 hpf embryo, there were 24,658 genes in the bulk data, 18,698 genes in the Stereo-seq data, and 16,601 overlapped genes. For the 12 hpf embryo, there were 23,018 genes in the bulk data, 18,948 genes in the Stereo-seq data, and 16,401 overlapped genes. For the 16 hpf embryo, there were 24,357 genes in the bulk data, 23,110 genes in the Stereo-seq data, and 19,539 overlapped genes."

      See response letter for the figure.

      1. Issues regarding spatial cell-cell communication, especially concerning interactions over longer distances, necessitate careful consideration. Introducing spatial distance constraints could help formulate more realistic models of cellular interactions, a vital aspect of embryonic development.

      Response: We thank the reviewer for this essential comment. We agree that spatial distance is an essential factor when investigating in vivo cell-cell communication during embryonic development. Therefore, in our analyses, we employed CellChat for spatial cell-cell communication analysis, which can infer and visualize spatial cell-cell communication networks for ST datasets, using spatial distance as a constraint on the computed communication probability. However, during our analyses, we observed interactions between cell types over longer distances, as mentioned by the reviewer. We then investigated how these long-distance interactions arose. Here, we show the FGF interaction between tail bud and neural crest cells from our spatial cell-cell analysis as an example, in which the distance between the two cell types appears quite large (Fig. 13). We labelled tail bud cells and neural crest cells on the selected midline section and observed that, although most neural crest cells are distributed anteriorly, a small number are located in the tail, close to the tail bud cells. Therefore, the observed interaction between tail bud and neural crest cells is likely due to their adjacent distribution in the tail region, while the anterior placement of the neural crest spot in the spatial cell-cell communication plot reflects the anterior positioning of most neural crest cells. As a result, the distances shown in the spatial cell-cell communication analysis are not the real distances between the two cell types.

      In most cases in our spatial cell-cell communication analyses, the observed interactions over longer distances are likely influenced by this visualization strategy. Additionally, pre-processing the dataset may enhance the performance of the analyses. Here we performed systematic analyses of the entire embryo, which can make the interactions between cell types appear massive. To investigate specific biological questions, researchers can subset cell types of interest or categorize them into different subtypes based on their positions.

      See response letter for the figure.

      1. Evaluation metrics such as the Adjusted Rand Index (ARI) and Root Mean Square Error (RMSE) represent critical tools for systematically measuring the similarity of inferred spatial patterns, yet their specific application within this context should be elaborated.

      Response: We thank the reviewer for recommending these two metrics. We have applied them to evaluate the similarity between the expression patterns (Fig. 14). Including these statistics makes our comparisons of expression patterns more rigorous and convincing. We have also added the following text to the Methods section to describe how the two values are calculated.

      "The Adjusted Rand Index (ARI) and Root Mean Square Error (RMSE) were used to evaluate the similarity of the expression patterns. The expression patterns of in situ hybridization images were considered as the expected values, and the expression patterns of ST data and inferred expression patterns were compared to the expected values. Common positions along the AP axis within all three expression profiles were used, and the RMSE were calculated based on the scaled intensity of these positions. Values greater than the threshold were set to 1; otherwise, they were set to 0, and the ARI was then calculated based on the intensity category. Higher ARI and lower RMSE indicate greater similarity."

      See response letter for the figure.
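
      The calculation described in the quoted Methods text can be sketched as follows. This is a minimal illustration, not the authors' code: the intensity profiles are made-up values at five common AP positions, and the threshold of 0.5 is an assumption, since the quoted text does not state the value used. The ARI is computed from the standard pair-counting formula.

```python
from collections import Counter
from math import comb, sqrt


def rmse(expected, observed):
    """Root mean square error between two scaled intensity profiles."""
    return sqrt(sum((e - o) ** 2 for e, o in zip(expected, observed)) / len(expected))


def binarize(values, threshold):
    """Set values greater than the threshold to 1, otherwise 0."""
    return [1 if v > threshold else 0 for v in values]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index from the contingency table of two labelings."""
    sum_cells = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    sum_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(len(labels_a), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # both labelings are trivial and identical
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Made-up scaled intensities at five common positions along the AP axis.
ish = [0.0, 0.1, 0.8, 1.0, 0.2]        # in situ hybridization (expected values)
inferred = [0.0, 0.2, 0.9, 1.0, 0.1]   # inferred expression pattern
st = [0.3, 0.0, 0.6, 0.4, 0.7]         # ST expression pattern

threshold = 0.5  # assumed value for illustration
rmse_inferred = rmse(ish, inferred)
rmse_st = rmse(ish, st)
ari_inferred = adjusted_rand_index(binarize(ish, threshold), binarize(inferred, threshold))
ari_st = adjusted_rand_index(binarize(ish, threshold), binarize(st, threshold))
```

      In this toy case the inferred profile has the lower RMSE and the higher ARI, which is the direction the quoted text describes as greater similarity.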

      1. The study's limitations surrounding ST data quality cannot be overstated. Discussing scenarios where only limited or poor-quality ST data are available will be crucial for guiding future studies. Furthermore, a clear explanation of how enhanced specificity and accuracy translate into tangible biological insights is essential for demystifying the underlying mechanisms driving developmental processes.

      Response: We thank the reviewer for raising this essential suggestion. We have realized that in our previous manuscript, our discussion on the advantages and limitations of Palette and zSTEP was neither broad nor detailed enough.

      Therefore, in our revised manuscript, we have added the following paragraphs to further discuss the advantages and limitations of Palette and zSTEP, as well as the potential application of zSTEP in developmental biology.

      In this section, we have again emphasized the impact of ST data quality on the performance of Palette and zSTEP, and then compared Palette with the strategy that uses well-established marker genes to infer spatial information. We demonstrated that although Palette cannot achieve single-cell resolution, it captures the major expression patterns, which are closely correlated with biological functions and critical for embryonic development. We also discussed that zSTEP is not only a valuable tool for investigating gene expression patterns, but also has potential for evaluating the reaction-diffusion model to investigate the complicated and well-choreographed pattern formation during embryonic development.

      Having provided a more comprehensive discussion of Palette and zSTEP, we think that researchers will better understand the application scenarios of our inference pipeline and our datasets. We hope our study can assist and inspire further research in the field of spatial transcriptomics and developmental biology.

      "Thirdly, the performance of Palette and zSTEP heavily relied on the quality of ST data. If the quality of ST data is not of sufficient quality, the low-expression genes may not be detected or only appear in very few scattered spots, and the performance of spot clustering could also be affected. Moreover, in this study, for example, the Stereo-seq data of 12 hpf zebrafish embryo had fewer slices on the right side (Fig. S3b), resulting in more blank spots in the right part of zSTEP for the 12 hpf embryo. However, with the ongoing advancements in spatial resolution and data quality, the performance of Palette is expected to be enhanced and demonstrate even greater potential for analysing spatiotemporal gene expression.

      On the other hand, compared to the brilliant strategy that infers spatial information of scRNA-seq data from well-established genes, our Palette pipeline cannot achieve single cell resolution. However, our Palette pipeline is based on the ST reference, and thus preserves the real positional relationships between spots. Furthermore, the focus of our pipeline is to infer the gene expression patterns, which are closely correlated to biological functions and critical for embryonic development, rather than the sparse expression within individual spots. In this regard, our Palette pipeline can be advantageous, as it allows for reconstruction of the major expression profiles, which are often more relevant for understanding developmental processes. Additionally, our Palette can be applied to serial sections, enabling the construction of 3D ST atlas.

      Finally, while the current analyses demonstrated that zSTEP can serve as a valuable tool for identifying genes having specific patterns at certain developmental stages, the exploration of zSTEP is still limited. During animal development, pattern formation is always one of the most important developmental issues. As demonstrated by the reaction-diffusion (RD) model, morphogen molecules are produced at specific regions of the embryo, forming morphogen gradients to guide cell specification, while interactions between different morphogens instruct more complicated and well-choreographed pattern formation. Our Palette constructed zSTEP, as a comprehensive transcriptomic expression pattern during development, could be leveraged to evaluate and prove the RD model during development, including AP patterning. Moreover, the investigation of gene expression patterns should not be limited to morphogens and TFs, and further investigation of their roles in AP patterning is desirable. Additionally, here a random forest model may be sufficient for investigating the most essential morphogens and TFs for AP axis refinement, while more sophisticated machine learning models may be required for addressing more specific biological questions."

      Reviewer #2 (Significance):

      The Palette pipeline demonstrates a marked improvement in specificity and accuracy when predicting spatial gene expression patterns. Evaluative studies on Drosophila and zebrafish datasets affirm its enhanced performance compared to existing methodologies. By effectively reconstructing spatial information from bulk transcriptomic data, the Palette method innovatively merges the philosophy of leveraging single-cell transcriptomic data for deconvolution analyses. This integration is pivotal, advancing traditional bulk RNA-seq approaches while laying the groundwork for future research.

      One of the notable achievements in this work is the construction of the DreSTEP atlas, which integrates serial bulk RNA-seq data with advanced 3D imaging techniques. This resource grants researchers unprecedented access to the visualization of gene expression patterns across the zebrafish embryo, facilitating the investigation of spatial relationships and cell-cell interactions critical for developmental processes. Such capabilities are invaluable for understanding the intricate dynamics of embryogenesis and the distinct roles of individual cell types.

      Response: We thank the reviewer for the positive evaluation of our work, both the Palette pipeline and zSTEP. The reviewer has strong expertise in algorithm development and computational biology, and the concerns and suggestions raised are highly valuable to us. We do not have extensive experience in bioinformatics tool development, and thus we did not thoroughly address the selection criteria or clarify the parameters used in the pipeline, which may hinder its use by other researchers. We therefore sincerely appreciate the reviewer's professional suggestions, which we can follow to address these issues, improve our manuscript, and make our work more impactful. Additionally, we did not consider computational efficiency, processing speed, or computational demands, which are important factors for other researchers using Palette. We would like to add extra analyses to address these aspects.

      Currently, based on the reviewer's suggestions, we have added text discussing the clustering strategy in the Palette pipeline, the advantages and limitations of Palette, and the potential application of zSTEP in developmental biology. We believe that readers will now have a clearer understanding of the performance of Palette and the application scenarios of both Palette and zSTEP. We have not yet fully addressed the reviewer's comments, but we are working on the planned additional analyses and expect to complete them within the next 1-2 months. We sincerely thank the reviewer for the professional and valuable suggestions, which have improved our work and will make it accessible to a wide range of researchers.

      Finally, through this review process, we have learned a lot about the important considerations and requirements when designing bioinformatics tools, and we benefit a lot from the thoughtful guidance. We express our thanks to the reviewer again for the guidance, and we will try our best to address the remaining issues to further improve our manuscript.

      Reviewer #3 (Evidence, reproducibility and clarity):

      In this study, Dong and colleagues developed a computational pipeline to use spatial transcriptomics (ST) datasets as a reference to infer the spatial patterns of gene expression from bulk RNA sequencing data. This approach aims to overcome the low read depth and limited gene detection capabilities in current ST datasets, while exploiting its ability to provide highly resolved spatial information. By combining bulk RNA-seq datasets from 3 developmental stages during early zebrafish development with previously available ST and imaging datasets, the authors build DreSTEP (Danio rerio spatiotemporal expression profiles). Using this approach, they go on to identify the morphogens and transcription factors involved in anteroposterior patterning.

      The paper is well written, and the pipeline presented in this study is likely to be useful beyond the case studies included in this study. There are a few questions that, in my view, would be important to clarify to increase the impact of this work:

      Response: We sincerely appreciate the positive feedback from the reviewer on the Palette pipeline and zebrafish spatiotemporal expression profiles zSTEP. We thank the reviewer for the constructive suggestions, which have inspired us to think deeply about application and advantages of Palette and zSTEP for future studies.

      We fully agree with the reviewer that we did not sufficiently clarify the advantages and limitations of our inference pipeline in the original manuscript. The questions raised by the reviewer are very insightful. For example, while the inferred expression patterns may closely resemble the in situ hybridization observations, which we consider good performance, the reviewer pointed out that we should consider whether weak yet real expression may have been removed. These questions have motivated us to think more deeply about the underlying principles and assumptions of our inference pipeline. Following the reviewer's questions, we have expanded our discussion of the application of zSTEP in developmental biology and the features of Palette compared with existing strategies.

      We believe that, after incorporating the revisions, our manuscript now demonstrates the application scenarios of Palette more clearly and suggests applications of zSTEP for investigating questions in developmental biology. We are grateful for the reviewer's guidance, which has helped us increase the impact of our work.

      1. The authors mention that they used a variable factor to adjust expression differences between the ST and bulk RNA-seq datasets. It would be important for the authors to comment on how much overlap in gene expression is necessary between the datasets for an accurate calculation of this variable factor? Can this be directly tested, for instance, by testing how their conclusions vary if expression is adjusted by a variable factor calculated from only a smaller set of genes?

      Response: We thank the reviewer for the professional questions. We are sorry for not including the gene numbers in our previous manuscript. We have now provided the numbers of genes in the bulk and ST data and the numbers of overlapped genes (Fig. 15).

      "Palette was performed on the aligned slices using the overlapped genes. For the 10 hpf embryo, there were 24,658 genes in the bulk data, 18,698 genes in the Stereo-seq data, and 16,601 overlapped genes. For the 12 hpf embryo, there were 23,018 genes in the bulk data, 18,948 genes in the Stereo-seq data, and 16,401 overlapped genes. For the 16 hpf embryo, there were 24,357 genes in the bulk data, 23,110 genes in the Stereo-seq data, and 19,539 overlapped genes."

      See response letter for the figure.

      For Palette implementation, we took all the overlapped genes. To calculate the variable factor, we aggregated the expression of each gene in the ST data, and then used the expression of the bulk data to divide the aggregated expression for variable factor calculation. As a result, each overlapped gene was assigned a variable factor to adjust its expression, based on its difference between bulk and ST data. The rationale behind this approach is that by considering the ST data as a whole, we can effectively reduce the variations among individual spots. This allows the variable factors to provide reasonable adjustment to gene expression.
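
      The per-gene calculation described above can be illustrated with a short sketch. Two points are assumptions made for this example rather than statements about the Palette code: the factor is taken to be the bulk expression divided by the ST expression summed over all spots (the direction of the division is our reading of the description), and no further normalization is applied. The function name `variable_factors` and the gene names and values are hypothetical.

```python
def variable_factors(bulk, st_spots):
    """One factor per overlapped gene: bulk expression / ST expression summed over spots.

    `bulk` maps gene -> expression; `st_spots` is a list of {gene: expression}
    dicts, one per spot. The division direction is an assumption (see text).
    """
    aggregated = {}
    for counts in st_spots:
        for gene, value in counts.items():
            aggregated[gene] = aggregated.get(gene, 0.0) + value
    overlapped = bulk.keys() & aggregated.keys()
    return {
        gene: bulk[gene] / aggregated[gene]
        for gene in overlapped
        if aggregated[gene] > 0  # skip genes with no ST signal to avoid dividing by zero
    }


# Made-up toy data: geneC is bulk-only and geneD is ST-only, so neither is overlapped.
bulk = {"geneA": 120.0, "geneB": 30.0, "geneC": 8.0}
st_spots = [
    {"geneA": 10.0, "geneB": 5.0},
    {"geneA": 20.0, "geneB": 10.0, "geneD": 4.0},
]
factors = variable_factors(bulk, st_spots)
print(factors)
```

      Because the ST expression is summed over all spots before the division, spot-to-spot variation does not enter the factor, which is the rationale given above.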

      In short, the variable factors can be directly calculated. Currently, Palette can only infer the expression patterns of overlapped genes. This means that when the number of overlapped genes is small, as with MERFISH, which detects only hundreds of genes, Palette can infer the expression patterns of these genes only. However, if the MERFISH data are of good quality and can resolve distinct cell types, we believe Palette will also perform well with MERFISH as the ST reference. We plan to run Palette using MERFISH as the ST reference to further demonstrate its broad applicability across different ST references.

      1. Palette gives rise to highly spatially precise patterns, which closely match those found in ISH. However, the smoothening of the expression can also remove weak, yet real, local expression patterns, as shown for idgf6 in Fig. 2a. Can the authors test this more extensively for other genes?

      Response: We thank the reviewer for this essential question. We agree that weak yet real expression might be removed by our Palette inference pipeline. Weak, sparse expression may be due to the ST technique itself or to variation between samples. However, such sparse gene expression may not have biological meaning, and the focus of our pipeline is to capture the expression patterns, which are closely correlated with function and crucial for embryonic development. Therefore, our algorithm considers spot characteristics and emphasizes cluster-specific expression, resulting in spatially specific expression patterns. In most cases, the main gene expression patterns can be captured, which helps in understanding gene functions and roles in embryonic development. We have updated Supplementary Figure S1a (Fig. 16) to include more gene patterns to demonstrate this point.

      See response letter for the figure.

      1. Using adjacent slices for ST and "bulk RNA-seq" may provide better results than those obtained when comparing two independent datasets. Could the authors also extend the analysis of Palette's functionalities by using separate, previously available but independent datasets, for ST and bulk RNA-seq in Drosophila as well?

      Response: We thank the reviewer for the valuable question. We agree that using adjacent slices may provide better results. The idea here is that the spatial expression patterns inferred from pseudo-bulk RNA-seq can be compared with the real ST expression to evaluate Palette performance. We have updated Figure 2a (Fig. 17) to illustrate the analysis more clearly.

      See response letter for the figure.

      To demonstrate Palette's functionalities, we used Palette to infer spatial expression for a zebrafish bulk RNA-seq slice (Junker et al., 2014) using a Stereo-seq slice (Liu et al., 2022) as the ST reference; these two datasets are separate and independent. We agree with the reviewer that it would be good to test independent datasets in Drosophila to further demonstrate Palette's functionalities. Unfortunately, we did not find Drosophila serial bulk RNA-seq data along the left-right axis for the corresponding stages, and thus we may be unable to perform the extra analyses using independent Drosophila datasets.

      References:

      Junker, J.P. et al. Genome-wide RNA Tomography in the zebrafish embryo. Cell 159, 662-675 (2014).

      Liu, C. et al. Spatiotemporal mapping of gene expression landscapes and developmental trajectories during zebrafish embryogenesis. Dev Cell 57, 1284-1298 e1285 (2022).

      1. The DreSTEP analysis in zebrafish embryos is interesting and validates well-established observations in the field. Can the authors also discuss whether and how their dataset allows them to refine our understanding of the spatial or temporal pattern of the morphogens and TFs involved in AP patterning? This would further validate their approach.

      Response: We thank the reviewer for recognizing our zSTEP and for raising this valuable question, which has inspired us to think more deeply about the potential application of zSTEP in developmental biology. As the reviewer noted, our zSTEP analyses have validated well-established observations in the field. Rather than focusing on the sparse expression detected in ST data, zSTEP emphasizes the gene expression patterns that are closely correlated with biological functions and critical for embryonic development. Therefore, zSTEP can serve as a valuable tool for identifying genes that have specific patterns at certain developmental stages.

      Pattern formation is one of the most important developmental issues for all animals. The reaction-diffusion (RD) model is a widely recognized theoretical framework used to explain self-regulated pattern formation in developing animal embryos (Kondo & Miura, 2010). Morphogen molecules are produced at specific regions of the embryo, forming morphogen gradients to guide cell specification. Most importantly, interactions between different morphogens instruct more complicated and well-choreographed pattern formation. Our Palette-constructed zSTEP provides a comprehensive transcriptomic expression pattern, including all morphogens and TFs, across the whole embryo during development. These valuable resources, in our opinion, could be leveraged to evaluate and prove the RD model during development, including AP patterning. In our current zSTEP analyses, we have already identified genes that exhibit specific expression patterns along AP axis, some of which have not been fully characterized. These genes could be potential targets for further investigation into their roles in AP patterning, although they are not the primary focus of this study. Additionally, our analyses only focused on morphogens and TFs, but zSTEP can be used to investigate the expression patterns of other genes as well. Moreover, we employed a random forest model to investigate the most essential morphogens and TFs for AP axis refinement, which is one of the basic applications of zSTEP. To investigate specific biological questions of interest, it would be worth exploring the use of more sophisticated machine learning models.

      We have added the following paragraph in the Discussion section to discuss the potential application of zSTEP in future studies.

      "Finally, while the current analyses demonstrated that zSTEP can serve as a valuable tool for identifying genes having specific patterns at certain developmental stages, the exploration of zSTEP is still limited. During animal development, pattern formation is always one of the most important developmental issues. As demonstrated by the reaction-diffusion (RD) model, morphogen molecules are produced at specific regions of the embryo, forming morphogen gradients to guide cell specification, while interactions between different morphogens instruct more complicated and well-choreographed pattern formation. Our Palette constructed zSTEP, as a comprehensive transcriptomic expression pattern during development, could be leveraged to evaluate and prove the RD model during development, including AP patterning. Moreover, the investigation of gene expression patterns should not be limited to morphogens and TFs, and further investigation of their roles in AP patterning is desirable. Additionally, here a random forest model may be sufficient for investigating the most essential morphogens and TFs for AP axis refinement, while more sophisticated machine learning models may be required for addressing more specific biological questions."

      Reference

      Kondo, S. & Miura, T. Reaction-Diffusion model as a framework for understanding biological pattern formation. Science 329, 1616-1620 (2010).

      1. Can the authors comment on the limits of this inference pipeline? And how it performs as compared to single-cell RNA sequencing datasets where spatial information is inferred from well-established marker genes?

      Response: We thank the reviewer for this insightful question, which has inspired us to further explore the advantages and limitations of the Palette pipeline in comparison with other inference strategies. As mentioned in the Discussion section, a key limitation of the inference pipeline is its heavy reliance on the quality of ST data. If the ST data are not of sufficient quality, the low-expression genes may not be detected or may only appear in very few scattered spots. We think this is a common issue for any inference tool that uses ST data as the reference. However, with the ongoing advancements in spatial resolution and data quality, the performance of Palette is expected to improve.

      As a comparison, the single-cell RNA sequencing datasets where spatial information is inferred from well-established marker genes do not face this limitation. The ground-breaking work by Satija et al. (2015) used such a strategy, combining scRNA-seq and in situ hybridizations of well-established marker genes to infer spatial location. This enables single cell resolution, as it maintains the high read depth and gene detection. One advantage of this scRNA-seq-based strategy is that it provides the transcriptomics of individual cells, rather than a combination of cells within an ST spot, although the positional relationships between cells are not real.

      However, compared to the inference from ST data, the positional relationships between cells are not directly captured. On the other hand, as embryonic development progresses, more cell types will be specified, and the body patterning becomes more complex. In this scenario, using well-established marker genes to infer spatial information would be much more challenging. Additionally, there are not many scRNA-seq datasets of serial sections, and thus this strategy may not be usable for constructing a 3D ST atlas.

      In contrast, our Palette inference pipeline is based on the ST data, which preserves the real positional relationships between spots. Although our inference pipeline cannot achieve single cell resolution, it focuses on the gene expression patterns rather than the sparse expression within individual spots. By applying Palette to paired serial sections, we were able to generate a 3D spatial expression atlas of zebrafish embryos, which has shown promising performance for investigating gene expression patterns and their involvement in AP patterning.

      Reference

      Satija, R. et al. Spatial reconstruction of single-cell gene expression data. Nature biotechnology 33, 495-502 (2015)

      We have updated the following paragraphs in the Discussion section to describe the limitations of the inference pipeline in more detail.

      "Thirdly, the performance of Palette and zSTEP heavily relied on the quality of ST data. If the quality of ST data is not of sufficient quality, the low-expression genes may not be detected or only appear in very few scattered spots, and the performance of spot clustering could also be affected. Moreover, in this study, for example, the Stereo-seq data of 12 hpf zebrafish embryo had fewer slices on the right side (Fig. S3b), resulting in more blank spots in the right part of zSTEP for the 12 hpf embryo. However, with the ongoing advancements in spatial resolution and data quality, the performance of Palette is expected to be enhanced and demonstrate even greater potential for analysing spatiotemporal gene expression.

      On the other hand, compared to the brilliant strategy that infers spatial information of scRNA-seq data from well-established genes, our Palette pipeline cannot achieve single cell resolution. However, our Palette pipeline is based on the ST reference, and thus preserves the real positional relationships between spots. Furthermore, the focus of our pipeline is to infer the gene expression patterns, which are closely correlated to biological functions and critical for embryonic development, rather than the sparse expression within individual spots. In this regard, our Palette pipeline can be advantageous, as it allows for reconstruction of the major expression profiles, which are often more relevant for understanding developmental processes. Additionally, our Palette can be applied to serial sections, enabling the construction of 3D ST atlas."

      Reviewer #3 (Significance):

      This study tackles an important challenge in biology - the difficulty of resolving gene expression patterns with high spatial precision and in a high-throughput manner. By integrating sequencing datasets from previously published studies, as well as newly-generated datasets, the authors provide evidence that their novel inference pipeline enables them to obtain high-quality spatial information simply from bulk RNA-seq datasets, using ST as a reference. The development of this pipeline - Palette - is a major part of this manuscript and its applicability is validated using datasets from Drosophila and zebrafish embryos. This is an important advance for the field, but it would be nice for the authors to further comment on i) the validity of some of their approaches and how they may influence the quality of their inference, as well as, ii) potential pitfalls/limitations of this approach as compared to others available in the field. This would synthesize both previous and current findings into a conceptual and technological framework that would have a strong impact well beyond cell and developmental biology.

      Audience: This study would be relevant for a broad audience of biologists, interested in morphogen signaling, gene regulatory networks and cell fate specification.

      Expertise in zebrafish development, gastrulation, morphogen signaling and morphogenesis.

      Response: We thank the reviewer for providing the positive feedback and for raising these valuable questions, which have motivated us to consider more deeply the design concept and further application of Palette and zSTEP. Based on the insightful questions from the reviewer, we have added two extra paragraphs in the Discussion section to further discuss the potential application of zSTEP in developmental biology and application scenarios of the Palette pipeline. Specifically, we have demonstrated that the performance of the inference pipeline relies on the spatial resolution and data quality of the ST data. We have then compared the advantages and limitations of Palette with the existing brilliant spatial inference strategy, which infers spatial information of scRNA-seq from well-established marker genes. Although our inference pipeline cannot achieve single cell resolution, it can capture the major expression patterns, which are closely correlated to functions and critical for embryonic development. We believe this will help readers gain a clearer understanding of the advantages and limitations of our pipeline compared to other tools, as well as the tasks for which Palette and our constructed zSTEP can be utilized. We express our thanks to the reviewer again for the valuable comments.

    1. Author response:

      The following is the authors’ response to the current reviews.

      We would like to thank the reviewers for their efforts and feedback on our preprint. We have elected to rework the manuscript for publication in a different journal. In this process we will alter many of the approaches and re-evaluate the conclusions. With this, many of the points raised by the reviewers will be no longer relevant and therefore do not require a response. Again, we thank the reviewers for their time and helpful feedback.


      The following is the authors’ response to the original reviews.

      eLife Assessment:

      The authors present a potentially useful approach of broad interest arguing that anterior cingulate cortex (ACC) tracks option values in decisions involving delayed rewards. The authors introduce the idea of a resource-based cognitive effort signal in ACC ensembles and link ACC theta oscillations to a resistance-based strategy. The evidence supporting these new ideas is incomplete and would benefit from additional detail and more rigorous analyses and computational methods.

      We are extremely grateful for the many excellent comments of the reviewers. To address these concerns, we have completely reworked the manuscript, adding more rigorous approaches in each phase of the analysis and computational model. We realize that the revision has taken some time to prepare. However, given the comments of the reviewers, we felt it necessary to thoroughly rework the paper based on their input. Here is a (nonexhaustive) overview of the major changes we made:

      We have developed a way to more adequately capture the heterogeneity in the behavior

      We have completely reworked the RL model

      We have added additional approaches and rigor to the analysis of the value-tracking signal. 

      Reviewer #1 (Public Review):

      Summary:

      Young (2.5 mo [adolescent]) rats were tasked to either press one lever for immediate reward or another for delayed reward. 

      Please note that at the time of testing and training that the rats were > 4 months old. 

      The task had a complex structure in which (1) the number of pellets provided on the immediate reward lever changed as a function of the decisions made, (2) rats were prevented from pressing the same lever three times in a row. Importantly, this task is very different from most intertemporal choice tasks which adjust delay (to the delayed lever), whereas this task held the delay constant and adjusted the number of 20 mg sucrose pellets provided on the immediate value lever.

      Several studies parametrically vary the immediate lever (PMID: 39119916, 31654652, 28000083, 26779747, 12270518, 19389183). While most versions of the task will yield qualitatively similar estimates of discounting, the adjusting amount is preferred as it provides the most consistent estimates (PMID: 22445576). More specifically, this version of the task avoids the contrast effects that result from changing the delay during the session (PMID: 23963529, 24780379, 19730365, 35661751), which complicate value estimates. 

      Analyses are based on separating sessions into groups, but group membership includes arbitrary requirements and many sessions have been dropped from the analyses. 

      We have updated this approach and now provide a more comprehensive assessment of the behavior. The updated approach applies a hierarchical clustering model to the behavior in each session. This was applied at each delay to separate animals that prefer the immediate option more/less. This results in 4 statistically dissociable groups (4LO, 4HI, 8LO, 8HI) and includes all sessions. Please see Figure 1. 
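
      As a rough illustration of the session-level grouping described above, the following sketch splits the sessions recorded at one delay into a low and a high immediate-preference group using a simple average-linkage agglomerative clustering on a single summary statistic. The function name `split_sessions`, the choice of statistic (fraction of immediate choices per session), and the example values are hypothetical and are not taken from the manuscript; the features and linkage actually used in the revised analysis may differ.

```python
def split_sessions(immediate_fraction, n_groups=2):
    """Average-linkage agglomerative clustering on 1-D values.

    immediate_fraction: one value per session (hypothetical summary statistic).
    Returns one label per session; label 0 is the cluster with the lowest mean.
    """
    # Start with every session in its own cluster (clusters hold session indices).
    clusters = [[i] for i in range(len(immediate_fraction))]

    def mean(cluster):
        return sum(immediate_fraction[i] for i in cluster) / len(cluster)

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(
            abs(immediate_fraction[i] - immediate_fraction[j]) for i in a for j in b
        )
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_groups remain.
    while len(clusters) > n_groups:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    # Order clusters by mean so that label 0 means "fewer immediate choices".
    clusters.sort(key=mean)
    labels = [None] * len(immediate_fraction)
    for label, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = label
    return labels


# Made-up fractions of immediate choices for six sessions at the 4 sec delay.
sessions_4s = [0.20, 0.25, 0.22, 0.60, 0.65, 0.58]
labels = split_sessions(sessions_4s)
names = ["4LO" if lab == 0 else "4HI" for lab in labels]
print(names)
```

      Running the same function separately on the sessions at each delay would give the four session-based labels (4LO, 4HI, 8LO, 8HI) in this toy setting, with every session retained.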

      Computational modeling is based on an overly simple reinforcement learning model, as evidenced by fit parameters pegging to the extremes. 

      We have completely reworked the simulations in the revision. In the updated RL model we carefully add parameters to determine which are necessary to explain the experimental data. We feel that it is simplified yet more descriptive. Please see Figure 2 and associated text. 

      The neural analysis is overly complex and does not contain the necessary statistics to assess the validity of their claims.

      We have dramatically streamlined the spike train analysis approach and added several statistical tests to ensure the rigor of our results. Please see Figures 4,5,6 and associated text. 

      Strengths:

      The task is interesting.

      Thank you for the positive comment.

      Weaknesses:

      Behavior:

      The basic behavioral results from this task are not presented. For example, "each recording session consisted of 40 choice trials or 45 minutes". What was the distribution of choices over sessions? Did that change between rats? Did that change between delays? Were there any sequence effects? (I recommend looking at reaction times.) Were there any effects of pressing a lever twice vs after a forced trial? 

      Please see the updated statistics and panels in Figures 1 and 2. We believe these address this valid concern.  

      This task has a very complicated sequential structure that I think I would be hard pressed to follow if I were performing this task. 

      Human tasks implement a similar task structure (PMID: 26779747). Please note the response above that outlines the benefits of using this task.   

      Before diving into the complex analyses assuming reinforcement learning paradigms or cognitive control, I would have liked to have understood the basic behaviors the rats were taking. For example, what was the typical rate of lever pressing? If the rats are pressing 40 times in 45 minutes, does waiting 8s make a large difference?

      Thank you for this suggestion. Our additions to Figure 1 are intended to better explain and quantify the behavior of the animals. Note that this task is designed to hold the rate of reinforcement constant no matter the choices of the animals. Our analysis supports the long-held view in the literature that rats do not like waiting for rewards, even at small delays. Going from the 4 sec to the 8 sec delay results in significantly more immediate choices, indicating that the rats will forgo waiting 8 sec for a larger reinforcer and take a smaller reinforcer at 4 sec.  

      For that matter, the reaction time from lever appearance to lever pressing would be very interesting (and important). Are they making a choice as soon as the levers appear? Are they leaning towards the delay side, but then give in and choose the immediate lever? What are the reaction time hazard distributions?

      This is an excellent suggestion, and we have added a brief analysis of reaction times (please see the section entitled “4 behavioral groups are observed across all sessions” in the Results). Please note that an analysis of the reaction times has been presented in a prior analysis of this data set (White et al., 2024). In addition, an analysis of reaction times in this task was performed in Linsenbardt et al. (2017). In short, animals tend to choose within 1 second of the lever appearing. In addition, our prior work shows that responses on the immediate lever tend to be slower, which we viewed as evidence of increased deliberation requirements (possibly required to integrate value signals).   

      It is not clear that the animals on this task were actually using cognitive control strategies on this task. One cannot assume from the task that cognitive control is key. The authors only consider a very limited number of potential behaviors (an overly simple RL model). On this task, there are a lot of potential behavioral strategies: "win-stay/lose-shift", "perseveration", "alternation", even "random choices" should be considered.

      The strategies the Reviewer mentioned are descriptors of the actual choices the rats made. For example, perseveration means the rat is choosing one of the levers at an excessively high rate whereas alternation means it is choosing the two levers more or less equally, independent of payouts. But the question we are interested in is why? We are arguing that the type of cognitive control determines the choice behavior, but cognitive control is an internal variable that guides behavior, rather than simply a descriptor of the behavior. For example, the animal opts to perseverate on the delayed lever because the cognitive control required to track ival is too high. We then searched the neural data for signatures of the two types of cognitive control.

      The delay lever was assigned to the "non-preferred side". How did side bias affect the decisions made?

      The side bias clearly does not impact performance as the animals prefer the delay lever at shorter delays, which works against this bias.  

      The analyses based on "group" are unjustified. The authors compare the proportion of delayed to immediate lever press choices on the non-forced trials and then did k-means clustering on this distribution. But the distribution itself was not shown, so it is unclear whether the "groups" were actually different. They used k=3, but do not describe how this arbitrary number was chosen. (Is 3 the optimal number of clusters to describe this distribution?) Moreover, they removed three group 1 sessions with an 8s delay and two group 2 sessions with a 4s delay, making all the group 1 sessions 4s delay sessions and all group 2 sessions 8s delay sessions. They then ignore group 3 completely. These analyses seem arbitrary and unnecessarily complex. I think they need to analyze the data by delay. (How do rats handle 4s delay sessions? How do rats handle 6s delay sessions? How do rats handle 8s delay sessions?). If they decide to analyze the data by strategy, then they should identify specific strategies, model those strategies, and do model comparison to identify the best explanatory strategy. Importantly, the groups were session-based, not rat based, suggesting that rats used different strategies based on the delay to the delayed lever.

      We have completely reworked our approach for capturing the heterogeneity in behavior. We have taken care to show more of the behavioral statistics that have gone into identifying each of the groups. All sessions are included in this analysis. As the reviewer suggests, we used the statistics from each of the behavioral groups to inform the RL model that explores neural signals that underlie decisions in this task. We strongly disagree that groups should be rat-based rather than session-based, as the behavior of the animal can, and does, change from day to day. This is important to consider when analyzing the neural data, as rat-based groupings would ignore this potential source of variance. 

      The reinforcement learning model used was overly simple. In particular, the RL model assumes that the subjects understand the task structure, but we know that even humans have trouble following complex task structures. Moreover, we know that rodent decision-making depends on much more complex strategies (model-based decisions, multi-state decisions, rate-based decisions, etc). There are lots of other ways to encode these decision variables, such as softmax with an inverse temperature rather than epsilon-greedy. The RL model was stated as a given and not justified. As one critical example, the RL model fit to the data assumed a constant exponential discounting function, but it is well-established that all animals, including rodents, use hyperbolic discounting in intertemporal choice tasks. Presumably this changes dramatically the effect of 4s and 8s. As evidence that the RL model is incomplete, the parameters found for the two groups were extreme. (Alpha=1 implies no history and only reacting to the most recent event. Epsilon=0.4 in an epsilon-greedy algorithm is a 40% chance of responding randomly.)

      While we agree that the approach was not fully justified, we do not agree that it was invalid. Simply stated, a softmax approach gives the best fit to the choice behavior, whereas our epsilon-greedy approach attempted to reproduce the choice behavior using a naïve agent that progressively learns the values of the two levers on a choice-by-choice basis. Nevertheless, we certainly appreciate that important insights can be gained by fitting a model to the data as suggested. We feel that the new modeling approach we have now implemented is optimal for the present purposes and it replaces the one used in the original manuscript.
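
      To make the distinction between the two choice rules discussed above concrete, the sketch below computes choice probabilities for two levers under an epsilon-greedy rule and under a softmax rule with inverse temperature beta. The function names, the value estimates, and the parameter settings are illustrative assumptions; they do not reproduce the model in the original or the revised manuscript.

```python
import math


def epsilon_greedy_probs(q_values, epsilon):
    """With probability epsilon choose uniformly at random, otherwise pick the best option."""
    n = len(q_values)
    best = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n        # share of random (exploratory) choices
    probs[best] += 1.0 - epsilon     # remaining mass goes to the greedy option
    return probs


def softmax_probs(q_values, beta):
    """Choice probabilities proportional to exp(beta * Q)."""
    m = max(q_values)                # subtract the max for numerical stability
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical value estimates: [immediate lever, delayed lever].
q = [2.0, 3.0]
eg = epsilon_greedy_probs(q, epsilon=0.4)
sm = softmax_probs(q, beta=1.0)
print("epsilon-greedy:", eg)
print("softmax:", sm)
```

      In this toy case, epsilon = 0.4 with two levers means the lower-valued lever is chosen about 20% of the time no matter how far apart the values are, whereas under softmax the probability of the lower-valued lever shrinks smoothly as the value difference or beta grows.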

      The authors do add a "dbias" (which is a preference for the delayed lever) term to the RL model, but note that it has to be maximal in the 4s condition to reproduce group 2 behavior, which means they are not doing reinforcement learning anymore, just choosing the delayed lever.

      The dbias term was dropped in the new model implementation.

      Neurophysiology:

      The neurophysiology figures are unclear and mostly uninterpretable; they do not show variability, statistics or conclusive results.

      While the reviewer is justified in criticizing the clarity of the figures, the statement that “they do not show variability, statistics or conclusive results” is not correct. Each of the figures presented in the first draft of the manuscript, except Figure 3, is accompanied by statistics and measures of variability. Nonetheless, we have updated each of the neurophysiology analyses. We hope that the reviewer will find our updates more rigorous and thorough.   

      As with the behavior, I would have liked to have seen more traditional neurophysiological analyses first. What do the cells respond to? How do the manifolds change aligned to the lever presses? Are those different between lever presses?

      We have added several figures that plot the mean +/- SEM of the neural activity (see Figures 4 and 5). Hopefully this provides a more intuitive picture of the changes in neural activity throughout the task.  

      Are there changes in cellular information (both at the individual and ensemble level) over time in the session? 

      We provide several analyses of how firing rate changes over trials in relation to ival over time and trials in the session. In addition, we describe how these signals change in each of the behavioral groups. 

      How do cellular responses differ during that delay while both levers are out, but the rats are not choosing the immediate lever?

      We were somewhat unclear about this suggestion, as the delay follows the lever press. In addition, there is no delay after immediate presses. 

      Figure 3, for example, claims that some of the principal components tracked the number of pellets on the immediate lever ("ival"), but they are just two curves. No statistics, controls, or justification for this is shown. BTW, on Figure 3, what is the event at 200s?

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      I'm confused. On Figure 4, the number of trials seems to go up to 50, but in the methods, they say that rats received 40 trials or 45 minutes of experience.

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      At the end of page 14, the authors state that the strength of the correlation did not differ by group and that this was "predicted" by the RL modeling, but this statement is nonsensical, given that the RL modeling did not fit the data well, depended on extreme values. Moreover, this claim is dependent on "not statistically detectable", which is, of course, not interpretable as "not different".

      This comment is no longer relevant based on the changes we’ve made to the manuscript. 

      There is an interesting result on page 16 that the increases in theta power were observed before a delayed lever press but not an immediate lever press, and then that the theta power declined after an immediate lever press. 

      Thank you for the positive comment. 

      These data are separated by session group (again group 1 is a subset of the 4s sessions, group 2 is a subset of the 8s sessions, and group 3 is ignored). I would much rather see these data analyzed by delay itself or by some sort of strategy fit across delays.

      Thank you for the excellent suggestion. Our new group assignments take delay into account. 

      That being said, I don't see how this description shows up in Figure 6. What does Figure 6 look like if you just separate the sessions by delay?

      We are unclear what the reviewer means by “this description”.  

      Discussion:

      Finally, it is unclear to what extent this task actually gets at the questions originally laid out in the goals and returned to in the discussion. The idea of cognitive effort is interesting, but there is no data presented that this task is cognitive at all. The idea of a resourced cognitive effort and a resistance cognitive effort is interesting, but presumably the way one overcomes resistance is through resource-limited components, so it is unclear that these two cognitive effort strategies are different.

      The basis for the reviewer's assertion that “the way one overcomes resistance is through resource-limited components” is not clear. In the revised version, we have taken greater care to outline how each type of effort signal facilitates performance of the task and articulate these possibilities in our stochastic and RL models. We view the strong evidence for ival tracking presented herein as a critical component of resource-based cognitive effort. 

      The authors state that "ival-tracking" (neurons and ensembles that presumably track the number of pellets being delivered on the immediate lever - a fancy name for "expectations") "taps into a resourced-based form of cognitive effort", but no evidence is actually provided that keeping track of the expectation of reward on the immediate lever depends on attention or mnemonic resources. They also state that a "dLP-biased strategy" (waiting out the delay) is a "resistance-based form of cognitive effort" but no evidence is made that going to the delayed side takes effort.

      We challenge the reviewer's assertion that ival tracking is a “fancy name for expectations”. We make no claim about the prospective or retrospective nature of the signal. Clearly, expectations should be prospective and therefore different from ival tracking. Regarding the resistance signal: First, animals avoid the delay lever more often at the 8 sec delay (Figure 1). We have shown that increasing the delay systematically biases responses AWAY from the delay (Linsenbardt et al., 2017). This is consistent with a well-developed literature that rats and mice do not like waiting for delayed reinforcers. We contend that enduring something you don’t like takes effort. 

      The authors talk about theta synchrony, but never actually measure theta synchrony, particularly across structures such as amygdala or ventral hippocampus. The authors try to connect this to "the unpleasantness of the delay", but provide no measures of pleasantness or unpleasantness. They have no evidence that waiting out an 8s delay is unpleasant.

      We have added spike-field coherence to better connect with the literature on synchrony. Note that we never refer to our results as “synchrony”. However, we would be remiss to not address the growing literature on theta synchrony in effort allocation. There is a well-developed literature that rats and mice do not like waiting for delayed reinforcers. If waiting out the delay were not unpleasant, then why would the animals forgo larger rewards to avoid it? 

      The authors hypothesize that the "ival-tracking signal" (the expectation of number of pellets on the immediate lever) "could simply reflect the emotional or autonomic response". Aside from the fact that no evidence for this is provided, if this were to be true, then, in what sense would any of these signals be related to cognitive control?

      This is proposed as an alternative explanation for the ival signal in the discussion. It was added as our due diligence. Emotional state could provide feedback to the currently implemented control mechanism. If waiting for reinforcement is too unpleasant, this could drive the animals to ival tracking and choosing the immediate option more frequently. We provide this option only as a possibility, not a conclusion. We have clarified this in the revised text. Nevertheless, based on our review of the literature, autonomic tracking, in some form, seems to be the most likely function of ACC (Seamans & Floresco 2022). While the reviewer may disagree with this, we feel it is at least as valid as all the complex, cognitively-based interpretations that commonly appear in the literature.

      Reviewer #2 (Public Review):

      Summary:

      This manuscript explores the neuronal signals that underlie resistance vs resource-based models of cognitive effort. The authors use a delayed discounting task and computational models to explore these ideas. The authors find that the ACC strongly tracks value and time, which is consistent with prior work. Novel contributions include quantification of a resource-based control signal among ACC ensembles, and linking ACC theta oscillations to a resistance-based strategy.

      Strengths:

      The experiments and analyses are well done and have the potential to generate an elegant explanatory framework for ACC neuronal activity. The inclusion of local-field potential / spike-field analyses is particularly important because these can be measured in humans.

      Thank you for the endorsement of our work.

      Weaknesses:

      I had questions that might help me understand the task and details of neuronal analyses.

      (1) The abstract, discussion, and introduction set up an opposition between resource and resistance-based forms of cognitive effort. It's clear that the authors find evidence for each (ACC ensembles = resource, theta = resistance?) but I'm not sure where the data fall on this dichotomy.

      (a) An overall very simple schematic early in the paper (prior to the MCML model? or even the behavior) may help illustrate the main point.

      (b) In the intro, results, and discussion, it may help to relate each point to this dichotomy.

      (c) What would resource-based signals look like? What would resistance based signals look like? Is the main point that resistance-based strategies dominate when delays are short, but resource-based strategies dominate when delays are long?

      (d) I wonder if these strategies can be illustrated? Could these two measures (dLP vs ival tracking) be plotted on separate axes or extremes, and behavior, neuronal data, LFP, and spectral relationships be shown on these axes? I think Figure 2 is working towards this. Could these be shown for each delay length? This way, as the evidence from behavior, model, single neurons, ensembles, and theta is presented, it can be related to this framework, and the reader can organize the findings.

      These are excellent suggestions, and we have implemented them, where possible. 

      (2) The task is not clear to me.

      (a) I wonder if a task schematic and a flow chart of training would help readers.

      Yes, excellent idea, we have now included this in Figure 1. 

      (b) This task appears to be relatively new. Has it been used before in rats (Oberlin and Grahame is a mouse study)? Some history / context might help orient readers.

      Indeed, this task has been used in rats in several prior studies. Please see the following references (PMID: 39119916, 31654652, 28000083, 26779747, 12270518, 19389183).

      (c) How many total sessions were completed with ascending delays? Was there criteria for surgeries? How many total recording sessions per animal (of the 54?)

      Please note that the delay does not change within a session. There were no criteria for surgery. 

      (d) How many trials completed per session (40 trials OR 45 minutes)? Where are there errors? These details are important for interpreting Figure 1.

      Every animal in this data set completed 40 trials, and we have updated the task description to clarify this issue. There are no errors in this task; rather, the task is designed to measure the tendency to make an impulsive choice (smaller reward now).

      (3) Figure 1 is unclear to me.

      (a) Delayed vs immediate lever presses are being plotted - but I am not sure what is red, and what is blue. I might suggest plotting each animal.

      We have updated Figure 1 considerably for clarity. 

      (b) How many animals and sessions go into each data point?

      We hope this is clarified now with our new group assignments as all sessions were included in the analysis. 

      (c) Table 1 (which might be better referenced in the paper) refers to rats by session. Is it true that some rats (2 and 8) were not analyzed for the bulk of the paper? Some rats appear to switch strategies, and some stay in one strategy. How many neurons come from each rat?

      We have updated Table 1 based on our new groupings. The rats that contribute the most sessions also tend to be represented across the behavioral groups; it is therefore unlikely that effort allocation strategies across groupings are an esoteric feature of an individual animal.

      (d) Task basics - RT, choice, accuracy, video stills - might help readers understand what is going into these plots

      (e) Does the animal move differently (i.e., RTs) in G1 vs. G2?

      Excellent suggestion. We have added more analysis of the task variables in the revision (e.g. RT, choice comparisons across delays, etc.).

      (4) I wasn't sure how clustered G1 vs. G2 vs G3 are. To make this argument, the raw data (or some axis of it) might help.

      (a) This is particularly important because G3 appears to be a mix of G1 and G2, although upon inspection, I'm not sure how different they really are

      (b) Was there some objective clustering criteria that defined the clusters?

      (c) Why discuss G3 at all? Can these sessions be removed from analysis?

      Based on our updates to the behavioral analysis these comments are no longer relevant. 

      (5) The same applies to neuronal analyses in Fig 3 and 4

      (a) What does a single neuron peri-event raster look like? I would include several of these.

      (b) What does PC1, 2 and 3 look like for G1, G2, and G3?

      (c) Certain PCs are selected, but I'm not sure how they were selected - was there a criteria used? How was the correlation between PCA and ival selected? What about PCs that don't correlate with ival?

      (d) If the authors are using PCA, then scree plots and PETHs might be useful, as well as comparisons to PCs from time-shuffled / randomized data.

      We hope that our reworking of the neural data analysis has clarified these issues. We now include several firing rate examples and aggregate data.   

      (6) I had questions about the spectral analysis

      (a) Theta has many definitions - why did the authors use 6-12 Hz? Does it come from the hippocampal literature, and is this the best definition of theta? What about other bands: delta (1-4 Hz), theta (4-7 Hz), and beta (13-30 Hz)? These bands are of particular importance because they have been associated with errors, dopamine, and are abnormal in schizophrenia and Parkinson's disease.

      This designation comes mainly from the hippocampal and ACC literature in rodents. In addition, this range best captured the peak in the power spectrum in our data. Note that we focus our analysis on theta given the literature regarding theta in the ACC as a correlate of cognitive control (references in manuscript). We did interrogate other bands as a sanity check, and the results were mostly limited to theta. Given the scope of our manuscript and the concerns raised regarding complexity, we are concerned that adding frequency analyses beyond theta would obfuscate the take-home message.

      However, the spectrograms in Figure 3 show a range of frequencies and highlight the ones in the theta band as the most dynamic prior to the choice. 

      (b) Power spectra and time-frequency analyses may justify the authors' focus. I would show these (y-axis: frequency, x-axis: time, z-axis: power).

      Thank you for the suggestion. We have added this to Figure 3.    

      (7) PC3 as an autocorrelation doesn't seem to be the right way to infer theta entrainment or spike-field relationships, as PCA can be vulnerable to phantom oscillations, and coherence can be transient. It is also difficult to compare to traditional measures of phase-locking. Why not simply use spike-field coherence? This is particularly important with reference to the human literature, which the authors invoke.

      Excellent suggestion. Note that PCA provided a way to classify neurons that exhibited peaks in the autocorrelation at theta frequencies. We have added spike-field coherence, and this analysis confirms the differences in theta entrainment of the spike trains across the behavioral groups. Please see Figure 6D.   

      Reviewer #3 (Public Review):

      Summary:

      The study investigated decision making in rats choosing between small immediate rewards and larger delayed rewards, in a task design where the size of the immediate rewards decreased when this option was chosen and increased when it was not chosen. The authors conceptualise this task as involving two different types of cognitive effort; 'resistance-based' effort putatively needed to resist the smaller immediate reward, and 'resource-based' effort needed to track the changing value of the immediate reward option. They argue based on analyses of the behaviour, and computational modelling, that rats use different strategies in different sessions, with one strategy in which they consistently choose the delayed reward option irrespective of the current immediate reward size, and another strategy in which they preferentially choose the immediate reward option when the immediate reward size is large, and the delayed reward option when the immediate reward size is small. The authors recorded neural activity in anterior cingulate cortex (ACC) and argue that ACC neurons track the value of the immediate reward option irrespective of the strategy the rats are using. They further argue that the strategy the rats are using modulates their estimated value of the immediate reward option, and that oscillatory activity in the 6-12 Hz theta band occurs when subjects use the 'resistance-based' strategy of choosing the delayed option irrespective of the current value of the immediate reward option. If solid, these findings will be of interest to researchers working on cognitive control and ACC's involvement in decision making. However, there are some issues with the experiment design, reporting, modelling and analysis which currently preclude high confidence in the validity of the conclusions.

      Strengths:

      The behavioural task used is interesting and the recording methods should enable the collection of good quality single unit and LFP electrophysiology data. The authors recorded from a sizable sample of subjects for this type of study. The approach of splitting the data into sessions where subjects used different strategies and then examining the neural correlates of each is in principle interesting, though I have some reservations about the strength of evidence for the existence of multiple strategies.

      Thank you for the positive comments. 

      Weaknesses:

      The dataset is very unbalanced in terms of both the number of sessions contributed by each subject, and their distribution across the different putative behavioural strategies (see table 1), with some subjects contributing 9 or 10 sessions and others only one session, and it is not clear from the text why this is the case. Further, only 3 subjects contribute any sessions to one of the behavioural strategies, while 7 contribute data to the other such that apparent differences in brain activity between the two strategies could in fact reflect differences between subjects, which could arise due to e.g. differences in electrode placement. To firm up the conclusion that neural activity is different in sessions where different strategies are thought to be employed, it would be important to account for potential cross-subject variation in the data. The current statistical methods don't do this as they all assume fixed effects (e.g. using trials or neurons as the experimental unit and ignoring which subject the neuron/trial came from).

      In the revised manuscript we have updated the group assignments. We have improved our description of the logic and methods for employing these groupings as well. With this new approach, all sessions are now included in the analysis. The group assignments are made purely on the behavioral statistics of an animal in each session. We feel this approach is preferable to eliminating neurons or sessions with the goal of balancing them, which may introduce bias. Further, the rats that contribute the most sessions also tend to be represented across the behavioral groups; it is therefore unlikely that effort allocation strategies across groupings are an esoteric feature of an individual animal. As neurons are randomly sampled from each animal in a given session, we feel that we’re justified in treating these as fixed effects.

      It is not obvious that the differences in behaviour between the sessions characterised as using the 'G1' and 'G2' strategies actually imply the use of different strategies, because the behavioural task was different in these sessions, with a shorter wait (4 seconds vs 8 seconds) for the delayed reward in the G1 strategy sessions where the subjects consistently preferred the delayed reward irrespective of the current immediate reward size. Therefore the differences in behaviour could be driven by difference in the task (i.e. external world) rather than a difference in strategy (internal to the subject). It seems plausible that the higher value of the delayed reward option when the delay is shorter could account for the high probability of choosing this option irrespective of the current value of the immediate reward option, without appealing to the subjects using a different strategy.

      Further, even if the differences in behaviour do reflect different behavioural strategies, it is not obvious that these correspond to allocation of different types of cognitive effort. For example, subjects' failure to modify their choice probabilities to track the changing value of the immediate reward option might be due simply to valuing the delayed reward option higher, rather than not allocating cognitive effort to tracking immediate option value (indeed this is suggested by the neural data). Conversely, if the rats assign higher value to the delayed reward option in the G1 sessions, it is not obvious that choosing it requires overcoming 'resistance' through cognitive effort.

      The RL modelling used to characterise the subject's behavioural strategies made some unusual and arguably implausible assumptions:

      Thank you for the feedback, based on these comments (and those above) we have completely reworked the RL model. In addition, we’ve taken care to separate out the variables that correspond to a resistance- versus a resource-based signal. 

      There were also some issues with the analyses of neural data which preclude strong confidence in their conclusions:

      Figure 4I makes the striking claim that ACC neurons track the value of the immediately rewarding option equally accurately in sessions where two putative behavioural strategies were used, despite the behaviour being insensitive to this variable in the G1 strategy sessions. The analysis quantifies the strength of correlation between a component of the activity extracted using a decoding analysis and the value of the immediate reward option. However, as far as I could see this analysis was not done in a cross-validated manner (i.e. evaluating the correlation strength on test data that was not used for either training the MCML model or selecting which component to use for the correlation). As such, the chance level correlation will certainly be greater than 0, and it is not clear whether the observed correlations are greater than expected by chance.

      We have added more rigorous methods to assess the ival tracking signal (Figure 4 and 5). In addition, we’ve dropped the claim that ival tracking is the same across the behavioral groups. We suspect that this was an artifact of a suboptimal group assignment approach in the previous version. 
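
The reviewer's concern about selection without cross-validation can be illustrated with a minimal sketch. It is not the authors' analysis or the MCML model: the "components" and the "value" signal below are hypothetical pure-noise sequences, and the split into the first and last 20 trials is an assumption made only for illustration. Picking the best-correlated component on one set of trials inflates its correlation on those trials, so the fair estimate is the correlation of that same component on held-out trials.

```python
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def select_and_evaluate(components, target, n_train):
    """Pick the component best correlated with target on training trials,
    then report that component's correlation on the held-out trials."""
    train_r = [pearson_r(c[:n_train], target[:n_train]) for c in components]
    best = max(range(len(components)), key=lambda i: abs(train_r[i]))
    test_r = pearson_r(components[best][n_train:], target[n_train:])
    return best, train_r[best], test_r

rng = random.Random(0)
n_trials, n_components = 40, 20

# Hypothetical data: noise-only components and a noise-only value signal,
# so any correlation found here is due to chance and selection alone.
target = [rng.gauss(0, 1) for _ in range(n_trials)]
components = [[rng.gauss(0, 1) for _ in range(n_trials)]
              for _ in range(n_components)]

best, r_train, r_test = select_and_evaluate(components, target, n_train=20)
print(f"selected component {best}: train r = {r_train:.2f}, held-out r = {r_test:.2f}")
```

In real sessions the value of the immediate option drifts over trials, so an interleaved or blocked split may be more appropriate than the simple first-half and second-half split assumed here.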

      An additional caveat with the claim that ACC is tracking the value of the immediate reward option is that this value likely correlates with other behavioural variables, notably the current choice and recent choice history, that may be encoded in ACC. Encoding analyses (e.g. using linear regression to predict neural activity from behavioural variables) could allow quantification of the variance in ACC activity uniquely explained by option values after controlling for possible influence of other variables such as choice history (e.g. using a coefficient of partial determination).

      We agree that the ival tracking signal may be influenced by other variables – especially ones that are not cognitive but rather more generated by the autonomic system. We have included a discussion of this possibility in the Discussion section. Our previous work has explored the role of choice history on neural activity, please see White et al., (2024). 
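
The coefficient of partial determination mentioned by the reviewer can be sketched for the simplest case of one variable of interest and one covariate. This is an illustrative sketch only: the variable names and the toy data (a firing rate built exactly from a value term and a choice term) are hypothetical, and with several covariates a full multiple regression would be needed in place of the simple residualisation used here.

```python
import statistics

def residuals(y, z):
    """Residuals of y after simple linear regression on z (with intercept)."""
    mz, my = statistics.fmean(z), statistics.fmean(y)
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    return [b - my - slope * (a - mz) for a, b in zip(z, y)]

def partial_r2(y, x, z):
    """Fraction of the variance in y left unexplained by z that x accounts for.

    Computed as the squared correlation between the residuals of y and the
    residuals of x, each taken after regressing out z.
    """
    ry, rx = residuals(y, z), residuals(x, z)
    sxy = sum(a * b for a, b in zip(rx, ry))
    return sxy ** 2 / (sum(a * a for a in rx) * sum(b * b for b in ry))

# Hypothetical trial-wise data: option value, previous choice (0/1), and a
# firing rate constructed as an exact linear mix of the two.
value = [1, 2, 3, 4, 5, 6, 7, 8]
choice = [0, 1, 0, 1, 1, 0, 1, 0]
rate = [2 * v + 3 * c for v, c in zip(value, choice)]

# Because rate is fully determined by value and choice, value explains all of
# the variance that choice leaves over, so the result is numerically close to 1.
print(partial_r2(rate, value, choice))
```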

      Figure 5 argues that there are systematic differences in how ACC neurons represent the value of the immediate option (ival) in the G1 and G2 strategy sessions. This is interesting if true, but it appears possible that the effect is an artefact of the different distribution of option values between the two session types. Specifically, due to the way that ival is updated based on the subjects' choices, in G1 sessions where the subjects are mostly choosing the delayed option, ival will on average be higher than in G2 sessions where they are choosing the immediate option more often. The relative number of high, medium and low ival trials in the G1 and G2 sessions will therefore be different, which could drive systematic differences in the regression fit in the absence of real differences in the activity-value relationship. I have created an ipython notebook illustrating this, available at: https://notebooksharing.space/view/a3c4504aebe7ad3f075aafaabaf93102f2a28f8c189ab9176d4807cf1565f4e3. To verify that this is not driving the effect it would be important to balance the number of trials at each ival level across sessions (e.g. by subsampling trials) before running the regression.

      This is an excellent point and led us to abandon the linear regression-based approach to quantifying differences in ival coding across behavioral groups.
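
The trial-balancing control proposed by the reviewer can be sketched as follows. This is a hedged illustration, not code from the manuscript or from the reviewer's notebook: the function name `balance_by_level`, the session labels, and the toy trial lists are all hypothetical. For every value level present in all session types, it keeps the same number of randomly chosen trials per session type, so that a later regression sees matching value distributions.

```python
import random
from collections import Counter, defaultdict

def balance_by_level(sessions, seed=0):
    """Subsample trials so each session type has equal counts per value level.

    sessions: dict mapping a session label to a list of (level, activity)
    trials. Levels missing from any session type are dropped entirely.
    """
    rng = random.Random(seed)
    by_level = {name: defaultdict(list) for name in sessions}
    for name, trials in sessions.items():
        for level, activity in trials:
            by_level[name][level].append((level, activity))

    # Only levels observed in every session type can be balanced.
    shared = set.intersection(*(set(d) for d in by_level.values()))

    balanced = {name: [] for name in sessions}
    for level in sorted(shared):
        n = min(len(by_level[name][level]) for name in sessions)
        for name in sessions:
            balanced[name].extend(rng.sample(by_level[name][level], n))
    return balanced

# Hypothetical trials: (immediate-option value level, placeholder activity).
sessions = {
    "G1": [(lvl, 0.0) for lvl in [4, 5, 5, 6, 6, 6]],
    "G2": [(lvl, 0.0) for lvl in [1, 2, 4, 4, 5, 6]],
}
balanced = balance_by_level(sessions)
for name, trials in balanced.items():
    print(name, Counter(level for level, _ in trials))
```

Because subsampling discards trials at random, repeating the procedure over several seeds and averaging the downstream statistic would be one way to reduce its variance.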

      Recommendations for the authors:

      Reviewer #1 (Recommendations For The Authors):

      This paper was extremely hard to read. In addition to the issues raised in the public review (overly complex and incomplete analyses), one of the hardest things to deal with was the writing.

      Thank you for the feedback. Hopefully we have addressed this with our thorough rewrite. 

      The presentation was extremely hard to follow. I had to read through it several times to figure out what the task was. It wasn't until I got to the RL model Figure 2A that I realized what was really going on with the task. I strongly recommend having an initial figure that lays out the actual task (without any RL or modeling assumptions) and identifies the multiple different kinds of sessions. What is the actual data you have to start with? That was very unclear.

      Excellent idea. We have implemented this in Figure 1.  

      Labeling session by "group" is very confusing. I think most readers take "group" as the group of subjects, but that's not what you mean at all. You mean some sessions were one way and some were another. (And, as I noted in the public review, you ignore many of the sessions, which I think is not OK.) I think a major rewrite would help a lot. Also, I don't think the group analysis is necessary at all. In the public review, I recommend doing the analyses very differently and more classically.

      We have updated the group assignments in a manner that is more intuitive, reflects the delays, and includes all sessions.  

      The paper is full of arbitrary abbreviations that are completely unnecessary. Every time I came to "ival", I had to translate that into "number of pellets delivered on the immediate lever" and every time I came to dLP, I had to translate that into "delayed lever press". Making the text shorter does not make the text easier to read. In general, I was taught that unless the abbreviation is the common term (such as "DNA" not "deoxyribonucleic acid"), you should never use an abbreviation. While there are some edge cases (ACC probably over "anterior cingulate cortex"), dLP, iLP, dLPs, iLPs, ival, are definitely way over the "don't do that" line.

      We completely agree here and apologize for the excessive use of abbreviations. We have removed nearly all of them.

      The figures were incomplete, poorly labeled, and hard to read. A lot of figures were missing, for example

      Basic task structure

      Basic behavior on the task

      Scatter plot of the measures that you are clustering (lever press choice X number of pellets on the immediate lever, you can use color or multiple panels to indicate the delay to the delayed lever)

      Figure 3 is just a couple of examples. That isn't convincing at all.

      Figure 4 is missing labels. In Figure 4, I don't understand what you are trying to say.

      I don't see how the results on page 16 arise from Figure 6. I strongly recommend starting from the actual data and working your way to what it means rather than forcing this into this unreasonable "session group" analysis.

      We have completely reworked the Figures for clarity and content. 

      The statement that "no prior study has explored the cellular correlates of cognitive effort" is ludicrous and insulting. There are dozens of experiments looking at ACC in cognitive effort tasks, in humans, other primates, and rodents. There are many dozens of experiments looking at cellular correlates in intertemporal choice tasks, some with neural manipulations, some with ensemble recordings. There are many dozens of experiments looking at cellular relationships to waiting out a delay.

      We agree that our statement was extremely imprecise. We have updated this to say: “Further, a role for theta oscillations in allocating physical effort has been identified. However, the cellular mechanisms within the ACC that control and deploy types of cognitive effort have not been identified.”

      Reviewer #2 (Recommendations For The Authors):

      In Figure 2, the panels below E and F are referred to as 'right' - but they are below? I would give them letters.

      I would make sure that animal #s, neuron #s, and LFP#s are clearly presented in the results and in each figure legend. This is important to follow the results throughout the manuscript.

      Some additional proofreading ('Fronotmedial') might help with clarity.

      Based on our updates, this is no longer relevant.  

      Reviewer #3 (Recommendations For The Authors):

      In addition to the suggestions above to address specific issues, it would be useful to report some additional information about aspects of the experiments and analyses:

      Specify how spike sorting was performed and what metrics were used to select well isolated single units.

      Done.

      Provide histology showing the recording locations for each subject.

      Histological assessments of electrode placements are provided in White et al. 2024, but we provide an example placement. This has been added to the text.

      Indicate the sequence of recording sessions that occurred for each subject, including for each session what delay duration was used and which dataset the session contributed to, and indicate when the neural probes were advanced between sessions.

      We feel that this adds complexity unnecessarily, as we make no claims about holding units across sessions or about differences in coding along the dorsoventral gradient of ACC.

      Indicate the experimental unit when reporting uncertainty measures in figure legends (e.g. mean +/- SEM across sessions).

      Done.

    1. Author response:

      Before providing a brief provisional response to the two reviews, it is important to reiterate a few key points about our work. First, our paper is largely a computational biophysics paper, augmented by experimental results. Generally speaking, computational biophysics work intends to achieve one of two things (or both). One is to provide more molecular level insight into various behaviors of biomolecular systems that have not been (or cannot be) provided by qualitative experimental results alone. The second general goal of computational biophysics is to formulate new hypotheses to be tested subsequently by experiment. In our paper, we have achieved both of these goals and then confirmed the key computational results by experiment.

      The first reviewer has some valuable points, which can be addressed as follows (and will be emphasized in the revised version of the paper): (1) Yes the simulations of capsid rupture in the NPC and capsid-only are directly comparable as both have approximately the same number of bound LEN, as determined by following the LEN-capsid interaction protocol described in the main text (around Fig 6) and in the SI section S3; (2) While we have stressed this point in several places in the manuscript, here again we stress that coarse-grained (CG) MD time is not the same as real time. The point of CG simulations is to accelerate the timescale of the MD and the associated sampling, so the CG “time” from the MD integrator needs to be rescaled to associate a real time to it. As such, our CG simulation is not representing a microsecond of real time but rather something much longer. We will emphasize this again in the revised text. (3) Actually, we think that the parameterization of the LEN model and the LEN-capsid interactions is well described in the text associated with Fig 6 and in SI section S3. It is true that this one part of the CG model was parameterized “top-down” given the good experimental structures of bound LEN to capsid and other data, but the rest of the CG model is “bottom-up” (meaning developed from well-defined coarse-graining statistical mechanics as applied to molecular level structures and interactions, see also below). 

      As for the second reviewer, this review is quite problematic in our view as the reviewer seems to think that quoting a number of qualitative experimental results is sufficient to undermine the impact of our paper (it is not) and, furthermore, the reviewer appears to have a very minimal understanding of “bottom-up” CG modeling, which we have utilized. This modeling does not in fact rely on the “assumptions” this reviewer alleges we have relied on. (As an aside, it could be helpful for this reviewer to study the review by Jin et al, https://doi.org/10.1021/acs.jctc.2c00643, in order to become more familiar with the field and our approach before criticizing it.) We also note that our main HIV capsid-NPC docking model is already published in PNAS (https://doi.org/10.1073/pnas.2313737121), where it underwent rigorous peer review. In our forthcoming full response to the reviews and in the revised paper we will attempt to address a number of this reviewer's comments, but the number, extent, and tone of this collection of criticisms, for us, calls into question the objectivity of this reviewer, not to mention the reviewer’s rather weak understanding of what we have done and how we have done it.

      Finally, while we certainly appreciate the overall positive eLife assessment, we are disappointed by the statement “some mechanistic interpretations rely on assumptions embedded in the simulations, leaving parts of the evidence incomplete”. Of course, all simulations (and experiments) rely on certain assumptions, but we have gone to great lengths to provide a “bottom-up” approach to our modeling, based on underlying molecular level structures and interactions, and we have provided experimental validation of the main simulation predictions. It seems that the comments of the second reviewer may have influenced this point of view, but we do not feel it is justified.

    1. Reviewer #2 (Public review):

      Summary:

      Previous studies by some of the authors of the current manuscript showed that healthy human newborns memorize recently learned nonsense words. They exposed neonates to a familiarization period (several minutes) when multiple repetitions of a bisyllabic word were presented, uttered by the same speaker. Then they exposed neonates to an "interference period" when newborns listened to music or the same speaker uttering a different pseudoword. Finally, neonates were exposed to a test period when infants heard the familiarized word again. Interestingly, when the interference was music, the recognition of the word remained. Recognition of the word was measured by using the NIRS technique, which estimates the regional brain oxygenation at the scalp level. Specifically, the brain response to the word in the test was reduced, unveiling a familiarity effect, while an increase in regional brain oxygenation corresponds to the detection of a "new word" due to a novelty effect. In previous studies, music did not erase the memory traces for a word (familiarity effect), while a different word uttered by the same speaker did.

      The current study aims at exploring whether and how word memory is interfered with by other speech properties, specifically the changes in the speaker, while young children can distinguish speakers by processing the speech. The authors' main hypothesis anticipates that new speaker recognition would produce less interference in the familiarized word because somehow neonates "separate" the processing of both words (familiarized uttered by one speaker, and interfering word, uttered by a different speaker), memorizing both words as different auditory events.

      From my point of view, this hypothesis is interesting, since the results would contribute to estimating the role of the speaker in word learning and speech processing early in life.

      Strengths:

      (1) New data from neonates. Exploring neonates' cognitive abilities is a big challenge, and we need more data to enrich the knowledge of the early steps of language acquisition.

      (2) The study contributes new data showing the role of speaker (recognition) on word learning (word memory), a quite unexplored factor. The idea that neonates include speakers in speech processing is not new, but its role in word memory has not been evaluated before. The possible interpretation is that neonates integrate the process of the linguistic and communicative aspects of speech at this early age.

      (3) The study proposes a quite novel analytic approach. The new mixed models allow exploring the brain response considering an unbalanced design. More than the loss of data, which is frequent in infants' studies, the familiarization, interference and learning processes may take place at different moments of the experiment (e.g. related to changes in behavioural states along the experiment) or expressed in different regions (e.g. related to individual variations in optodes' locations and brain anatomy).

      Weaknesses:

      I did not find major weaknesses. However, I would like to have more discussion or explanation on the following points.

      (1) It would be fine to report the contribution of each infant to the analysis, i.e. how many good blocks, 1 to 5 in sequence 1 and 2, were provided by each infant.

      (2) Why did the factor "blocknumber" range from 0 to 4? The authors should explain what block zero means and why not 1 to 5.

      (3) I would suggest attempting to integrate the changes in brain activity across the 3 phases. That is, whether changes in familiarization relate to changes in the test and interference phases. For instance, in Figure 2, the brain response distinguishes between same and novel words that occurred over IFG and STG in both hemispheres. However, in the right STG there was no initial increase in the brain response, and the response for the same word was higher than the one for novel words in the 5th block.

      (4) Similarly, it is quite amazing that the brain did not increase the activity with respect to the familiarization during the interference phase, mainly over the left hemisphere, even if both the word and speaker changed. Although the discussion considers these findings, an integrated discussion of the detection of novel words and the detection of a novel speaker over time may benefit from a greater integration of the results.

      Appraisal:

      The authors achieved their aims because the design and analytic approaches showed significant differences. The conclusions are based on these results. Specifically, the hypothesis that neonates would memorize words after interference, when interfered speech is pronounced by a different speaker, was supported by the data in blocks 2 and 5, and the potential mechanisms underlying these findings were discussed, such as separate processing for different speakers, likely related to the recognition of speaker identity.

      I think the discussion is well-structured, although I would suggest integrating the changes across the three phases of the study, perhaps by comparing with other regions not related to speech processing.

      Evaluating neonates is a challenge because their physiology is constantly changing. For instance, within 9 minutes, newborns may transition between different behavioral states and experience different physiological needs.

      This study offers the opportunity to inspire looking for commonalities and individual differences when investigating early memory capacities of newborns.

    1. The rhythm of today, like every day we have lived here on Turtle Island, is made possible through the historic and ongoing processes and ideologies of colonialism. Importantly, it is also made possible through ongoing and persistent resistance to colonialism.

      This made me think about how, in my own household, which I manage mostly on my own, the things I rely on, such as housing and education systems, exist because of colonial systems, even though they may appear to me as just "regular" life.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this paper, the authors develop a biologically plausible recurrent neural network model to explain how the hippocampus generates and uses barcode-like activity to support episodic memory. They address key questions raised by recent experimental findings: how barcodes are generated, how they interact with memory content (such as place and seed-related activity), and how the hippocampus balances memory specificity with flexible recall. The authors demonstrate that chaotic dynamics in a recurrent neural network can produce barcodes that reduce memory interference, complement place tuning, and enable context-dependent memory retrieval, while aligning their model with observed hippocampal activity during caching and retrieval in chickadees.

      Strengths:

      (1) The manuscript is well-written and structured.

      (2) The paper provides a detailed and biologically plausible mechanism for generating and utilizing barcode activity through chaotic dynamics in a recurrent neural network. This mechanism effectively explains how barcodes reduce memory interference, complement place tuning, and enable flexible, context-dependent recall.

      (3) The authors successfully reproduce key experimental findings on hippocampal barcode activity from chickadee studies, including the distinct correlations observed during caching, retrieval, and visits.

      (4) Overall, the study addresses a somewhat puzzling question about how memory indices and content signals coexist and interact in the same hippocampal population. By proposing a unified model, it provides significant conceptual clarity.

      Weaknesses:

      The recurrent neural network model incorporates assumptions and mechanisms, such as the modulation of recurrent input strength, whose biological underpinnings remain unclear. The authors acknowledge some of these limitations thoughtfully, offering plausible mechanisms and discussing their implications in depth.

      One thread of questions that authors may want to further explore is related to the chaotic nature of activity that generates barcodes when recurrence is strong. Chaos inherently implies sensitivity to initial conditions and noise, which raises questions about its reliability as a mechanism for producing robust and repeatable barcode signals. How sensitive are the results to noise in both the dynamics and the input signals? Does this sensitivity affect the stability of the generated barcodes and place fields, potentially disrupting their functional roles? Moreover, does the implemented plasticity mitigate some of this chaos, or might it amplify it under certain conditions? Clarifying these aspects could strengthen the argument for the robustness of the proposed mechanism.

      In our model, chaos is used to produce a random barcode when forming memories, but memory retrieval depends on attractor dynamics. Specifically, the plasticity update at the end of the cache creates an attractor state, and then afterwards for successful memory retrieval the network activity must settle into this attractor rather than remaining chaotic. This attractor state is a conjunction of memory content (place and seed activity) and memory index (barcode activity). Thus a barcode is ‘reactivated’ when network dynamics during retrieval settle into this cache attractor, or in other words chaotic dynamics do not need to generate the same barcode twice.
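      The store-then-settle logic described above can be illustrated with a deliberately simplified sketch. It is not the authors' rate-based network: it assumes binary (+1/-1) units, a single Hopfield-style outer-product update, and hypothetical names (`store_attractor`, `recall`). The stored pattern combines place, seed, and barcode components, and a place-only cue is enough to bring the barcode component back.

```python
import numpy as np

def store_attractor(pattern):
    """Hebbian outer-product weights that make `pattern` (+1/-1 entries) a fixed point."""
    return np.outer(pattern, pattern) / pattern.size

def recall(weights, cue, steps=10):
    """Iterate sign dynamics from `cue` until the state (ideally) settles."""
    state = cue.copy()
    for _ in range(steps):
        state = np.where(weights @ state >= 0, 1.0, -1.0)
    return state

rng = np.random.default_rng(0)
n_units = 400
place = rng.standard_normal(n_units)    # stands in for p(l)
seed = rng.standard_normal(n_units)     # stands in for s
barcode = rng.standard_normal(n_units)  # stands in for b (random, as if from chaos)

# Assumption: the cached state is the sign of the summed components.
attractor = np.sign(place + seed + barcode)
weights = store_attractor(attractor)

# Recall is cued by place input alone; the barcode is never presented.
cue = np.sign(place)
recalled = recall(weights, cue)

barcode_corr_cue = np.corrcoef(cue, barcode)[0, 1]
barcode_corr_recalled = np.corrcoef(recalled, barcode)[0, 1]
```

      In this toy setting the barcode is reinstated by pattern completion rather than regenerated, which is the sense in which chaotic dynamics never need to produce the same barcode twice.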

      The reviewer raises an important point, which is how sensitivity to initial conditions and noise would affect the reliability of our proposed mechanism. The key question here is how noise will affect the network’s dynamics during retrieval. Would adding noise to the dynamics make memory retrieval more difficult? We thank the reviewer for suggesting we investigate this further, and below describe our experiments and changes to the manuscript to better address this topic.

      We first experimented with adding independent Gaussian distributed noise into each unit, drawn independently at each timestep. We analyzed recall accuracy using the same task and methods as Fig. 4F while varying the magnitude of noise. Memory recall was quite robust to this form of noise, even as the magnitude of noise approached half of the signal amplitude. This first experiment added noise into the temporal dynamics of the network. We subsequently examined adding static noise into the network inputs, which can also be thought of as introducing noise into initial conditions. Specifically, we added independent Gaussian distributed noise into each unit, with the random value held constant for the extent of temporal dynamics. This perturbation decreased the likelihood of memory recall in a graded manner with noise magnitude, without dramatically changing the spatial profile. Examination of dynamics on individual trials revealed that the network failed to converge onto a cache attractor on some random fraction of trials, with other trials appearing nearly identical to noiseless results. We now include these results in the text and as a new supplementary figure, Figure S4AB.
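      The difference between the two perturbations can be made concrete with a minimal sketch. It assumes a generic discrete-time tanh network rather than the manuscript's actual dynamics, and the function name and parameters (`run_network`, `dynamic_sd`, `static_sd`) are hypothetical. Dynamic noise is redrawn at every timestep, whereas static noise is drawn once and held fixed, perturbing both the input and the initial state.

```python
import numpy as np

def run_network(weights, place_input, steps=100,
                dynamic_sd=0.0, static_sd=0.0, rng=None):
    """Run a toy tanh network with optional dynamic and/or static Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    n_units = weights.shape[0]
    # Static noise: one draw per unit, held constant for the whole run.
    static = static_sd * rng.standard_normal(n_units)
    state = static.copy()  # the initial state is zero apart from static noise
    for _ in range(steps):
        # Dynamic noise: a fresh independent draw per unit at each timestep.
        dynamic = dynamic_sd * rng.standard_normal(n_units)
        state = np.tanh(weights @ state + place_input + static + dynamic)
    return state

rng = np.random.default_rng(0)
n_units = 100
weights = rng.standard_normal((n_units, n_units)) / np.sqrt(n_units)
place_input = rng.standard_normal(n_units)

clean = run_network(weights, place_input)
with_dynamic = run_network(weights, place_input, dynamic_sd=0.5,
                           rng=np.random.default_rng(1))
with_static = run_network(weights, place_input, static_sd=0.5,
                          rng=np.random.default_rng(2))
```

      Sweeping `dynamic_sd` or `static_sd` and scoring recall on each trial would give curves of the kind described for Figure S4A,B; the toy network here is not expected to reproduce those results quantitatively.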

      To clarify the network dynamics and the purpose of chaos in our model, we make the following modifications in text:

      Section 2.3, paragraph 2 (starting at “To store memories…”):

      “…place inputs arrive into the RNN, recurrent dynamics generate an essentially random barcode, seed inputs are activated, and then Hebbian learning binds a particular pattern of barcode activity to place- and seed-related activity.”

      Section 2.3, paragraph 3 (starting at “Memory recall in our network…”): As an example, consider a scenario in which an animal has already formed a memory at some location $l$, resulting in the storage of an attractor $\vec{a}$ into the RNN. The attractor $\vec{a}$ can be thought of as a linear combination of place input-driven activity $p(l)$, seed input-driven activity $s$, and a recurrent-driven barcode component $b$. Later, the animal returns to the same location and attempts recall (i.e. sets $r = 1$, Figure 3B). Place inputs for location $l$ drive RNN activity towards $p(l)$, which is partially correlated with attractor $\vec{a}$, and the recurrent dynamics cause network activity to converge onto attractor $\vec{a}$. In this way, barcode activity $b$ is reactivated, along with the place and seed components stored in the attractor state, $p(l)$ and $s$. The seed input can also affect recall, as discussed in the following section.

      Section 2.4, final paragraph (starting “We further examined how model hyperparameters affected performance on these tasks”), added the following describing new results on adding noise: We found that adding noise to the network's temporal dynamics had little effect on memory recall performance (Figure S4A). However, large static noise vectors added to the network's input and initial state decreased the overall probability of memory recall, but not its spatial profile (Figure S4B).

      It may also be worth exploring the robustness of the results to certain modeling assumptions.  For instance, the choice to run the network for a fixed amount of time and then use the activity  at the end for plasticity could be relaxed.

      As described above, chaotic dynamics are necessary to generate a barcode during a cache, but not to reactivate that barcode during retrieval. During a successful memory retrieval, network activity settles into an attractor state and thus does not depend on the duration of simulated dynamics. The choice of duration to run dynamics during caching is important, but only insofar as activity significantly decorrelates from the initial state. We show in Figure S1B that decorrelation saturates ~t=25, and thus any random time point t > 25 would be similarly effective. We used a fixed duration runtime for caches only to avoid introducing unnecessary complication into our model.

      Reviewer #2 (Public review):

      Summary:

      Striking experimental results by Chettih et al 2024 have identified high-dimensional, sparse patterns of activity in the chickadee hippocampus when birds store or retrieve food at a given site. These barcode-like patterns were interpreted as "indexes" allowing the birds to retrieve from memory the locations of stored food.

      The present manuscript proposes a recurrent network model that generates such barcode activity and uses it to form attractor-like memories that bind information about location and food. The manuscript then examines the computational role of barcode activity in the model by simulating two behavioral tasks, and by comparing the model with an alternate model in which barcode activity is ablated.

      Strengths of the study:

      - Proposes a potential neural implementation for the indexing theory of episodic memory

      - Provides a mechanistic model of striking experimental findings: barcode-like, sparse patterns of activity when birds store a grain at a specific location

      A particularly interesting aspect of the model is that it proposes a mechanism for binding discrete events to a continuous spatial map, and demonstrates the computational advantages of this mechanism.

      Weaknesses:

      The relation between the model and experimentally recorded activity needs some clarification

      The relation with indexing theory could be made more clear

      The importance of different modeling ingredients and dynamical mechanisms could be made more clear

      The paper would be strengthened by focusing on the most essential aspects

      Comments:

      The model distinguishes between "barcode activity" and "attractors". Which of the two corresponds to experimentally-recorded barcodes? I would presume the attractors. A potential issue is that the attractors are, as explained in the text (l.137), conjunctions of place activity, barcode activity and "seed" inputs. The fact that the seed activity is shared across attractors seems to imply that they have a non-zero correlation independent of distance. Is that the case in the model? If I understand correctly, Fig 3D shows correlations between an attractor and barcodes at different locations, but correlations between attractors at different locations are not shown. Fig 1 F instead shows that correlations between recorded retrieval activities decay to zero with distance.

      More generally, the fact that the expression "barcode" is apparently used with different meanings in the model and in the experiments is potentially confusing (in the model they correspond to activity generated during caching, and this activity is distinct from the memories; my understanding is that in the experiments barcodes correspond to both caching and retrieval, but perhaps I am mistaken?).

      Our intent is to use the expression “barcode” as similarly as possible between model and experimental work. The reviewer points out that the connection between barcodes in experimental and modeling work is unclear, as well as the relation of “attractors” in our model to previous experimental results. The meaning of ‘barcode’ is absolutely critical—we clarify below our intended meaning, and then describe changes to the manuscript to highlight this.

      In experiments, we observed that activity during caching looked different than ordinary hippocampal activity (i.e. typical “place activity” observed during visits). Empirically there were two major differences. First, there was a pattern of neural activity which was present during every cache. This pattern was also present when birds visually inspected sites containing a cached seed, but not when visually inspecting an empty site. This is what we refer to as “seed activity”. Second, there was a pattern of neural activity which was unique to each cache. This pattern re-occurred during retrieval, and was orthogonal to place activity (see Fig. 1E-F). This is what we refer to as “barcode activity”. In summary, activity during a cache (or retrieval) contains a combination of three components: place activity, seed activity, and barcode activity.

      These experimental findings are recapitulated in our model, as activity during a cache contains a combination of three components: place activity driven by place inputs, seed activity driven by seed inputs, and barcode activity generated by recurrent dynamics. Cache activity in the model corresponds to cache activity in experiments, and barcodes in the model correspond to barcodes in experiments. Our model additionally has “attractors”, meaning that network connectivity changes so that the activity generated during a simulated cache becomes an attractor state of network dynamics. “Attractors” refers to a feature of network dynamics, not a distinct activity state, and we do not yet know if these attractors exist in experimental data.

      Figure 3D, as described in the figure legend, is a correlation of activity during cache and retrieval (in purple), for cache-retrieval pairs at the same or at different sites. We believe this is what the reviewer asks to see: the correlation between attractor states for different cache locations. The reviewer makes an important point: seed activity is shared across all attractors, so then why are correlations not high for all locations? This is because attractors also have a place component, which is anti-correlated for distant locations. This is evident in Fig. 3D by noticing that visit-visit correlations (black line, corresponding to place activity only) are negative for distant locations, and the correlation between attractors (purple line, cache-retrieval pairs) is subtly shifted up relative to the black line (place code only) for these distant locations. The size of this shift is due to the relative magnitude of place and seed inputs. For example, if we increase the strength of the seed input during caching (blue line), we can further increase the correlation between attractors even for quite distant sites:

      Author response image 1.
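      The argument about the shared seed component can be checked with a small numerical sketch. It assumes cosine place tuning on a ring and a random seed pattern, omits the barcode component for simplicity, and uses hypothetical names (`place_activity`, `cache_correlation`); it is not the authors' simulation. For two sites on opposite sides of the arena, place-only activity is anti-correlated, and adding the same seed pattern to both states shifts the correlation upward by an amount that grows with seed strength.

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation across units between two activity vectors."""
    return np.corrcoef(u, v)[0, 1]

rng = np.random.default_rng(0)
n_units = 500
preferred = np.linspace(0.0, 2.0 * np.pi, n_units, endpoint=False)
seed_pattern = rng.standard_normal(n_units)  # shared across all caches

def place_activity(site_angle):
    """Assumed cosine place tuning around a circular arena."""
    return np.cos(preferred - site_angle)

def cache_correlation(site_a, site_b, seed_strength):
    """Correlation between two cache states that share the seed pattern."""
    state_a = place_activity(site_a) + seed_strength * seed_pattern
    state_b = place_activity(site_b) + seed_strength * seed_pattern
    return pearson(state_a, state_b)

site_a, site_b = 0.0, np.pi  # maximally distant sites
visit_corr = pearson(place_activity(site_a), place_activity(site_b))
weak_seed_corr = cache_correlation(site_a, site_b, seed_strength=0.3)
strong_seed_corr = cache_correlation(site_a, site_b, seed_strength=1.0)
```

      The ordering `visit_corr < weak_seed_corr < strong_seed_corr` mirrors the qualitative point made above: the offset between the cache-retrieval and visit-visit curves at large distances is set by the relative magnitude of seed and place inputs.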

      To clarify the manuscript, we made the following modifications:

      Section 2.2, first paragraph: We model the hippocampus as a recurrent neural network (RNN) (Alvarez and Squire, 1994; Tsodyks, 1999; Hopfield, 1982) and propose that recurrent dynamics can generate barcodes from place inputs. As in experiments, the model’s population activity during a cache should exhibit both place and barcode activity components.

      Section 2.3, paragraph 3 (starting at “Memory recall in our network…”): As an example, consider a scenario in which an animal has already formed a memory at some location $l$, resulting in the storage of an attractor $\vec{a}$ into the RNN. The attractor $\vec{a}$ can be thought of as a linear combination of place input-driven activity $p(l)$, seed input-driven activity $s$, and a recurrent-driven barcode component $b$. Later, the animal returns to the same location and attempts recall (i.e. sets $r = 1$, Figure 3B). Place inputs for $l$ drive RNN activity towards $p(l)$, which is partially correlated with attractor $\vec{a}$, and the recurrent dynamics cause network activity to converge onto attractor $\vec{a}$. In this way, barcode activity $b$ is reactivated as part of attractor $\vec{a}$, along with the place and seed components stored in the attractor state, $p(l)$ and $s$. The seed input can also affect recall, as discussed in the following section.

      The insights obtained from the network model for the computational role of barcode activity could be explained more clearly. The introduction starts by laying out the indexing theory, which proposes that the hippocampus links an index with each memory so that the memory is reactivated when the index is presented. The experimental paper suggests that the barcode activations play the role of indexes. Yet, in the model reactivations of memories are driven not by presenting bar-code activity, but by presenting place activity (Cache Presence task) or seed activity (Cache Location task). So it seems that either place activity or seed activity plays the role of indexes. Section 2.5 nicely shows that ultimately the role of barcode activity is to decorrelate attractors, which seems different from playing the role of indexes. I feel it would be useful for the Discussion to reassess more critically the relationship between barcodes, indexing theory, and key-value architectures.

      The reviewer highlights a failure on our part to clearly identify the connection between our findings on barcodes, indexing theory, and key-value architectures. This is another major component of the paper, and below we propose changes to the manuscript to clarify these concepts and their relationships. First, we will summarize the key points that were unclear in our original manuscript.

      The reviewer equates the concept of an ‘index’ with that of a ‘query’: the signal that drives memory reactivation. This may be intuitive, but it is not how a memory index was defined in indexing theory (e.g. Teyler & DiScenna 1986). In indexing theory, the index is a pattern of hippocampal activity that is (a) generated during memory formation, (b) separate from the activity encoding memory content, and (c) linked to memory content via associative plasticity. After memory formation, a memory might be queried by activating a partial set of the memory contents, which would then drive reactivation of the hippocampal index, leading to pattern completion of memory contents. See, for example, figure 1 of Teyler and DiScenna 1986. The ‘index’ is thus not the same as the ‘query’ that drives recall.

      We propose in this work that barcode activity is such an index. Indexing theory originally posited that memory content was encoded by neocortex, and memory index was encoded by hippocampus. However the experiments of Chettih et al. 2024 revealed that the hippocampus contained both memory content and memory index signals, and furthermore there was no division of cells into ‘content’ and ‘index’ subtypes. Thus our model drops the assumption of earlier work that index and content signals correspond to different neurons in different brain areas—a significant advance of our work. Otherwise, the experimentally observed barcodes and the barcodes generated by our computational model play the role of indices as originally defined.

      Our original manuscript was unclear on the relationship of indexing theory and key-value systems. Our work connects diverse areas of memory models, including attractor dynamics, key-value memory systems, and memory indexing. A full account of these literatures and their relationships may be beyond the scope of this manuscript, and we note that a recent review article (Gershman, Fiete, and Irie, 2025) further clarifies the relationship between key-value memory, indexing theory, and the hippocampus. We will cite this work in our discussion as a source for the interested reader.

      Briefly, a key-value memory system distinguishes between the address where a memory is stored, the ‘key’, and the content of that memory, the ‘value’. An advantage of such systems is that keys can be optimized for purposes independent of the value of each memory. The use of barcodes in our model to decorrelate memories is related to this optimization of keys in key-value memory systems. By generating barcodes and adding this to the attractor state corresponding to a cache memory, the ‘address’ of the memory in population activity is differentiated from other memories. Our work is thus consistent with the idea that hippocampus generates keys and implements a key storage system. However it is not so straightforward to equate barcodes with keys, as they are defined in key-value memory. As the reviewer points out, memory recall can be driven by location and seed inputs, i.e. it is content-addressable. We think of the barcode as modifying the memory address to better separate similar memories, without changing memory content, and the resulting memory can be recalled by querying with either content or barcode. Given the complex and speculative nature of these relationships, we prefer to note the salient connection of our work with ongoing efforts applying the key-value framework to biological memory, and leave the precise details of this connection to future work.
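      For readers unfamiliar with the framework, the key-value distinction can be sketched generically. This is a textbook-style soft-attention lookup, not the manuscript's RNN, and all names (`retrieve`, `sharpness`) are hypothetical. The point it illustrates is that memories with very similar content (values) remain separately retrievable when their addresses (keys) are decorrelated.

```python
import numpy as np

def retrieve(keys, values, query, sharpness=50.0):
    """Soft key-value lookup: weight each stored value by key-query similarity."""
    scores = sharpness * (keys @ query)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
n_memories, key_dim, value_dim = 5, 64, 8

# Keys: random unit vectors, hence nearly orthogonal (decorrelated addresses).
keys = rng.standard_normal((n_memories, key_dim))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# Values: deliberately similar content, differing only by small perturbations.
shared_content = rng.standard_normal(value_dim)
values = shared_content + 0.1 * rng.standard_normal((n_memories, value_dim))

recalled = retrieve(keys, values, query=keys[2])
```

      Unlike this sketch, the model discussed here can also be queried by content (place and seed inputs), which is why the response stops short of equating barcodes with keys.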

      We make the following changes in the manuscript to clarify these ideas:

      Introduction, first paragraph: In this scheme, during memory formation the hippocampus generates an index of population activity, and the neurons representing this index are linked with the neurons representing memory content by associative plasticity. Later, re-experience of partial memory contents may reactivate the index, and reactivation of the index drives complete recall of the memory contents.

      Discussion, 4th paragraph on key-value: Interestingly, prior theoretical work has suggested neural implementations for both key-value memory and attention mechanisms, arguing for their usefulness in neural systems such as long term memory (Kanerva, 1988; Tyulmankov et al., 2021; Bricken and Pehlevan, 2021; Whittington et al., 2021; Kozachkov et al., 2023; Krotov and Hopfield, 2020; Gershman et al., 2025). In this framework, the address where a memory is stored (the key) may be optimized independently of the value or content of the memory. In our model, barcodes improve memory performance by providing a content-independent scaffold that binds to memory content, preventing memories with overlapping content from blurring together. Thus barcodes can be considered as a change in memory address, and our model suggests important connections between recurrent neural activity and key generation mechanisms. However we note that barcodes should not be literally equated with keys in key-value systems, as our model’s memory is ‘content-addressable’: it can be queried by place and seed inputs.

      The model includes a number of non-standard ingredients. It would be useful to explain which of these ingredients and which of the described mechanisms are essential for the studied phenomenon. In particular:

      - the dynamics in Eq.2 include a shunting inhibition term. Is it essential and why?

      The shunting inhibition is important as it acts to normalize the network activity to prevent runaway excitation. We hope to clarify this further by amending the following sentence in section 2.2: “g (·) is a leak rate that depends on the average activity of the full network, representing a form of global shunting inhibition that normalizes network activity to prevent runaway excitation from recurrent dynamics.”

      - same question for the global inhibition included in the random connectivity;

      The distribution from which connectivity strengths are drawn has a negative mean (global inhibition). This causes activity during caching (i.e. r = 1) to be sparser than activity during visits (i.e. r = 0), and was chosen to match experimental findings. In figures 2B and S2B we show that our model can transition between a mode with place code only, barcode only, or a mode containing both, by changing the variance of the weight distribution while holding the mean constant. We suggest clarifying this by editing the following in section 2.2, paragraph 2: “We initialize the recurrent weights from a random Gaussian distribution, . where $N_X$ is the number of RNN neurons and μ < 0, reflecting global subtractive inhibition that encourages sparse network activity to match experimental findings (Chettih et al. 2024).”

      - the model is fully rate-based, but for certain figures, spikes are randomly generated. This seems superfluous.

      Spikes are simulated for one analysis and one visualization, where it is important to consider noise or variability in neural responses across trials. First, for Fig. 2H,J, we generated spikes to allow a visual comparison to figures that can be easily generated from experimental data. Second, and more significantly, for the analysis underlying Fig. 3D, it is essential to simulate variability in neural responses. Because our rate-based models are noiseless, the RNN’s rate vector at site distance = 0 will always be the same and result in a correlation of 1 for both visit-visit and cache-retrieval. However, we show that, if one interprets the rate as a noisy Poisson spiking process, the correlation at site distance = 0 between a cache-retrieval pair is higher than that of two visits. This is because under a Poisson spiking model, the signal-to-noise ratio is higher for cache-retrieval activity, where rates are higher in magnitude. The greater correlation for a cache-retrieval pair at the same site, relative to visits at the same site, is an experimental finding that was critical for our model to reproduce. We detail clarifications to the manuscript below in response to the reviewer’s following and related question.
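      The signal-to-noise argument can be illustrated with a short sketch. It assumes an arbitrary tuning profile and uses a hypothetical helper, `mean_pair_correlation`; the rate values are illustrative and not taken from the model. Two independent Poisson samples drawn from the same rate vector are correlated across units, and because Poisson variance equals the mean, that correlation increases when all rates are scaled up.

```python
import numpy as np

def mean_pair_correlation(rates, n_pairs=200, rng=None):
    """Average correlation between two independent Poisson samples of `rates`."""
    rng = np.random.default_rng() if rng is None else rng
    correlations = []
    for _ in range(n_pairs):
        trial_a = rng.poisson(rates)  # spike counts on one simulated trial
        trial_b = rng.poisson(rates)  # spike counts on a second trial
        correlations.append(np.corrcoef(trial_a, trial_b)[0, 1])
    return float(np.mean(correlations))

rng = np.random.default_rng(0)
tuning = rng.uniform(0.5, 1.5, size=500)  # assumed relative rates across units

# Same tuning profile, different overall magnitude (a stand-in for
# lower-rate versus higher-rate activity at the same site).
low_rate_corr = mean_pair_correlation(1.0 * tuning, rng=rng)
high_rate_corr = mean_pair_correlation(10.0 * tuning, rng=rng)
```

      With noiseless rates both correlations would equal 1, so a difference between conditions at site distance 0 only appears once trial-to-trial variability of this kind is simulated.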

      How are the correlations determined in the model (e.g., Fig 2 B)? The methods explain that they are computed from Poisson-generated spikes, but over which time period? Presumably during steady-state responses, but are these responses time-averaged?

      The reviewer points out a lack of clarity in our original manuscript. Correlations for events (caches, retrievals and visits) at different sites are calculated in two sections of the paper (2B, 3D), for different purposes and with slight differences in methods:

      - For Figure 2B, no spikes are simulated. Note that the methods mentioning Poisson spike generation specify only Fig. 2H,J and Fig. 3D. We simply take the network’s rate vector at timestep t=100 (when the decorrelating effect of chaotic dynamics has saturated, S1A-B) and correlate this vector when generated at different locations. We now clarify this in the legend for Figure 2B: “We show correlation of place inputs (gray) and correlation of the RNN's rate vector at t = 100 (black).”

      - For Figure 3D, we want to compare the model to empirical results from Chettih et al. 2024, and reproduced in this paper in Fig. 1E-F. These empirical results are derived from correlating vectors of spiking activity on pairs of single trials, and are thus affected by noise or variability in neural responses as described in our response to the reviewer’s previous question. We thus took the RNN’s rate vector at t=100 and simulated spiking data by drawing samples from a Poisson distribution to get spike counts. Our original manuscript was unclear about this, and we suggest the following changes:

      - Legend for Figure 3D: D. Correlation of Poisson-generated spikes simulated from RNN rate vectors at two sites, plotted as a function of the distance between the two sites.

      - Section 2.3, last paragraph: Population activity during retrieval closely matches activity during caching, and is substantially decorrelated from activity during visits (Figure 3C). To compare our model with the empirical results reproduced in Figure 1E,F, we ran in silico experiments with caches and retrievals at varying sites in the circular arena. We simulated Poisson-generated spikes drawn from our network's underlying rates to match the intrinsic variability in empirical data (see Methods).

      - Methods, subsection Spatial correlation of RNN activity for cache-retrieval pairs at different sites: To calculate correlation values as in Figure 3D, we simulated experiments where 5 sites were randomly chosen for caching and retrieval. To compare model results to the empirical data in Fig. 1E,F, which includes intrinsic neural variability, we sampled Poisson-generated spike counts from the rates output by our model. Specifically, for RNN activity $\vec{r}_i$ at location $i$, using the rates at t=100 as elsewhere, we first generate a sample vector of spikes…

      I was confused by early and late responses in Fig 2 C. The text says that the activity is initialized at zero, so the response at t=0 should be flat (and zero). More generally, I am not sure I understand why the dynamics matter for the phenomenon at all, presumably the decorrelation shown in Fig 2B depends only on steady state activity (cf previous question).

      Thanks for catching this mistake. The legend has been updated to indicate that the ‘early’ response is actually at t=1, when network activity reflects place inputs without the effects of dynamics. The reviewer is correct that we are primarily interested in the ‘late’ response of the network. All other results in the paper use this late response at t=100. As shown in Fig. S2A,B, this timepoint is not truly a steady state, as activity in the network continues to change, but the decorrelation of network activity with place-driven activity has saturated.

      We include the early response in Fig. 2C for visual comparison of the purely place-driven early activity with the eventual network response. It is also relevant since, as the reviewer points out above, there is a shunting inhibition term in the dynamics that is present during both low and high recurrent strength simulations.

      Related to the previous point, the discussion of decorrelation (l.79 - 97) is somewhat confusing. That paragraph focuses on chaotic activity, but chaos decorrelates responses across different time points. Here the main phenomenon is the decorrelation of responses across different spatial inputs (Fig 2B). This decorrelation is presumably due to the fact that different inputs lead to different non-trivial steady-state responses, but this requires some clarification. If that is correct, the temporal chaos adds fluctuations around these non-trivial steady-state responses, but that alone would not lead to the decorrelation shown in Fig 2B.

      We agree with the reviewer that chaotic activity produces a decorrelation across time points. Because of chaotic dynamics, network activity does not settle into a trivial steady-state, and instead evolves from the initial state in an unpredictable way. The network does not settle into a steady-state pattern, but both the decorrelation of network state with initial state and the rate of change in the network state saturate after ~t=25 timesteps, as shown in Fig. S2A-B.

      The initial activity for nearby states is similar, due to them receiving similar place inputs.

      Because network activity is chaotically decorrelated from this initial state by temporal dynamics, ‘late stage’ network activity between nearby spatial states is less correlated than ‘early stage’ activity. Thus the temporal decorrelation produces a spatial decorrelation. We believe that the changes we have introduced to the manuscript in revision will make this point clearer in our resubmission.
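
      The argument above (similar early activity for nearby states, followed by chaotic divergence) can be illustrated with a minimal toy sketch. It is not the model from the manuscript: the discrete-time update `x_{t+1} = tanh(gain * J @ x_t)`, the Gaussian place tuning, the gain of 3, the network size, and all function names are assumptions chosen only for illustration, and the place input here merely sets the initial state rather than persisting, so the decorrelation is close to complete rather than partial.

```python
import numpy as np

def place_input(position, preferred, width=0.1):
    """Gaussian place tuning on a circular track of length 1 (assumed form)."""
    d = np.abs(preferred - position)
    d = np.minimum(d, 1.0 - d)  # circular distance to each unit's preferred position
    return np.exp(-d**2 / (2 * width**2))

def run_network(x0, J, gain, n_steps):
    """Iterate the assumed update x_{t+1} = tanh(gain * J @ x_t) and return the final state."""
    x = x0.copy()
    for _ in range(n_steps):
        x = np.tanh(gain * (J @ x))
    return x

rng = np.random.default_rng(0)
n_units = 500
preferred = rng.uniform(0.0, 1.0, n_units)
# Random recurrent weights with variance 1/N; gain > 1 places this toy map in a chaotic regime.
J = rng.normal(0.0, 1.0 / np.sqrt(n_units), (n_units, n_units))

# 'Early' activity: purely place-driven patterns for two nearby positions.
early_a = place_input(0.50, preferred)
early_b = place_input(0.52, preferred)

# 'Late' activity: the same two patterns after 100 steps of recurrent dynamics.
late_a = run_network(early_a, J, gain=3.0, n_steps=100)
late_b = run_network(early_b, J, gain=3.0, n_steps=100)

early_corr = np.corrcoef(early_a, early_b)[0, 1]
late_corr = np.corrcoef(late_a, late_b)[0, 1]
print(f"early correlation: {early_corr:.2f}, late correlation: {late_corr:.2f}")
```

      In this sketch the two nearby positions start out highly correlated, and the recurrent dynamics amplify the small initial difference until the late patterns are far less correlated, which is the sense in which temporal decorrelation yields spatial decorrelation.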

      A key ingredient of the model is that the recurrent interactions are switched on and off between "caching" and "visits". The discussion argues that a possible mechanism for this is recurrent inhibition (l.320), which would need to be added. However two forms of inhibition are already included in the model. The text also says that it is unclear how units in the model should be mapped onto E and I neurons. However the model makes explicit assumptions about this, in particular by generating spikes from individual neurons. Altogether, I did not find that part of the Discussion convincing.

      We agree with the reviewer that this section is a limitation of our current work, and in fact it is an area of ongoing research. However, we think the advances in this current work warrant publication despite this topic requiring further research. We attempted to discuss this limitation explicitly, and note that the other reviewer pointed this section out as particularly helpful. We do not think it is problematic for a realistic model of the brain to ultimately include three or even more forms of inhibition. We do not think that Poisson-generated spikes commit us to interpreting network units as single neurons. Spikes are not a core part of our model’s mechanism, and were used only as a means of introducing variability on top of deterministic rates for specific analyses. Furthermore, one could still view network units as pools of both E and I spiking neurons. We would welcome further recommendations the reviewer believes are important to note in this section on our model’s limitations.

      On lines 117-120 the text briefly mentions an alternate feed-forward model and promptly discards it. The discussion instead says that a "separate possibility is that barcodes are generated in a circuit upstream of where memories are stored, and supplied as inputs to the hippocampal population", and that this possibility would lead to identical conclusions. The two statements seem a bit contradictory. It seems that the alternative possibility would replace the need for switching on and off recurrent interactions, with a mechanism where barcode inputs are switched on and off. This alternate scenario is perhaps more plausible, so it would be useful to discuss it more explicitly.

      We apologize for the confusion here, which seems to be due to our phrasing in the discussion section. We do reject the idea that a simple feed-forward model could generate the spatial correlation profile observed in data, as mentioned in the text and included as Fig. S2. Our statement in the discussion may have seemed contradictory because here we intended to discuss the possibility that an upstream area generates barcodes, for example by the chaotic recurrent dynamics proposed in our work, while a downstream network receives these barcodes as inputs and undergoes plasticity to store memories as attractors. We did not intend to suggest any connection to the feedforward model of barcode generation, and apologize for the confusion. We claim that this ‘2 network’ solution would lead to similar conclusions because the upstream network would need an efficient means of barcode generation, the downstream network would need an efficient means of storing memory attractors, and separating these functions into different networks is not likely to affect, for example, the advantage of partially decorrelating memory attractors. Moreover, the downstream network would still require some form of recurrent gating, so that during visits it exhibits place activity without activating stored memory attractors.

      We thus chose a 1-network instead of a 2-network solution because it was simpler and, we believe, more interesting. In the absence of more data it is challenging to say which is more plausible, so we wanted to mention the possibility of a 2-network solution. We suggest the following changes to the manuscript:

      - Discussion, 3rd paragraph: “Alternatively, other mechanisms may be involved in generating barcodes. We demonstrated that conventional feed-forward sparsification (Babadi and Sompolinsky, 2014; Xie et al., 2023) was highly inefficient, but more specialized computations may improve this (Földiak, 1990; Olshausen and Field, 1996; Sacouto and Wichert, 2023; Muscinelli et al., 2023). Another possibility is that barcodes are generated in a separate recurrent network upstream of the recurrent network where memories are stored. In this 2-network scenario, the downstream network receives both spatial tuning and barcodes as inputs. This would not obviate the need for modulating recurrent strength in the downstream network to switch between input-driven modes and attractor dynamics. We suspect separating barcode generation and memory storage in separate networks would not fundamentally affect our conclusions.”

      As a minor note, the beginning of the discussion states that the presented model is similar to previous recurrent network models of the hippocampus. It would be worth noting that several of the cited works assign a very different role to recurrent interactions: they generate place cell activity, while the present model assumes it is inherited from upstream inputs.

      We are not sure how best to modify the paper to address this suggestion. As far as we know, all of the cited models which deal with spatial encoding do assume that the hippocampus receives a spatially-modulated or spatially-tuned input. For example, the Tsodyks 1999 paper cited in this paragraph uses exponentially-decaying place inputs to each neuron highly similar to our model. Furthermore, we explore how our model would perform if we change the format of spatial inputs in Fig. S4, and find key results are unchanged. It is unclear how hippocampal place fields could emerge without inputs that differentiate between spatial locations. We think it is appropriate to highlight the similarity of our model to well-known Hopfield-type recurrent models, where memories are stored as attractor states of the network dynamics.

      On the other hand, we agree that a common line of hippocampal modeling proposes that recurrent interactions reshape spatial inputs to produce place fields. This often arises in the context of hippocampus generating a predictive map, where inputs may be one-hot for a single spatial state, in a grid cell-like format, or a random projection of sensory features. We attempted to address this in section 2.6, using a model which superimposes the random connectivity needed for barcode generation with the structured connectivity needed for predictive map formation. We found that such a model was able to perform both predictive and barcode functions, suggesting a path forward to connecting different lines of hippocampal modeling in future work.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.



      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      In this manuscript, Xiong and colleagues investigate the mechanisms operating downstream to TRIM32 and controlling myogenic progression from proliferation to differentiation. Overall, the bulk of the data presented is robust. Although further investigation of specific aspects would make the conclusions more definitive (see below), it is an interesting contribution to the field of scientists studying the molecular basis of muscle diseases.

      We thank the Reviewer for appreciating our work and for their valuable suggestions to improve our manuscript. We have carefully addressed some of the concerns raised, as detailed here, while others, which require more experimental efforts, will be addressed as detailed in the Revision Plan.

      In my opinion, a few aspects would improve the manuscript. Firstly, the conclusion that Trim32 regulates c-Myc mRNA stability could be expanded and corroborated by further mechanistic studies:

      1. Studies investigating whether Trim32 binds directly to c-Myc RNA. Moreover, although possibly beyond the scope of this study, an unbiased screening of RNA species binding to Trim32 would be informative.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      If possible, studies in which the overexpression of different mutants presenting specific altered functional domains (NHL domain known to bind RNAs and Ring domain reportedly involved in protein ubiquitination) would be used to test if they are capable or incapable of rescuing the reported alteration of Trim32 KO cell lines in c-Myc expression and muscle maturation.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      An optional aspect that might be interesting to explore is whether the alterations in c-Myc expression observed in C2C12 might be replicated with primary myoblasts or satellite cells devoid of Trim32.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      I also have a few minor points to highlight:

        • It is unclear if the differences highlighted in graphs 5G, EV5D, and EV5E are statistically significant.

      Authors’ response. We thank the Reviewer for raising this point. We have now indicated the statistical analyses performed on the data presented in the mentioned figures (also in response to a point raised by Reviewer #3). Consistent with the conclusion that Trim32 is necessary for proper regulation of c-Myc transcript stability, 2-way ANOVA of the data now reported in Figure 5G shows a statistically significant effect of the genotype at 6h (right-hand graph) but not at D0 (left-hand graph). In the graphs of Fig. EV5 D and E, no significant changes are observed at D0, whereas at 6h the data show a significant difference at the 40 min time point. We included this information in the graphs and in the corresponding legends.

      - On page 10, it is stated that c-Myc down-regulation cannot rescue KO myotube morphology fully nor increase the differentiation index significantly, but the corresponding data is not shown. Could the authors include those quantifications in the manuscript?

      Authors’ response. As suggested, we included the graph showing the differentiation index upon c-Myc silencing in the Trim32 KO clones and in the WT clones, as a novel panel in Figure 6 (Fig. 6D). As already reported in the text, a partial recovery of differentiation index is observed but the increase is not statistically significant. In contrast, no changes are observed applying the same silencing in the WT cells. Legend and text were modified accordingly.

      Reviewer #1 (Significance (Required)):

      The manuscript offers several strengths. It provides novel mechanistic insight by identifying a previously unrecognized role for Trim32 in regulating c-Myc mRNA stability during the onset of myogenic differentiation. The study is supported by a robust methodology that integrates CRISPR/Cas9 gene editing, transcriptomic profiling, flow cytometry, biochemical assays, and rescue experiments using siRNA knockdown. Furthermore, the work has a disease relevance, as it uncovers a mechanistic link between Trim32 deficiency and impaired myogenesis, with implications for the pathogenesis of LGMDR8.

      At the same time, the study has some limitations. The findings rely exclusively on the C2C12 myoblast cell line, which may not fully represent primary satellite cell or in vivo biology. The functional rescue achieved through c-Myc knockdown is only partial, restoring Myogenin expression but not the full differentiation index or morphology, indicating that additional mechanisms are likely involved. Although evidence supports a role for Trim32 in mRNA destabilization, the precise molecular partners-such as RNA-binding activity, microRNA involvement, or ligase function-remain undefined. Some discrepancies with previous studies, including Trim32-mediated protein degradation of c-Myc, are acknowledged but not experimentally resolved. Moreover, functional validation in animal models or patient-derived cells is currently lacking. Despite these limitations, the study represents an advancement for the field. It shifts the conceptual framework from Trim32's canonical role in protein ubiquitination to a novel function in RNA regulation during myogenesis. It also raises potential clinical implications by suggesting that targeting the Trim32-c-Myc axis, or modulating c-Myc stability, may represent a therapeutic strategy for LGMDR8.

      This work will be of particular interest to muscle biology researchers studying myogenesis and the molecular basis of muscle disease, RNA biology specialists investigating post-transcriptional regulation and mRNA stability, and neuromuscular disease researchers and clinicians seeking to identify new molecular targets for therapeutic intervention in LGMDR8.

      The Reviewer expressing this opinion is an expert in muscle stem cells, muscle regeneration, and muscle development.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Summary:

      In this study, the authors sought to investigate the molecular role of Trim32, a tripartite motif-containing E3 ubiquitin ligase often associated with its dysregulation in Limb-Girdle Muscular Dystrophy Recessive 8 (LGMDR8), and its role in the dynamics of skeletal muscle differentiation. Using a CRISPR-Cas9 model of Trim32 knockout in C2C12 murine myoblasts, the authors demonstrate that loss of Trim32 alters the myogenic process, particularly by impairing the transition from proliferation to differentiation. The authors provide evidence in the way of transcriptomic profiling that displays an alteration of myogenic signaling in the Trim32 KO cells, leading to a disruption of myotube formation in-vitro. Interestingly, while previous studies have focused on Trim32's role in protein ubiquitination and degradation of c-Myc, the authors provide evidence that Trim32-regulation of c-Myc occurs at the level of mRNA stability. The authors show that the sustained c-Myc expression in Trim32 knockout cells disrupts the timely expression of key myogenic factors and interferes with critical withdrawal of myoblasts from the cell cycle required for myotube formation. Overall, the study offers a new insight into how Trim32 regulates early myogenic progression and highlights a potential therapeutic target for addressing the defects in muscular regeneration observed in LGMDR8.

      We thank the Reviewer for valuing our work and for their appreciated suggestions to improve our manuscript. We have carefully addressed some of the concerns raised as detailed here, while others, which require more laborious experimental efforts, will be addressed as reported in the Revision Plan.

      Major Comments:

      The work is a bit incremental based on this:

      https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030445

      And this:

      https://www.nature.com/articles/s41418-018-0129-0

      To their credit, the authors do cite the above papers.

      Authors’ response. We thank the Reviewer for this careful evaluation of our work against the current literature and for recognising the contribution of our findings to the understanding of the complex picture of myogenesis, in which the involvement of Trim32 and c-Myc, and of the Trim32-c-Myc axis, can occur at several stages and likely in narrow time windows along the process, thus possibly explaining some inconsistencies among reports.

      The authors do provide compelling evidence that Trim32 deficiency disrupts C2C12 myogenic differentiation and sustained c-Myc expression contributes to this defective process. However, while knockdown of c-Myc does restore Myogenin levels, it was not sufficient to normalize myotube morphology or differentiation index, suggesting an incomplete picture of the Trim32-dependent pathways involved. The authors should qualify their claim by emphasizing that c-Myc regulation is a major, but not exclusive, mechanism underlying the observed defects. This will prevent an overgeneralization and better align the conclusions with the author's data.

      Authors’ response. We agree with the Reviewer and have modified the phrasing that implied the Trim32-c-Myc axis is the exclusive mechanism, explicitly indicating in the Abstract and in the Discussion that other pathways contribute to guaranteeing proper myogenesis.

      The Abstract now reads: “… suggesting that the Trim32–c-Myc axis may represent an essential hub, although likely not the exclusive molecular mechanism, in muscle regeneration within LGMDR8 pathogenesis.”

      The Discussion now reads: “Functionally, we demonstrated that c-Myc contributes to the impaired myogenesis observed in Trim32 KO clones, although this is clearly not the only factor involved in the Trim32-mediated myogenic network; realistically other molecular mechanisms can participate in this process as also suggested by our transcriptomic results.”

      The authors provide a thorough and well-executed interrogation of cell cycle dynamics in Trim32 KO clones, combining phospho-histone H3, flow cytometry of DNA content, and CFSE proliferation assays. These complementary approaches convincingly show that, while proliferation states remain similar in WT and KO cells, Trim32-deficient myoblasts fail to withdraw normally from the cell cycle during exposure to differentiation-inducing conditions. This work adds clarity to a previously inconsistent literature and greatly strengthens the study.

      Authors’ response. We thank the Reviewer for appreciating our thorough analyses on cell cycle dynamics in proliferation conditions and at the onset of the differentiation process.

      The transcriptomic analysis (detailed in the "Transcriptomic analysis of Trim32 WT and KO clones along early differentiation" section of Results) is central to the manuscript and provides strong evidence that Trim32 deficiency disrupts normal differentiation processes. However, the description of the pathway enrichment results is highly detailed and somewhat compressed, which may make it challenging for readers to follow the key biological 'take-homes'. The narrative quickly moves across their multiple analyses like MDS, clustering, heatmaps, and bubble plots without pausing to guide the reader through what each analysis contributes to the overall biological interpretation. As a result, the key findings (reduced muscle development pathways in KO cells and enrichment of cell cycle-related pathways) can feel somewhat muted. The authors may consider reorganizing this section, so the primary biological insights are highlighted and supported by each of their analyses. This would allow the biological implications to be more accessible to a broader readership.

      Authors’ response. We thank the Reviewer for raising this point and apologise for being too brief in describing the data, which indeed left some points excessively implicit. As suggested, we have now reorganised this section and added the lists of enriched canonical pathways relative to WT vs KO comparisons at D0 and D3 (Fig. EV3B) as well as those relative to the comparison between D0 and D3 for both WT and Trim32 KO samples (Fig. EV3C), with their relative scores. We changed the Results section “Transcriptomic analysis of Trim32 WT and Trim32 KO clones along early differentiation” as reported below and modified the legends accordingly.

      The paragraph now reads: Based on our initial observations, the absence of Trim32 already exerts a significant impact by day 3 (D3) of C2C12 myogenic differentiation. To investigate how Trim32 influences early global transcriptional changes during the proliferative phase (D0) and early differentiation (D3), we performed an unbiased transcriptomic profiling of WT and Trim32 KO clones (Fig. 2A). Multidimensional Scaling (MDS) analysis revealed clear segregation of gene expression profiles based on both time of differentiation (Dim1, 44% variance) and Trim32 genotype (Dim2, 16% variance) (Fig. 2A). Likewise, hierarchical clustering grouped WT and Trim32 KO clones into distinct clusters at both timepoints, indicating consistent genotype-specific transcriptional differences (Fig. EV3A). Differentially Expressed Genes (DEGs) were detected in the Trim32 KO transcriptome relative to WT, at both D0 and D3. In proliferating conditions, 72 genes were upregulated and 189 were downregulated whereas at D3 of differentiation, 72 genes were upregulated and 212 were downregulated. Ingenuity Pathway Analysis of the DEGs revealed the top 10 Canonical Pathways displayed in Fig. EV3B as enriched at either D0 or D3 (Fig. EV3B). Several of these pathways can underscore relevant Trim32-mediated functions though most of them represent generic functions not immediately attributable to the observed myogenesis defects.

      Notably, the transcriptional divergence between WT and Trim32 KO cells is more pronounced at D3, as evidenced by a greater separation along the MDS Dim2 axis, suggesting that Trim32-dependent transcriptional regulation intensifies during early differentiation (Fig. 2A). Given our interest in the differentiation process, we therefore focused our analyses on comparing the changes occurring from D0 to D3 in WT (WT D3 vs. D0) and in Trim32 KO (KO D3 vs. D0) RNAseq data.

      Pathway enrichment analysis of D3 vs. D0 DEGs allowed the selection of the top-scored pathways for both WT and Trim32 KO data. We obtained 18 top-scored pathways enriched in each genotype (-log(p-value) ≥ 9 cut-off): 14 are shared while 4 are top-ranked only in WT and 4 only in Trim32 KO (Fig. EV3C). For the following analyses, we thus employed a total of 22 distinct pathways. To better mine those that are relevant in the passage from the proliferation stage to early differentiation and that are affected by the lack of Trim32, we built a bubble plot comparing side-by-side the scores and enrichment of the 22 selected top-scored pathways in WT and Trim32 KO (Fig. 2B). A heatmap of DEGs included within these selected pathways confirms the clustering of the samples considering both the genotypes and the timepoints, highlighting gene expression differences (Fig. 2C). These pathways are mainly related to muscle development, cell cycle regulation, genome stability maintenance and a few other metabolic cascades.

      As expected given the results related to Figure 1, moving from D0 to D3 WT clones showed robust upregulation of key transcripts associated with the Inactive Sarcomere Protein Complex, a category encompassing most genes in the “Striated Muscle Contraction” pathway, while in Trim32 KO clones this pathway was not among those enriched in the transition from D0 to D3 (Fig. EV3C). Detailed analyses of transcripts enclosed within this pathway revealed that on the transition from proliferation to differentiation, WT clones show upregulation of several Myosin Heavy Chain isoforms (e.g., MYH3, MYH6, MYH8), α-Actin 1 (ACTA1), α-Actinin 2 (ACTN2), Desmin (DES), Tropomodulin 1 (TMOD1), and Titin (TTN), a pattern consistent with previous reports, while these same transcripts were either non-detected or only modestly upregulated in Trim32 KO clones at D3 (Fig. 2D). This genotype-specific disparity was further confirmed by gene set enrichment barcode plots, which demonstrated significant enrichment of these muscle-related transcripts in WT cells (FDR_UP = 0.0062), but not in Trim32 KO cells (FDR_UP = 0.24) (Fig. EV3D). These findings support an early transcriptional basis for the impaired myogenesis previously observed in Trim32 KO cells.

      In addition to differences in muscle-specific gene expression, we also observed that several pathways related to cell proliferation and cell cycle regulation were more enriched in Trim32 KO cells compared to WT. This suggests that altered cell proliferation may contribute to the distinct differentiation behavior observed in Trim32 KO versus WT (Fig. 2B). Given that cell cycle exit is a critical prerequisite for the onset of myogenic differentiation, and considering that previous studies on the role of Trim32 in cell cycle regulation have reported inconsistent findings, we further examined cell cycle dynamics under our experimental conditions to clarify the contribution of Trim32 to this process.

      The work would be greatly strengthened by the inclusion of LGMDR8 primary cells, and rescue experiments of TRIM32 to explore myogenesis.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      Also, EU (5-ethynyl uridine) pulse-chase experiments to label nascent and stable RNA coupled with MYC pulldowns and qPCR (or RNA-sequencing of both pools) would further enhance the claim that MYC stability is being affected.

      Authors’ response. This point will be addressed as detailed in the Revision Plan

      "On one side, c-Myc may influence early stages of myogenesis, such as myoblast proliferation and initial myotube formation, but it may not contribute significantly to later events such as myotube hypertrophy or fusion between existing myotubes and myocytes. This hypothesis is supported by recent work showing that c-Myc is dispensable for muscle fiber hypertrophy but essential for normal MuSC function (Ham et al, 2025)." Also address and discuss the following, as what is currently written is not entirely accurate: https://www.embopress.org/doi/full/10.1038/s44319-024-00299-z and https://journals.physiology.org/doi/prev/20250724-aop/abs/10.1152/ajpcell.00528.2025

      Authors’ response. We thank the Reviewer for bringing these two publications to our attention; they indeed add important data on the in vivo complexity of the role of c-Myc in myogenesis. We included this point in our Discussion.

      The Discussion now reads: “On one side, c-Myc may influence early stages of myogenesis, such as myoblast proliferation and initial myotube formation, but it may not contribute significantly to later events such as myotube hypertrophy or fusion between existing myotubes and myocytes. This hypothesis is supported by recent work showing that c-Myc is dispensable for muscle fiber hypertrophy but essential for normal MuSC function (Ham et al, 2025). Other reports, instead, demonstrated the implication of periodic c-Myc pulses, mimicking resistance exercise, in muscle growth, a role that, however, cannot be observed in our experimental model (Edman et al., 2024; Jones et al., 2025).”

      Minor Comments:

      Z-score scale used in the pathway bubble plot (Figure 2C) could benefit from alternative color choices. Current gradient is a bit muddy and clarity for the reader could be improved by more distinct color options, particularly in the transition from positive to negative Z-score.

      Authors’ response. As suggested, we modified the z-score-representing colors using a more distinct gradient especially in the positive to negative transition in Figure 2B.

      Clarification on the rationale for selecting the "top 18" pathways would be helpful, as it is not clear if this cutoff was chosen arbitrarily or reflects a specific statistical or biological threshold.

      Authors’ response. As now better explained (see comment regarding Major point: Transcriptomics), we used a cut-off of -log(p-value) above or equal to 9 for pathways enriched in DEGs of the D0 vs D3 comparison for both WT and Trim32 KO. The threshold is now included in the Results section and the pathways (shared between WT and Trim32 KO and unique) are listed as Fig. EV3C.
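
      The selection logic described above can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout, and the pathway names and scores are hypothetical placeholders (not data from the study), and the analysis software actually used may expose these scores differently. The sketch applies the -log(p-value) ≥ 9 cut-off to each genotype and then separates shared from genotype-specific pathways.

```python
def top_pathways(scores, cutoff=9.0):
    """Return the set of pathways whose -log(p-value) score meets the cut-off."""
    return {name for name, score in scores.items() if score >= cutoff}

def compare_genotypes(wt_scores, ko_scores, cutoff=9.0):
    """Split top-scored pathways into shared, WT-only, KO-only, and the full union."""
    wt_top = top_pathways(wt_scores, cutoff)
    ko_top = top_pathways(ko_scores, cutoff)
    return {
        "shared": wt_top & ko_top,
        "wt_only": wt_top - ko_top,
        "ko_only": ko_top - wt_top,
        "all": wt_top | ko_top,
    }

# Hypothetical placeholder scores for the D3 vs. D0 comparison (not real data).
wt_scores = {"pathway_A": 12.1, "pathway_B": 9.0, "pathway_C": 4.2}
ko_scores = {"pathway_A": 10.5, "pathway_B": 6.3, "pathway_C": 9.7}

result = compare_genotypes(wt_scores, ko_scores)
for group in ("shared", "wt_only", "ko_only", "all"):
    print(group, sorted(result[group]))
```

      With the counts reported in the response, the same set arithmetic gives 14 shared + 4 WT-only + 4 Trim32 KO-only = 22 distinct pathways, each genotype contributing 18.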

      The authors alternate between using "Trim 32 KO clones" and "KO clones" throughout the manuscript. Consistent terminology across figures and text would improve readability.

      Authors’ response. We thank the Reviewer for this remark, and we apologise for having overlooked it. We amended this throughout the manuscript by always using for clarity “Trim32 KO clones/cells”.

      Cell culture methodology does not specify passage number or culture duration (only "At confluence") before differentiation. This is important, as C2C12 differentiation potential can drift with extended passaging.

      Authors’ response. We agree with the Reviewer that C2C12 passaging can reduce the differentiation potential of this myoblast cell line; this is indeed the main reason why we decided to employ WT clones, which underwent the same editing process as those that resulted mutated in the Trim32 gene, as reference controls throughout our study. We apologise for not indicating the passages in the first version of the manuscript; the Methods section has now been amended as follows:

      The C2C12 parental cells used in this study were maintained within passages 3–8. All clonal cell lines (see below) were utilized within 10 passages following gene editing. In all experiments, WT and Trim32 KO clones of comparable passage numbers were used to ensure consistency and minimize passage-related variability.

      Reviewer #2 (Significance (Required)):

      General Assessment:

      This study provides a thorough investigation of Trim32's role in the processes related to skeletal muscle differentiation using a CRISPR-Cas9 knockout C2C12 model. The strengths of this study lie in the multi-layered experimental approach as the authors incorporated transcriptomics, cell cycle profiling, and stability assays which collectively build a strong case for their hypothesis that Trim32 is a key factor in the normal regulation of myogenesis. The work is also strengthened by the use of multiple biological and technical replicates, particularly the independent KO clones which helps address potential clonal variation issues that could occur. The largest limitation to this study is that, while the c-Myc mechanism is well explored, the other Trim32-dependent pathways associated with the disruption (implicated by the incomplete rescue by c-Myc knockdown) are not as well addressed. Overall, however, the study convincingly identifies a critical function for Trim32 during skeletal muscle differentiation.

      Advance:

      To my knowledge, this is the first study to demonstrate the mRNA stability level of c-Myc regulation by Trim32, rather than through the ubiquitin-mediated protein degradation. This work will advance the current understanding and provide a more complete understanding of Trim32's role in c-Myc regulation. Beyond c-Myc, this work highlights the idea that TRIM family proteins can influence RNA stability which could implicate a broader role in RNA biology and has potential for future therapeutic targeting.

      Audience:

      This research will be of interest to an audience that focuses on broad skeletal muscle biology but primarily to readers with more focused research such as myogenesis and neuromuscular disease (LGMDR8 in particular) where the defined Trim32 governance over early differentiation checkpoints will be of interest. It will also provide mechanistic insights to those outside of skeletal muscle that study TRIM family proteins, ubiquitin biology, and RNA regulation. For translational/clinical researchers, it identifies the Trim32/c-Myc axis as a potential therapeutic target for LGMDR8 and related muscular dystrophies.

      Expertise:

      My expertise lies in skeletal muscle biology, gene editing, transgenic mouse models, and bioinformatics. I feel confident evaluating the data and conclusions as presented.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      In this paper, the authors examine the role of TRIM32, implicated in limb girdle muscular dystrophy recessive 8 (LGMDR8), in the differentiation of C2C12 mouse myoblasts. Using CRISPR, they generate mutant and wild-type clones and compare their differentiation capacity in vitro. They report that Trim32-deficient clones exhibit delayed and defective myogenic differentiation. RNA-seq analysis reveals widespread changes in gene expression, although few are validated by independent methods. Notably, Trim32 mutant cells maintain residual proliferation under differentiation conditions, apparently due to a failure to downregulate c-Myc. Translation inhibition experiments suggest that TRIM32 promotes c-Myc mRNA destabilization, but this conclusion is insufficiently substantiated. The authors also perform rescue experiments, showing that c-Myc knockdown in Trim32-deficient cells alleviates some differentiation defects. However, this rescue is not quantified, was conducted in only two of the three knockout lines, and is supported by inappropriate statistical analysis of gene expression. Overall, the manuscript in its current form has substantial weaknesses that preclude publication. Beyond statistical issues, the major concerns are: (1) exclusive reliance on the immortalized C2C12 line, with no validation in primary/satellite cells or in vivo, (2) insufficient mechanistic evidence that TRIM32 acts directly on c-Myc mRNA, and (3) overinterpretation of disease relevance in the absence of supporting patient or in vivo data. Please find more details below:

      We thank the Reviewer for the in-depth assessment of our work and valuable suggestions to improve the manuscript. We have carefully addressed some of the concerns raised, as detailed here, while others, which require more experimental effort, will be addressed as detailed in the Revision Plan.

      - TRIM32 complementation / rescue experiments to exclude clonal or off-target CRISPR effects and show specificity are lacking.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      - The authors link their in vitro findings to LGMDR8 pathogenesis and propose that the Trim32-c-Myc axis may serve as a central regulator of muscle regeneration in the disease. However, LGMDR8 is a complex disorder, and connecting muscle wasting in patients to differentiation assays in C2C12 cells is difficult to justify. No direct evidence is provided that the proposed mRNA mechanism operates in patient-derived samples or in mouse satellite cells. Moreover, the partial rescue achieved by c-Myc knockdown (which does not fully restore myotube morphology or differentiation index) further suggests that the disease connection is not straightforward. Validation of the TRIM32-c-Myc axis in a physiologically relevant system, such as LGMD patient myoblasts or Trim32 mutant mouse cells, would greatly strengthen the claim.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      -Some gene expression changes from the RNA-seq study in Figure 2 should be validated by qPCR

      Authors’ response. We thank the reviewer for this suggestion. This point will be addressed as detailed in the Revision Plan. We have selected several transcripts that will be evaluated in independent samples in order to validate the RNAseq results.

      - The paper shows siRNA knockdown of c-Myc in KO restores Myogenin RNA/protein but does not fully rescue myotube morphology or differentiation index. This suggests that Trim32 controls additional effectors beyond c-Myc; yet the authors do not pursue other candidate mediators identified in the RNA-seq. The manuscript would be strengthened by systematically testing whether other deregulated transcripts contribute to the phenotype.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      - There are concerns with experimental/statistical issues and insufficient replicate reporting. The authors use unpaired two-tailed Student's t-test across many comparisons; multiple testing corrections or ANOVA where appropriate should be used. In Figure EV5B and Figure 6B, the authors perform statistical analyses with control values set to 1. This method masks the inherent variability between experiments and artificially augments p values. Control sample values need to be normalized to one another to have reliable statistical analysis. Myotube morphology and differentiation index quantifications need clear description of fields counted, blind analysis, and number of biological replicates.

      Authors’ response. We thank the Reviewer for raising this point.

      Regarding the replicates, we clarified in the Methods and Legends that the Trim32 KO experiments have been performed on 3 biological replicates (independent clones) and the same for the reference control (3 independent WT clones), except for the Fig. 6 experiments that were performed on 2 Trim32 KO and 2 WT clones. All the Western Blots, immunofluorescence, qPCR data are representative of the results of at least 3 independent experiments unless otherwise stated. We reported the number and type of replicates as well as the microscope fields analyzed.

      We repeated the statistical analyses of the data in Figure 5G, EV5D, and EV5E using the more appropriate two-way ANOVA, as suggested, and we now report this information in the graphs and legends.

      We thank the Reviewer for raising this point. We agree and have replaced the graphs in Fig. EV5B and 6B with versions showing the control values normalised as suggested. The statistical analyses now reflect this change.
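
      The normalisation issue discussed in this exchange can be illustrated with a minimal sketch. The raw values below are hypothetical and are not taken from the manuscript, and the two helper functions are named only for this illustration. Dividing each replicate by its own control forces every control value to 1 and removes its variance, whereas dividing by the mean of the controls keeps the between-experiment spread that a statistical test needs.

```python
from statistics import mean, stdev

def normalise_per_experiment(control, treated):
    """Divide each replicate by its own control: every control becomes exactly 1."""
    return [c / c for c in control], [t / c for c, t in zip(control, treated)]

def normalise_to_control_mean(control, treated):
    """Divide all values by the mean of the controls: the control spread is kept."""
    m = mean(control)
    return [c / m for c in control], [t / m for t in treated]

# Hypothetical raw expression values from three independent experiments.
control = [0.8, 1.0, 1.2]
treated = [1.6, 2.1, 2.3]

ctrl_fixed, _ = normalise_per_experiment(control, treated)
ctrl_scaled, _ = normalise_to_control_mean(control, treated)

print(stdev(ctrl_fixed))   # no variability left in the control group
print(stdev(ctrl_scaled))  # control variability is preserved
```

      With the second approach the control group still averages 1 but retains a non-zero standard deviation, so a test comparing the groups sees the variability of both.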

      -Some English mistakes require additional read-throughs. For example: "Indeed, Trim32 has no effect on the stability of c-Myc mRNA in proliferating conditions, but upon induction of differentiation the stability of c-Myc mRNA resulted enhanced in Trim32 KO clones (Fig. 5G, Fig. EV5D and 5E)."

      Authors’ response. We re-edited this revised version of the manuscript as suggested.

      -Results in Figure 5A should be quantified

      Authors’ response. We addressed this point by quantifying the results shown in Fig. 5A and adding to the Figure a graph of the quantification of 3 experimental replicates. Quantification confirms that no statistically significant difference is observed. The Figure and the relative legend are modified accordingly.

      -Based on the nuclear marker p84, the separation of cytoplasmic and nuclear fractions is not ideal in Figure 5D

      Authors’ response. We agree with the Reviewer that the presence of p84 also in the cytoplasmic fraction is not ideal. Regrettably, we observed this faint p84 band in all the experiments performed. We think, however, that this does not affect the result, which clearly shows that c-Myc and Trim32 are never detected in the same compartment.

      -In Figure 6, it is not appropriate to perform statistical analyses on only two data points per condition.

      Authors’ response. We agree with the Reviewer and we now show the graph of the results of the 3 technical replicates for 2 biological replicates and do not indicate any statistics (Fig. 6B). The graph was also modified according to a previous point raised.

      -The nuclear MYOG phenotype is very interesting; could this be related to requirements of TRIM32 in fusion?

      Authors’ response. We agree with the Reviewer that Trim32 might also be necessary for myoblast fusion. This point is however beyond the scope of the present study and will be addressed in future work.

      - The hypothesis that TRIM32 destabilizes c-Myc mRNA is intriguing but requires stronger mechanistic support. This would be more convincing with RNA immunoprecipitation to test direct association with c-Myc mRNA, and/or co-immunoprecipitation to identify interactions between TRIM32 and proteins involved in mRNA stability. The study would also be strengthened by reporter assays, such as c-Myc 3′UTR luciferase constructs in WT and KO cells, to directly demonstrate 3′UTR-dependent regulation of mRNA stability.

      Authors’ response. This point will be addressed as detailed in the Revision Plan.

      Reviewer #3 (Significance (Required)):

      The manuscript presents a minor conceptual advance in understanding TRIM32 function in myogenic differentiation. Its main limitation is that all experiments were performed in C2C12 cells. While C2C12 are a classical system to study muscle differentiation, they are an immortalized, long-cultured, and genetically unstable line that represents a committed myoblast stage rather than bona fide satellite cells. They therefore do not fully model the biology of early regenerative responses. Several TRIM32 phenotypes reported in the literature differ between primary satellite cells and cell lines, and the authors themselves note such discrepancies. Extrapolating these findings to LGMDR8 pathogenesis without validation in primary human myoblasts, satellite cell assays, or in vivo regeneration models is therefore not justified. Previous work has already established clear roles for TRIM32 in mouse satellite cells in vivo and in patient myoblasts in vitro, whereas this study introduces a novel link to c-Myc regulation during differentiation. In addition, without mechanistic evidence, the central claim that TRIM32 regulates c-Myc mRNA stability remains descriptive and incomplete. Nevertheless, the results will be of interest to researchers studying LGMD and to those exploring TRIM32 biology in broader contexts. I review this manuscript as a muscle biologist with expertise in satellite cell biology and transcriptional regulation.

      Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Reply to the Reviewers

      I thank the Referees for their...

      Referee #1

      1. The authors should provide more information when...

      Responses + The typical domed appearance of a hydrocephalus-harboring skull is apparent as early as P4, as shown in a new side-by-side comparison of pups at that age (Fig. 1A). + Though this is not stated in the MS 2. Figure 6: Why has only...

      Response: We expanded the comparison

      Minor comments:

      1. The text contains several...

      Response: We added...

      Referee #2

    1. Author response:

      The following is the authors’ response to the previous reviews.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Conceptually, I feel that the authors addressed many concerns. However, I am still not convinced that their data support the strength of their claims. Additionally, I spent considerable time investigating the now freely available code and data and found several inconsistencies that would be critical to rectify. My comments are split into two parts, reflecting concerns related to the responses/methods and concerns resulting from investigation of the provided code/data. The former is described in the public review above. Because I show several figures to illustrate some key points for the latter part, an attached file will provide the second part: https://elife-rp.msubmit.net/elife-rp_files/2025/02/24/00136468/01/136468_1_attach_15_2451_convrt.pdf

      (1) This point is discussed in more detail in the attached file, but there are some important details regarding the identification of the learned trial that require more clarification. For instance, isn’t the original criterion by Gibbon et al. (1977) the first “sequence of three out of four trials in a row with at least one response”? The authors’ provided code for the Wilcoxon signed rank test and nDkl thresholds looks for a permanent exceeding of the threshold. So, I am not yet convinced that the approaches used here and in prior papers are directly comparable.

      We agree that there remain unresolved issues with our two attempts to create criteria that match that used by Gibbon and Balsam for trials to criterion. Therefore, we have decided to remove those analyses and return to our original approach showing trials to acquisition using several different criteria so as to demonstrate that the essential feature of the results—the scaling between learning rate and information—is robust. Figure 2A shows the results for a criterion that identifies the trial after which the cumulative response rate during the CS (=cumulative CS response count from Trial 1 divided by cumulative CS time from Trial 1) is consistently above the cumulative overall response rate across the trial (i.e., including both the CS and ITI). These data compare the CS response rate with the overall response rate, rather than with ITI rate as done in the previous version (in Figure 3A of that submission), to be consistent with the subsequent comparisons that are made using the nDkl. (The nDkl relies on the comparison between the CS rate and the overall rate, rather than between the CS and ITI rates.) Figures 2B and 2C show trials to acquisition when two statistical criteria, based on the nDkl, are applied to the difference between CS and overall response rates (the criteria are for odds >= 4:1 and p<.05). As we now explain in the text, a statistical threshold is useful inasmuch as it provides some confidence to the claim that the animals had learned by a given trial. However, this trial is very likely to be after the point when they had learned because accumulating statistical evidence of a difference necessarily adds trials.
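
      The first criterion described above (the trial from which the cumulative CS response rate stays above the cumulative overall rate) can be sketched as follows. This is a minimal illustration in Python, not the authors' MATLAB code. The function name and the per-trial input lists are hypothetical, and the overall counts and times are assumed to cover both the CS and the ITI.

```python
from itertools import accumulate

def first_permanent_trial(cs_counts, cs_times, total_counts, total_times):
    """Return the 1-based trial from which the cumulative CS response rate
    exceeds the cumulative overall rate on every remaining trial, or None."""
    cum_cs_n = list(accumulate(cs_counts))
    cum_cs_t = list(accumulate(cs_times))
    cum_all_n = list(accumulate(total_counts))
    cum_all_t = list(accumulate(total_times))
    above = [
        (n_cs / t_cs) > (n_all / t_all)
        for n_cs, t_cs, n_all, t_all in zip(cum_cs_n, cum_cs_t, cum_all_n, cum_all_t)
    ]
    trial = None
    # Walk back from the last trial while the CS rate is still above the overall rate.
    for i in range(len(above) - 1, -1, -1):
        if not above[i]:
            break
        trial = i + 1
    return trial

# Hypothetical toy data: 10 s CS, 100 s of total time per trial.
print(first_permanent_trial([0, 0, 2, 3, 4], [10] * 5, [5, 5, 6, 7, 8], [100] * 5))
```

      Scanning backwards from the final trial makes the "permanently exceeds" requirement explicit: a single later trial on which the cumulative CS rate falls to or below the overall rate resets the criterion.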

      Also, there’s still no regression line fitted to their data (Fig 3’s black line is from Fig 1, according to the legends). Accordingly, I think the claim in the second paragraph of the Discussion that the old data and their data are explained by a model with “essentially the same parameter value” is not yet convincing without actually reporting the parameters of the regression. Related to this, the regression for their data based on my analysis appears to have a slope closer to -0.6, which does not support strict timescale invariance. I think that this point should be discussed as a caveat in the manuscript.

      We now include regression lines fitted to our data in Figures 2A-C, and their slopes are reported in the figure note. We also note on page 14 of the revision that these regressions fitted to our data diverge from the black regression line (slope -1) as the informativeness increases. On pages 14-15, we offer an explanation for this divergence; that, in groups with high informativeness, the effective informativeness is likely to be lower than the assigned value because the rats had not been magazine trained which means they would not have discovered the food pellet as soon as it was released on the first few trials. On pages 15-16, we go on to note that evidence for a change in response rate during the CS in those very first few trials may have been missed because the initial response rates were very low in rats trained with very long inter-reinforcement intervals (and thus high informativeness). We also propose a solution to this problem of comparing between very low response rates, one that uses the nDkl to parse response rates into segments (clusters of trials with equivalent response rates). This analysis with parsed response rates provides evidence that differential responding to the CS may have been acquired earlier than is revealed using trial-by-trial comparisons.
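
      For readers who want to check a slope of this kind, the sketch below fits an ordinary least-squares line in log-log coordinates. It assumes that the regression is of log trials-to-acquisition on log informativeness, which is how a slope of -1 is read here. The values are synthetic, generated to follow an exact inverse relationship, and are not data from the study.

```python
import math

def loglog_slope(informativeness, trials):
    """Ordinary least-squares fit of log10(trials) on log10(informativeness).
    Returns (slope, intercept)."""
    xs = [math.log10(v) for v in informativeness]
    ys = [math.log10(v) for v in trials]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic illustration only: trials inversely proportional to informativeness.
ratios = [2, 5, 10, 50, 100]
trials = [300 / r for r in ratios]
slope, intercept = loglog_slope(ratios, trials)
print(slope, intercept)
```

      A slope near -1 corresponds to trials-to-acquisition that is inversely proportional to informativeness. A shallower slope, such as the -0.6 suggested by the reviewer, would indicate a weaker dependence.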

      (2) The authors report in the response that the basis for the apparent gradual/multiple step-like increases after initial learning remains unclear within their framework. This would be important to point out in the actual manuscript. Further, the authors’ acknowledgement in the responses that some phenomena are not captured by the current model would also be important to state in the manuscript itself.

      We have included a paragraph (on page 26) that discusses the interpretation of the steady/multi-step increase in responding across continued training.

      (3) There are several mismatches between results shown in figures and those produced by the authors’ code, or other supplementary files. As one example, rat 3 results in Fig 11 and Supplementary Materials don’t match and neither version is reproduced by the authors’ code. There are more concerns like this, which are detailed in the attached review file.

      Addressed next….

      The following is the response to the points raised in Part 2 of Reviewer 1’s pdf.

      (1a) I plotted the calculated nDkl with the provided code for rat 3 (Fig 11), but it looks different, and the trials to acquisition also didn’t match with the table provided (average of ~20 trial difference). The authors should revise the provided code and plots. Further, even in their provided figures, if one compares rat 3 in Supplementary Materials to data from the same rat in Fig 11, the curves are different. It is critical to have reproducible results in the manuscript, including the ability to reproduce with the provided code.

      We apologise for those inconsistencies. We have checked the code and the data in the figures to ensure they are all now consistent and match the full data in the nHT.mat file in OSF. Figures 11 and 12 from the previous version are now replaced with Figure 6 in the revised manuscript (still showing data from Rats 3 and 176). The data plotted in Fig 6 match what is plotted in the supplementary figures for those 2 rats (but with slightly different cropping of the x-axes) and all plots draw directly from nHT.mat.

      (1b) I tried to replicate also Fig 3C with the results from the provided code, but I failed especially for nDkl > 2.2. Fig 3A and B look to be OK.

      There was an error in the previous Fig 3C, which was plotting the data from the wrong column of the Trials2Acquisition Table. We suspect this arose because some changes to the file were not updated in Dropbox. However, that figure has changed (now Figure 2) as already mentioned, and no longer plots data obtained with that specific nDkl criterion. The figure now shows criteria that do not attempt to match the Gibbon and Balsam criterion.

      (1c) The trials to learn from the code do match with those in the Trials2Acquisition Table, but the authors’ code doesn’t reproduce the reported trials to learn values in the nDkl Acquisition Table. The trials to learn from the code are ~20 trials different on average from the table’s ones, for 1:20, 1:100, and 1:1000 nDkl.

      We agree that discrepancies between those different files were a source of potential confusion because they were using different criteria or different ways of measuring response rate (i.e., the “conventional” calculation of rate as number of responses/time, vs our adjusted calculation in which the first response in the CS was excluded as well as the time spent in the magazine, vs parsed response rates based on inter-response intervals). To avoid this, there is now a single table called Acquisition_Table.xlsx in OSF that includes Trials to acquisition for each rat based on a range of criteria or estimates of response rate in labelled columns. The data shown in Figure 2 are all based on the conventional calculation of response rate (provided in Columns E to H of Acquisition_Table.xlsx). To make the source of these data explicit, we have provided in OSF the MATLAB code that draws the data from the nHT.mat file to obtain these values for trials-to-acquisition.

      (1d) The nDkl Acquisition Table has columns with the value of the nDkl statistics at various acquisition landmarks, but the value does not look to be true, especially for rat 19. The nDkl curve provided by the authors (Supplementary Materials) doesn’t match the values in the table. The curve is below 10 until at least 300 trials, while the table reports a value higher than 20 (24.86) at the earliest evidence of learning (~120 trials?).

      We are very grateful to the reviewer for finding this discrepancy in our previous files. The individual plots in the Supplementary Materials now contain a plot of the nDkl computed using the conventional calculation of response rate (plot 3 in each 6-panel figure) and a plot of the nDkl computed using the new adjusted calculation of response rate (plot 4). These correspond to the signed nDkl columns for each rat in the full data file nHT.mat. The nDkl values at different acquisition landmarks included in Acquisition_Table.xlsx (Cols AB to AF) correspond to the second of these nDkl formulations. We point out that, of the acquisition landmarks based on the conventional calculation of response rate (Cols E to J of Acquisition_Table.xlsx), only the first two landmarks (CSrate>Contextrate and min_nDkl) match the permanently positive and minimum values of the plotted nDkl values. This is because the subsequent acquisition landmarks are based on a recalculation of the nDkl starting from the trial when CSrate>ContextRate, whereas the plotted nDkl starts from Trial 1.

      (2) The cumulative number of responses during the trial (Total) in the raw data table is not measured directly, but indirectly estimated from the pre-CS period, as (cumNR_Pre*[cumITI/cumT_Pre])+ cumNR_CS (cumNR_Pre: cumulative nose-poke response number during pre-CS period; cumITI: cumulative sum of ITI duration; cumT_Pre: cumulative pre-CS duration; cumNR_CS: cumulative response number during CS), according to ‘Explanation of TbyTdataTable (MATLAB).docx’. Why not use the actual cumulative responses during the whole trial instead of using a noisier measure during a smaller time window and then scaling it for the total period?

      Unfortunately, the bespoke software used to control the experimental events and record the magazine activity did not record data continuously throughout the experiment. The ITI responses were only sampled during a specified time-window (the “pre-CS” period) immediately before each CS onset. Therefore, response counts across the whole ITI had to be extrapolated.
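
      The extrapolation quoted by the reviewer, (cumNR_Pre*[cumITI/cumT_Pre]) + cumNR_CS, can be written out as a short function. This is an illustrative Python sketch of that formula only, with a hypothetical function name and made-up input values. It is not the authors' MATLAB implementation.

```python
def estimated_total_responses(cum_nr_pre, cum_iti, cum_t_pre, cum_nr_cs):
    """Scale pre-CS responses up to the whole ITI, then add the CS responses.

    cum_nr_pre: cumulative responses in the sampled pre-CS window
    cum_iti:    cumulative ITI duration (s)
    cum_t_pre:  cumulative duration of the pre-CS window (s)
    cum_nr_cs:  cumulative responses during the CS
    """
    if cum_t_pre <= 0:
        raise ValueError("cumulative pre-CS time must be positive")
    return cum_nr_pre * (cum_iti / cum_t_pre) + cum_nr_cs

# Made-up numbers: 4 pre-CS responses in 60 s of sampling, 600 s of ITI, 12 CS responses.
print(estimated_total_responses(4, 600.0, 60.0, 12))
```

      The scaling assumes that the response rate in the sampled pre-CS window is representative of the whole ITI, which is the source of the extra noise the reviewer points to.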

      (3) Regarding the “Matlab code for Find Trials to Criterion.docx”:

      (a) What’s the rationale for not using all the trials to calculate nDkl but starting the cumulative summation from the earliest evidence trial (truncated)? Also, this procedure is not described in the manuscript, and this should be mentioned.

      The procedure was perhaps not described clearly enough in the previous manuscript. We have expanded that text to make it clearer (page 12) which includes the text…

      “We started from this trial, rather than from Trial 1, because response rate data from trials prior to the point of acquisition would dilute the evidence for a statistically significant difference in responding once it had emerged, and thereby increase the number of trials required to observe significant responding to the CS. The data from Rat 1 illustrates this point. The CS response rate of Rat 1 permanently exceeded its overall response rate on Trial 52 (when the nD<sub>KL</sub> also became permanently positive). The nD<sub>KL</sub>, calculated from that trial onwards, surpassed 0.82 (odds 4:1) after a further 11 trials (on Trial 63) and reached 1.92 (p < .05) on Trial 81. By contrast, the nD<sub>KL</sub> for this rat, calculated from Trial 1, did not permanently exceed 0.82 until Trial 83 and did not exceed 1.92 until Trial 93, adding 10 or 20 trials to the point of acquisition.”

      (3b) The authors' threshold is the trial when the nDkl value exceeds the threshold permanently. What about using just the first pass after the minimum?

      Rat 19 provides one example where the nDkl was initially positive, and even exceeded threshold for odds 4:1 and p<.05, but was followed by an extended period when the nDkl was negative because the CS response rate was less than the overall response rate. It illustrates why the first trial on which the nDkl passes a threshold cannot be used as a reliable index of acquisition.

      (3c) Can the authors explain why a value of 0.5 is added to the cumulative response number before dividing it by the cumulative time?

      This was done to provide an “unbiased” estimate of the response count because responses are integers. For example, if a rat has made 10 responses over 100 s of cumulative CS time, the estimated rate should be at least 10/100 but could be anything up to, but not including, 11/100. A rate of 10.5/100 is the unbiased estimate. However, we have now removed this step when calculating the nDkl to identify trials to acquisition because we recognise that it would represent a larger correction to the rate calculated across short intervals than across long intervals and therefore bias comparison between CS and overall response rates that involve very different time durations. As such, the correction would artefactually inflate evidence that the CS response rate was higher than the contextual response rate. However, as noted earlier in this reply, we have now instituted a similar correction when calculating the pre-CS response rate over the final 5 sessions for rats that did not register a single response (hence we set their response count to 0.5).
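
      The bias described above can be seen with a small worked example. The counts and durations are hypothetical, chosen so that the uncorrected CS and overall rates are identical. Adding 0.5 to each count then raises the rate measured over the shorter interval by more than the rate measured over the longer one.

```python
def rate(count, seconds, correction=0.0):
    """Response rate with an optional additive correction to the count."""
    return (count + correction) / seconds

# Hypothetical counts: the same underlying rate (0.1 per s) over short and long intervals.
cs_raw = rate(2, 20)              # CS: 2 responses in 20 s
overall_raw = rate(20, 200)       # overall: 20 responses in 200 s

cs_adj = rate(2, 20, 0.5)         # 2.5 / 20
overall_adj = rate(20, 200, 0.5)  # 20.5 / 200

print(cs_raw, overall_raw)
print(cs_adj, overall_adj)
```

      After the correction the CS rate appears higher than the overall rate even though the raw rates were equal, which is the artefactual inflation the authors describe.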

      (3d) Although the authors explain that nDkl was set to negative if pre-CS rate is higher than CS rate, this is not included in the code because the code calculates the nDkl using the truncated version, starting to accumulate the poke numbers and time from the earliest evidence, thus cumulative CS rate is always higher than cumulative contextual rate. I expect then that the cumulative CS rate will be always higher than the cumulative pre-CS rate.

      Yes, that is correct. The negative sign is added to the nDkl when it is computed starting from Trial 1. But when it is computed starting from the trial when the CS rate is permanently > the overall rate, there is no need to add a sign because the divergence is always in the positive direction.

      (3e) Regarding the Wilcoxon signed rank test, please clarify in the manuscript that the input ‘rate’ is not the cumulative rate as used for the earliest evidence. Please also clarify if the rates being compared for the signed nDkl are just the instantaneous rates or the cumulative ones. I believe that these are the ‘cumulative’ ones (not as for Wilcoxon signed rank test), because if not, the signed nDkl curve of rat 3 would fluctuate a lot across the x-axis.

      The reviewer is correct in both cases. However, as already mentioned, we have removed the analysis involving the Wilcoxon test. The description of the nDkl already specifies that this was done using the cumulative rates.

      (4) Supplemental table ‘nDkl Acquisition Table.xlsx’ 3rd column (“Earliest”) descriptions are unclear.

      (a) It is described in the supplemental ‘Explanation of Excel Tables.docx’ as the ‘earliest estimate of the onset of a poke rate during the CSs higher than the contextual poke rate’, while the last paragraph of the manuscript’s method section says ‘Columns 4, 5 and 6 of the table give the trial after which conditioned responding appeared as estimated in the above described three different ways— by the location of the minimum in the nDkl, the last upward 0 crossings, and the CS parse consistently greater than the ITI parse, respectively. Column 3 in that table gives the minimum of the three estimates.’ I plotted the data from column 3 (right) and comparing them with Fig 3A (left) makes it clear that there’s an issue in this column. If the description in the ‘Explanation of Excel Tables.docx’ is incorrect, please update it.

      We agree that the naming of these criteria can cause confusion, hence we have changed them. On page 9 we have replaced “earliest” with “first” in describing the criterion plotted in Figure 2A showing the trial starting from which the cumulative CS response rate permanently exceeded the cumulative overall rate. What is labelled as “Earliest” in “Acquisition_Table.xlsx” is, as the explanation says, the minimum value across the 3 estimates in that table.

      (b) Also, the term ‘contextual poke rate’ in the 3rd column’s description is confusing as in the nDkl calculation it represents the poke rate during all the training time, while in the first paragraph of the ‘Data analysis’ part, the earliest evidence is calculated by comparing the ITI (pre-CS baseline) poke rate.

      Yes, we have kept the term “contextual” response rate to refer to responding across the whole training interval (the ITI and the CS duration). This is used in calculation of the nDkl. For consistency with this comparison, we now take the first estimate of acquisition (in Fig 2A) based on a comparison between the CS rate and the overall (context) rate (not the pre-CS rate).

      Reviewer #2 (Recommendations for the authors):

      In response to the Rebuttal comments:

      Analytical (1) relating to Figure 3C/D

      This is a reasonable set of alternative analyses, but it is not clear that it answers the original comment regarding why the fit was worse when using a theoretically derived measure. Indeed, Figure 3C now looks distinctly different to the original Gibbon and Balsam data in terms of the shape of the relationship (specifically, the Group Median - filled orange circles) diverge from the black regression line.

      As mentioned in response to Reviewer 1, there was a mistake in Figure 3C of the revised manuscript. The figure was actually plotting data using a more stringent criterion of nDkl > 5.4, corresponding to p<0.001. The figure was referencing the data in column J of the public Trials2Acquisition Table. The data previously plotted in Figure 3C are no longer plotted because we no longer attempt to identify a criterion exactly matching that used by Gibbon and Balsam.

      We agree that the data shown in the first 3 panels of Figure 2 do diverge somewhat from the black regression line at the highest levels of informativeness (C/T ratios > 70), and the regression lines fitted to the data have slopes greater than -1. We acknowledge this on page 14 of the revised manuscript. Since Gibbon and Balsam did not report data from groups with such high ratios, we can’t know whether their data too would have diverged from the regression line at this point. We now report in the text a regression fitted to the first 10 groups in our experiment, which have C/T ratios that coincide with those of Gibbon and Balsam, and those regression lines do have slopes much closer to -1 (and include -1 in the 95% confidence intervals). We believe the divergence in our data at the high C/T ratios may be due to the fact that our rats were not given magazine training before commencing training with the CS and food. Because of this, it is quite likely that many rats did not find the food immediately after delivery on the first few trials. Indeed, in subsequent experiments, when we have continued to record magazine entries after CS-offset, we have found that rats can take 90 s or more to enter the magazine after the first pellet delivery. This delay would substantially increase the effective CS-US interval, measured from CS onset to discovery of the food pellet by the rat, making the CS much less informative over those trials. We now make this point on pages 14-15 of the revised manuscript.

      Analytical (2)

      We may have very different views on the statistical and scientific approaches here.

      This scalar relationship may only be uniquely applicable to the specific parameters of an experiment where CS and US responding are measured with the same behavioral response (magazine entry). As such, statements regarding the simplicity of the number of parameters in the model may simply reflect the niche experimental conditions required to generate data to fit the original hypotheses.

      The extent to which our data are consistent with the data reported decades ago by Gibbon and Balsam indicates that the scalar relationship they identified is not unique to certain niche conditions, since any such special conditions would have to be true of both the acquisition of sign-tracking responses in pigeons and magazine entry responses in rats. Determining how broadly it applies will require further experimental work using different paradigms and different species to assess how the rate of acquisition is affected across a wide range of informativeness, just as we have done here.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):           

      Summary:

      The authors have created a new model of KCNC1-related DEE in which a pathogenic patient variant (A421V) is knocked into a mouse in order to better understand the mechanisms through which KCNC1 variants lead to DEE.  

      Strengths:

      (1)  The creation of a new DEE model of KCNC1 dysfunction. 

      (2)  In Vivo phenotyping demonstrates key features of the model such as early lethality and several types of electrographic seizures. 

      (3)  The ex vivo cellular electrophysiology is very strong and comprehensive including isolated patches to accurately measure K+ currents, paired recording to measure evoked synaptic transmission, and the measurement of membrane excitability at different time points and in two cell types.

      We thank Reviewer 1 for these positive comments related to strengths of the study.   

      Weaknesses:

      (1) The assertion that membrane trafficking is impaired by this variant could be bolstered by additional data.

      We agree with this comment. However, given the technical challenges of standard biochemical experiments for investigating voltage-gated potassium channels (e.g., antibody quality), the lack of a Kv3.1-A421V specific antibody, and the fact that Kv3.1 is expressed in only a small subset of cells, we did not undertake this approach. However, we did perform additional experiments and analysis to improve the rigor of the experiments supporting our conclusion that membrane trafficking is impaired in the Kcnc1-A421V/+ mouse. 

      Such experiments support a highly significant and robust difference in our (albeit imperfect) measurement of the membrane:cytosol ratio of Kv3.1 immunofluorescence between WT and Kcnc1-A421V/+ mice, which is consistent with lack of membrane trafficking (Figure 3). In the revised manuscript, we have added additional data points to this plot and updated the representative example images using improved imaging techniques to better showcase how Kcnc1-A421V/+ PV-INs differ from age-matched WT littermate controls. We think the result is quite clear. Future biochemical experiments perhaps best performed in a culture system in vitro could provide additional support for this conclusion.

      (2) In some experiments details such as the age of the mice or cortical layer are emphasized, but in others, these details are omitted.

      We apologize for this omission. We have now clarified the age of the mice and cortical layer for each experiment in the Methods and Results sections as well as figure legends.   

      (3) The impairments in PV neuron AP firing are quite large. This could be expected to lead to changes in PV neuron activity outside of the hypersynchronous discharges that could be detected in the 2-photon imaging experiments, however, a lack of an effect on PV neuron activity is only loosely alluded to in the text. A more formal analysis is lacking. An important question in trying to understand mechanisms underlying channelopathies like KCNC1 is how changes in membrane excitability recorded at the whole cell level manifest during ongoing activity in vivo. Thus, the significance of this work would be greatly improved if it could address this question.

      Yes, the impairments in neocortical PV-IN excitability are notably severe relative to other PV interneuronopathies that we and others have directly investigated (e.g., Kv3.1 or Kv3.2-/- knockout mice; Scn1a+/- mice). In the revised version of the manuscript, we have now added a more thorough investigation and analysis of our in vivo 2P calcium imaging data of PV-IN (and presumptive excitatory cell) neural activity (Figure 8 and Supplementary Figure 9; Methods, lines 230-271; Results, lines 630-657; and Discussion, lines 795-814).

      Because of the prominent recruitment of neuropil during presumptive myoclonic seizures, further investigation of individual neuronal excitability in vivo required a slightly different labeling strategy now using a soma-tagged GCaMP8m as well as a separate AAV containing tdTomato driven by the PV-IN-specific S5E2 enhancer. Our new results reveal an increase in the baseline calcium transient frequency in non-PV-INs, and reduced mean transient amplitudes in both non-PV cells and PV-INs. These interesting findings, which are consistent with attenuated PV-IN-mediated perisomatic inhibition leading to disinhibited excitatory cells in the Kcnc1-A421V/+ mice, link our in vivo results to the slice electrophysiology experiments. Of course, there are residual issues with the application of this technique to interneurons and the ability to resolve individual or small numbers of spikes, which likely explains the lack of genotype difference in calcium transient frequency in PV-INs.

      (4) Myoclonic jerks and other types of more subtle epileptiform activity have been observed in control mice, but there is no mention of littermate control analyzed by EEG. 

      We performed additional experiments as requested and did not observe myoclonic jerks or any other epileptic activity in WT control mice. We have included this data in the revised manuscript (Figure 9C).   

      Reviewer #2 (Public review):           

      Summary:

      Wengert et al. generated and thoroughly characterized the developmental epileptic encephalopathy phenotype of Kcnc1-A421V/+ knock-in mice. The Kcnc1 gene encodes the Kv3.1 channel subunit. Analogous to the role of BK channels in excitatory neurons, Kv3 channels are important for the recurrent high-frequency discharge in interneurons by accelerating the downward hyperpolarization of the individual action potential. Various Kcnc1 mutations are associated with developmental epileptic encephalopathy, but the effect of a recurrent A421V mutation was somewhat controversial and its influence on neuronal excitability has not been fully established. In order to determine the neurological deficits and underlying disease mechanisms, the authors generated cre-dependent KI mice and characterized them using neonatal neurological examination, high-quality in vitro electrophysiology, and in vivo imaging/electrophysiology analyses. These analyses revealed excitability defects in the PV+ inhibitory neurons associated with the emergence of epilepsy and premature death. Overall, the experimental data convincingly support the conclusion.

      Strengths:

      The study is well-designed and conducted at high quality. The use of the Cre-dependent KI mouse is effective for maintaining the mutant mouse line with premature death phenotype, and may also minimize the drift of phenotypes which can occur due to the use of mutant mice with minor phenotype for breeding. The neonatal behavior analysis is thoroughly conducted, and the in vitro electrophysiology studies are of high quality.

      We appreciate these positive comments from Reviewer 2. 

      Weaknesses:

      While not critically influencing the conclusion of the study, there are several concerns.

      In some experiments, the age of the animal in each experiment is not clearly stated. For example, the experiments in Figure 2 demonstrate impaired K+ conductance and membrane localization, but it is not clear whether they correlated with the excitability and synaptic defects shown in subsequent figures. Similarly, it is unclear at what age the authors conducted EEG recordings, and whether non-epileptic mice are younger than those with seizures.

      We have now updated the manuscript to include a clear report of age for all experiments, including the impaired K<sup>+</sup> conductance (now Figure 3) and EEG (now Figure 9). There was no intention to omit this information. The recordings of K<sup>+</sup> conductance impairments in PV-INs from Kcnc1-A421V/+ mice were completed at P16-21. Thus, we interpret the loss of potassium current density to be causally linked with the impairments in intrinsic physiological function at that same time period in neocortical layer II-IV PV-INs and, more subtly, in PV-positive cells in the RTN and neocortical layer V PV-INs.

      Mice used in the EEG experiments were P24-48, an age range which roughly corresponded with the midpoint on the survival curve for Kcnc1-A421V/+ mice. Although we saw significant mouse-to-mouse variability in seizure phenotype, no Kcnc1-A421V/+ mice completely lacked epilepsy or marked epileptiform abnormalities, neither of which were seen in WT mice. We did not detect a clear relationship between seizure frequency/type and mouse age. 

      The trafficking defect of mutant Kv3.1 proposed in this study is based only on the fluorescence density analysis which showed a minor change in membrane/cytosol ratio. It is not very clear how the membrane component was determined (any control staining?). In addition to fluorescence imaging, an addition of biochemical analysis will make the conclusion more convincing (while it might be challenging if the Kv3.1 is expressed only in PV+ cells).

      This relates to comment 1 of Reviewer 1. We agree that, in the initial submission of the manuscript, the evidence from IHC for Kv3.1 trafficking deficits was somewhat subtle. In the revised version of the paper, we have gathered additional replicates of this original experiment with improved imaging quality and clarify how the membrane component was specified, to now show a robust and highly significant (***P<0.001) decrease in membrane:cytosol Kv3.1 ratio. We have also now provided new example images better showcasing the deficits observed in the Kcnc1-A421V/+ mice (Figure 3). The membrane compartment was defined as the outermost 1 micron of the parvalbumin-defined cell soma (drawn blind to the Kv3.1b signal), and, importantly, all analysis was conducted blinded to mouse genotype. These measures help to ensure that the result is robust and unbiased. Nonetheless, we have added a paragraph in the Discussion section highlighting the limitations of our IHC evidence for trafficking impairment (Lines 868-883).

      While the study focused on the superficial layer because Kv3.1 is the major channel subunit, the PV+ cells in the deeper cortical layer also express Kv3.1 (Chow et al., 1999) and they may also contribute to the hyperexcitable phenotype via negative effect on Kv3.2; the mutant Kv3.1 may also block membrane trafficking of Kv3.1/Kv3.2 heteromers in the deeper layer PV cells and reduce their excitability. Such an additional effect on Kv3.2, if present, may explain why the heterozygous A421V KI mouse shows a more severe phenotype than the Kv3.1 KO mouse (and why they are more similar to Kv3.2 KO). Analyzing the membrane excitability differences in the deep-layer PV cells may address this possibility.

      We appreciate this thoughtful suggestion. We have now provided data from neocortical layer V PV interneurons in the revised manuscript (Supplementary Figure 5). Abnormalities in intrinsic excitability from neocortical layer V PV-INs in Kcnc1-A421V/+ mice were present, but less pronounced than in PV-INs from more superficial cortical layers. These results are consistent with the view that greater relative expression of Kv3.2 “dilutes” the impact of the Kv3.1 A421V/+ variant. Whether the A421V/+ variant specifically impairs membrane trafficking and/or gating of Kv3.2 remains unclear.

      We attempted to assess how the mutant Kv3.1 affects Kv3.2 localization, but were unsuccessful due to the lack of reliable antibodies. After immunostaining mouse brain sections with two different anti-Kv3.2 antibodies, only one produced a somewhat promising signal (see below). However, even in this case, Kv3.2 staining was successful only once (out of five independent staining experiments) and the signal varied across cortical regions, showing widespread cellular Kv3.2 signal in some areas (b, top panel), and barely detectable signal in others, regardless of Kv3.1 expression. In the remaining four attempts, we detected only ‘fiber-like’ immunostaining signal, further diminishing our confidence in the anti-Kv3.2 antibody, although results could be improved with further testing and refinement, which we will attempt. Consequently, this important question remains unresolved in this study.

      Author response image 1.

      Immunostaining of Kv3.1 and Kv3.2 in sagittal mouse brain sections. a) An example of intracellular Kv3.2 immunostaining signal, variable across the cortex of a WT mouse independent of Kv3.1 expression. b) Kv3.2 is detectable intracellularly in most of the cells in the top panel but barely detectable in the lowest panel. c) Representative image of Kv3.2 immunostaining signal in other sagittal mouse brain sections.

      We have discussed these important implications and limitations of our results in the Discussion (Lines 868-883). We agree with the Reviewer’s interpretation that an impact on Kv3.1/Kv3.2 heteromultimers across the neocortex may explain why the Kcnc1-A421V/+ mouse exhibits a more severe phenotype than Kv3.1-/- or Kv3.2-/- mice (see below), a view which we have attempted to further clarify in the Conclusion.

      In Table 1, the A421V PV+ cells show a more depolarized resting membrane potential than WT by ~5 mV, which seems a robust change and would influence the circuit excitability. The authors measured firing frequency after adjusting the membrane voltage to -65 mV, but are the excitability differences less significant if the resting potential is not adjusted? It is also interesting that such a membrane potential difference is not detected in young adult mice (Table 2). This loss of potential compensation may be important for developmental changes in the circuit excitability. These issues can be more explicitly discussed.

      We do not entirely understand this finding and its apparent developmental component. It could be compensatory, as suggested by the Reviewer; however, it is transient and seems to be an isolated finding (i.e., it is not accompanied by compensation in other properties). It is also possible that this change in Kcnc1-A421V/+ PV-INs may reflect impaired/delayed development. We cannot test excitability at a meaningfully later time point as the mice are deceased.

      The revised version of the manuscript contains additional data (Supplementary Figure 4) showing that major deficits in intrinsic excitability are still observed even when the resting membrane potential is left unadjusted. These results are further discussed in the Results section (lines 522-523) and the Discussion section (lines 727-731).   

      Reviewer #3 (Public review):           

      Summary:

      Here Wengert et al., establish a rodent model of KCNC1 (Kv3.1) epilepsy by introducing the A421V mutation. The authors perform video-EEG, slice electrophysiology, and in vivo 2P imaging of calcium activity to establish disease mechanisms involving impairment in the excitability of fast-spiking parvalbumin (PV) interneurons in the cortex and thalamic PV cells.

      Outside-out nucleated patch recordings were used to evaluate the biophysical consequence of the A421V mutation on potassium currents and showed a clear reduction in potassium currents. Similarly, action potential generation in cortical PV interneurons was severely reduced. Given that both potassium currents and action potential generation were found to be unaffected in excitatory pyramidal cells in the cortex the authors propose that loss of inhibition leads to hyperexcitability and seizure susceptibility in a mechanism similar to that of Dravet Syndrome.  

      Strengths: 

      This manuscript establishes a new rodent model of KCNC1-developmental and epileptic encephalopathy. The manuscript provides strong evidence that parvalbumin-type interneurons are impaired by the A421V Kv3.1 mutation and that cortical excitatory neurons are not impaired. Together these findings support the conclusion that seizure phenotypes are caused by reduced cortical inhibition.

      We thank Reviewer 3 for their view of the strengths of the study.

      Weaknesses:

      The manuscript identifies a partial mechanism of disease that leaves several aspects unresolved including the possible role of the observed impairments in thalamic neurons in the seizure mechanism. Similarly, while the authors identify a reduction in potassium currents and a reduction in PV cell surface expression of Kv3.1 it is not clear why these impairments would lead to a more severe disease phenotype than other loss-of-function mutations which have been characterized previously. Lastly, additional analysis of videoEEG data would be helpful for interpreting the extent of the seizure burden and the nature of the seizure types caused by the mutation.

      We agree with these comments from Reviewer 3. We studied neurons in the reticular thalamus and layer V neocortical PV-INs since they are also linked to epilepsy pathogenesis and are known to express Kv3.1. However, for most of the study, we focused on neocortical layer II-IV PV-INs, because these cells exhibited the most robust impairments in intrinsic excitability. Crossing our novel Kcnc1-Flox(A421V)/+ mice to a cerebral cortex interneuron-specific driver that would avoid recombination in the thalamus, such as Ppp1r2-Cre (RRID:IMSR_JAX:012686), could assist in determining the relative contribution of thalamic reticular nucleus dysfunction to the overall phenotype, as was done by Makinson et al. (2017) to address a similar question; however, we have been unable to obtain this mouse despite extensive effort. There are of course other Kv3.1-expressing neurons in the brain, including in the hippocampus, amygdala, and cerebellum, and we have provided additional discussion (Lines 731-736) of this issue.

      We further agree with the Reviewer that a major question in the field of KCNC1-related neurological disorders is the mechanistic underpinning of why the KCNC1-A421V variant leads to a more severe disease phenotype than other loss-of-function KCNC1 variants, and, further, why the mouse phenotype is more severe than the Kcnc1 knockout. Previous results and our own recordings in heterologous systems suggest that the A421V variant is more profoundly loss of function than the R320H variant (Oliver et al., 2017; Cameron et al., 2019; Park et al., 2019), which is consistent with A421V having a more severe disease phenotype. Relative to knockout of Kv3.1, our results are consistent with the view that the A421V variant exhibits dominant negative activity by reducing surface expression of Kv3.1 and/or Kv3.2 (an effect that would not occur in knockout mice), with a possible additional contribution from impaired gating of Kv3.1/Kv3.2 heteromultimers caused by inclusion of A421V subunits into the heterotetramer. Our finding that the magnitude of total potassium current was reduced in PV-INs by ~50% is consistent with a combination of these various mechanisms but does not distinguish between them.

      In the revised version of the manuscript, we have provided a more complete discussion of these important remaining questions regarding our interpretation of how the severity of KCNC1 disorders relates to the biophysical features of the ion channel variant (lines 868-883).

      Recommendations for the authors

      Reviewer #1 (Recommendations for the authors):          

      Major

      (1) The authors suggest that the reduced K+ current density in Kcnc1-A421V/+ neurons is due in part to impaired trafficking and cell surface expression of Kv3.1 in these neurons. The data supporting this claim aren't completely convincing. First, it's difficult to visualize a difference in Kv3.1 localization in the images shown in panel H, and importantly, it seems problematic that the method to assess Kv3.1 levels in membrane vs. cytosol relied on using PV co-staining to define the membrane compartment as the outermost 1 um of the PV-defined cell soma. This doesn't seem to be the best method to define the membrane compartment, as the PV signal should be largely cytosolic.

      As noted above, we have completed additional data collection to confirm our results, and have performed additional imaging and updated our example images to be more representative of the observed deficits in membrane Kv3.1 expression in the Kcnc1-A421V/+ mice. We attempted to identify a marker to more clearly label the membrane to combine with PV immunocytochemistry but were unable to do so despite some effort. 

      Is it possible that in control neurons, the cytosolic PV signal localizes within the membrane-bound Kv3.1 signal, with less colocalization, whereas in Kcnc1-A421V/+ neurons, there would be more colocalization of the cytosolic PV and improperly trafficked Kv3.1? Could the data be presented in this way showing altered colocalization of Kv3.1 with PV?

      We do not entirely understand the nature of this concern. In our experiments, we utilized the PV signal to determine the cell membrane and cytosolic compartments in an unbiased manner using a 1-micron shell traced around/outside the edge of the PV signal to define the membrane compartment, with the remainder of the area (minus the nuclear signal defined by DAPI) defined as the cytosol (see Methods, lines 176-186). Because we did not identify any alterations in PV signal or correlation between PV immunohistochemistry and tdTomato expression in Cre reporter strains between WT and Kcnc1-A421V/+ mice, we believe that our strategy for determining membrane:cytosol ratio of Kv3.1 in an unbiased manner is acceptable (albeit of course imperfect).
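
      For readers who want a concrete picture of this kind of shell-based measurement, the following is a minimal illustrative sketch and not the analysis code used in the study. It assumes a binary PV-defined soma mask, a binary DAPI-defined nucleus mask, and a Kv3.1 intensity image stored as 2D lists; it takes the outermost 1 micron of the soma mask as the "membrane" compartment and the remaining non-nuclear area as the "cytosol". The function names, the 4-neighbour erosion, and the `um_per_pixel` parameter are all hypothetical choices made for illustration.

```python
# Illustrative sketch only; NOT the authors' analysis pipeline.
# Masks are 2D lists of booleans; kv31 is a 2D list of fluorescence intensities.
# All names and parameters here are hypothetical.

def erode(mask, n_pixels):
    """Shrink a binary mask by n_pixels using repeated 4-neighbour erosion."""
    rows, cols = len(mask), len(mask[0])
    current = [row[:] for row in mask]
    for _ in range(n_pixels):
        nxt = [[False] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if not current[r][c]:
                    continue
                # A pixel survives only if all four neighbours lie inside the mask
                # (positions outside the image count as background).
                neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
                nxt[r][c] = all(
                    0 <= rr < rows and 0 <= cc < cols and current[rr][cc]
                    for rr, cc in neighbours
                )
        current = nxt
    return current


def membrane_cytosol_ratio(kv31, soma_mask, nucleus_mask, um_per_pixel, shell_um=1.0):
    """Mean Kv3.1 intensity in the outer shell divided by the mean in the cytosol."""
    shell_px = max(1, round(shell_um / um_per_pixel))
    interior = erode(soma_mask, shell_px)
    membrane, cytosol = [], []
    for r, row in enumerate(soma_mask):
        for c, in_soma in enumerate(row):
            if not in_soma:
                continue
            if not interior[r][c]:
                membrane.append(kv31[r][c])   # outer shell of the soma mask
            elif not nucleus_mask[r][c]:
                cytosol.append(kv31[r][c])    # interior, excluding the nucleus
    if not membrane or not cytosol:
        raise ValueError("empty membrane or cytosol compartment")
    return (sum(membrane) / len(membrane)) / (sum(cytosol) / len(cytosol))
```

      Because both masks come from the PV and DAPI channels alone, a measurement of this form can be made without reference to the Kv3.1 signal, which is the property the blinding argument above relies on.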

      Alternatively, membrane fractionation could be performed on WT vs Kcnc1-A421V/+ neurons, followed by Western blotting with a Kv3.1 antibody to show altered proportions in the cytosolic vs. membrane protein fractions. It's important that these results are convincing, as the findings are mentioned in the Abstract, the Results section, and multiple times in the Discussion, although it is still unclear how much the potential altered trafficking contributes to the decrease in K+ currents versus changes in channel gating.

      Multiple technical barriers made it difficult for us to gain direct biochemical evidence for altered trafficking of the A421V/+ Kv3.1 variant (see above). It is not clear how membrane fractionation techniques could be easily applied in this case (at least by us) when PV-INs constitute 3-5% of all neocortical neurons. We further agree (as noted above) that it is difficult to properly disentangle the relative roles of impaired membrane trafficking vs. gating deficits to the observed effect; however, we think that both phenomena are likely occurring. In the revised version of the manuscript, we have more explicitly discussed these limitations in the Discussion section (Lines 868-883).   

      (2) More information is needed regarding the age of mice used for experiments for the following results (added to the Results section as well as figure legends):

      PV density (Supplementary Figure 1) 

      K+ current data (Figure 2A-G)       

      Kv3.1 localization (Figure 2H and I)        

      RTN electrophysiology (Supplementary Figure 3)

      Excitatory neuron electrophysiology (Figure 4)             

      In vivo 2P calcium imaging (Figure 7) 

      Video-EEG (Figure 8)

      We apologize for omitting this critical information. In the revised manuscript, we have provided the age of mice for each of our experiments in the results section, in the figure legend, and in the methods section.   

      (3) It's unclear why developmental milestones/behavioral assessments were only done at P5-P10. In the previous publication of another Kcnc1 LOF variant (Feng et al. 2024), no differences were found at P5-P10, and it was suggested in the discussion that this finding was "consistent with the known developmental expression pattern of Kv3.1 in mouse, where Kv3.1 protein does not appear until P10 or later". In that paper, they did find behavioral deficits at 2-4 months. Even though this model is more severe than the previous model, it would be interesting to determine if there are any behavioral deficits at a later time point (especially as they find more neurophysiological impairments at P32-42).

      As in our previous study, the lack of clear behavioral deficits in developmental milestones from P5-15 is potentially expected considering the developmental expression of Kv3.1, and we performed these experiments primarily to showcase that the Kcnc1-A421V/+ mice exhibit otherwise normal overall early development (although this could be an artifact of the sensitivity of our testing methods).

      For the revised manuscript, we have conducted additional experiments to investigate behavioral deficits in adult Kcnc1-A421V/+ mice. We found cognitive/learning deficits in Kcnc1-A421V/+ mice relative to WT in both the Barnes maze (Figure 2A-C) and Y-maze (Figure 2D-F). Other aspects of animal behavior, including cerebellar-related motor function, are likely also impaired at post-weaning timepoints, and will be included in a forthcoming research study focusing on motor function in these mice.

      (4) In the Results section, it should be more clearly stated which cortical layer/layers are being studied. In some cases, it mentions layers 2-4, and in some, only layer 4, and in others, it doesn't mention layers at all. Toward the beginning of the Results section, the rationale for focusing on layers 2-4 to assess the effects of this variant should be well described and then, for each experiment, it should be stated which cortical layers were assessed. Related to this point, it seems electrophysiology was only done in layer 4; the rationale for this should also be included.

      We have now clarified which neocortical layers were under investigation in the study. All PV-INs were targeted in somatosensory layers II-IV, while excitatory neurons were either cortical layer IV spiny stellate cells or pyramidal cells. Paired recordings were also completed in layer IV. We have also more explicitly articulated our rationale for looking at PV-INs in layers II-IV to examine the cellular/circuit-level impact of Kv3.1 in a model of developmental and epileptic encephalopathy (Lines 487-491).

      (5) Kcnc1-A421V/+ PV neurons showed more robust impairments in AP shape and firing at P32-42 than at P16-21 (Figure 3), and only showed synaptic neurotransmission alterations at P32-42 (Figure 6). Thus, it's unclear why Kcnc1-A421V/+ excitatory neurons were only assessed at P16-21 (Figure 4 and Supplementary Figure 4 related to Figure 5), particularly if only secondary or indirect effects on this population would be expected.

      We appreciate this excellent point raised by the Reviewer and we have taken the suggestion to examine excitatory neurons at P32-42 in addition to the earlier juvenile timepoint. Our new results from the later timepoint are similar to our results at P16-21: Excitatory neurons show no statistically significant impairments in intrinsic excitability at either of the two timepoints examined (Supplementary Figure 7). This adds support to our original conclusion that PV-INs represent the major driver of disease pathology across development.   

      (6) The 2P calcium imaging experiments are potentially interesting, however, a relationship between these results and the electrophysiology results for PV neurons is lacking. Was there an attempt to assess the frequency and/or amplitude of calcium events specifically in PV neurons, outside of the hypersynchronous discharges, to determine whether there are differences between WT and Kcnc1-A421V/+, as was seen in the electrophysiological analyses? It does seem there are some key differences between the two experiments (age: later timepoint for 2P vs. P16-21 and P32-42, layer: 2/3 vs. 4, and PV marking method: virus vs. mouse line), but the electrophysiological differences reported were quite strong. Thus, it would be surprising if there were no alterations in calcium activity among the Kcnc1-A421V/+ PV neurons.

      In our initial experiments, the prominent neuropil GCaMP signal in Kcnc1-A421V/+ mice rendered it difficult to distinguish and accurately describe baseline neuronal excitability in PV-INs and non-PV cells. In our revised manuscript, we utilized a soma-tagged GCaMP8m and separately labeled PV-INs through S5E2-tdTomato. This strategy made it possible to assess the amplitude and frequency of calcium transients in both PV-positive and PV-negative cells in vivo. We have updated the description of our methods (lines 230-271) and our results (lines 630-657) in the revised manuscript.

      As noted above, our more detailed analysis of somatic calcium transients in PV-IN and non-PV cells during quiet rest (Figure 8 and Supplementary Figure 9) shows that PV-INs from Kcnc1-A421V/+ mice are abnormally excitable, having reduced transient amplitude relative to WT controls. Interestingly, non-PV cells also exhibited an increased calcium transient frequency and reduced amplitude, which is potentially consistent with reduced perisomatic inhibition causing disinhibition in cortical microcircuits. We again highlight that the slow kinetics of GCaMP combined with the calcium buffering and brief spikes of PV-INs render quantification of action potential frequency and comparisons between groups difficult.

      (7) As mentioned above, it would be helpful to state the time points or age ranges of these experiments to better understand the results and relate them to each other. For example, the 2P imaging showed apparent myoclonic seizures in 7/7 Kcnc1-A421V/+ mice (recorded for a total of 30-50 minutes/mouse), but the video-EEG showed myoclonic seizures in only 3/11 Kcnc1-A421V/+ mice (recorded for 48-72 hours/mouse). Were these experiments done at very different age ranges, so this difference could be due to some sort of progression of seizure types and events as the mice age? Is it possible these are not the same seizure types (even though they are similarly described)? This discrepancy should be discussed.

      Mice in the EEG experiments were between the ages of P24 and 48, slightly younger than the age in which we carried out the in vivo calcium imaging experiments (>P50). Therefore, an age-related exacerbation in myoclonic jerks is possible. 

      As is highlighted by the Reviewer, it is interesting that the myoclonic seizures were only detected in a portion of the Kcnc1-A421V/+ mice during EEG monitoring (4/12). We believe that the difference is most likely driven by more sensitive detection of the myoclonic jerk activity and behavior in the 2P imaging of neuropil cellular activity compared to our video-EEG monitoring and 2P imaging of soma-tagged GCaMP. We have occasionally observed repetitive myoclonic jerking in mice that appears highly localized (i.e., one forepaw only), suggesting that the myoclonic seizures exist on a spectrum of severity from focal to diffuse. It is therefore possible that myoclonic events and electrographic activity may be slightly underestimated in our video-EEG experiments.

      We have now added a few lines discussing this discrepancy in the Discussion (lines 809-814).

      (8) Myoclonic jerks and other types of more subtle epileptiform activity have been observed in control mice. Was video-EEG performed on control mice? These data should be added to Figure 8.

      We have added recordings in control WT mice (N=4). We did not detect myoclonic jerks or other epileptiform activity in the control mice (Figure 9).  

      Minor

      (1) In the first Results section, Line 365, the P value (P<0.001) is different from that in the legend for Figure 1, line 743 (P<0.0001).

      We have fixed this discrepancy. 

      (2) For Supplementary Figure 1, it would be helpful to show images that span the cortical layers (1-6), as PV and Kv3.1 are both expressed across the cortical layers.

      We have updated Supplementary Figure 1 with better example images that span the cortical layers.    

      (3) Error bars should be added to the line graphs in Supplementary Figure 2, particularly panels B and C. Some of the differences appear small considering the highly significant p-values (i.e. body weight at P7 and brain weight at P21).

      The values shown in Supplementary Figure 2D-E are percentages of mice displaying a particular characteristic, so there is no variance for the data.

      Supplementary Figure 2B-C do contain error bars plotted as SEM; however, because of the large N and small degree of variance in the measurements, the error bars are not apparent in the graphs. This has been noted in the Supplementary Figure 2 legend for clarity.

      (4) In Figure 3, although the Kcnc1-A421V/+ neurons have elevated AP amplitudes relative to WT, the representative traces for P16-21 and P32-42 groups appear strikingly opposite (traces in B and G appear to have much higher amplitudes than those in C and H). As this is one of the three AP phenotypes described, it would be nice to have it reflected in the traces.

      We have updated our example traces to better represent our main findings including AP amplitude for both P16-21 and P32-42 timepoints.  

      (5) Were any effects on the AHP assessed in the electrophysiology experiments? As other studies have reported the effects of altered Kv3 channel activity on AHP, this parameter could be interesting to report as well.

      We have now provided data on the afterhyperpolarization for each condition displayed in the Supplementary data tables. Interestingly, we failed to detect significant differences in AHP between WT and Kcnc1-A421V/+ PV-INs, RTN neurons, or pyramidal cells, although we did identify differences in the dV/dt of the repolarization phase of the AP.   

      (6) The figure legend for Figure 7 has errors in the panel labeling (D instead of C, and two Fs).

      This error has been corrected in the revised manuscript.

      Reviewer #3 (Recommendations for the authors):

      Specific comments and questions for the authors:         

      (1) Do the authors provide a reason for why the juvenile animals are unaffected by the A421V mutation? Is it that PV cells have not fully integrated at this early time point or that Kv3.1 expression is low? Is the developmental expression profile of Kv3.1 in PV cells known and if so could the authors update the discussion with this information?

      We interpret the normal early developmental milestones (P5-P15) to reflect that Kcnc1-A421V/+ mice exhibit the onset of their neurological impairment at the same time that PV-INs upregulate Kv3.1, develop a fast-spiking physiological phenotype, and integrate into functional circuits in the third and fourth postnatal weeks. We have updated the discussion (Line 780-782) with this information and more clearly describe our interpretation of these early-life behavioral experiments.   

      (2) I would like to see a more complete analysis of the Video-EEG data that is included in Figure 8. What was the seizure duration and frequency? Were there spike-wave seizure types observed? Were EEG events that involve thalamocortical circuitry affected such as spindles? Was sleep architecture impaired in the model? Were littermate control animals recorded?

      Although classical convulsive seizures represent only part of the overall epilepsy phenotype that this mouse exhibits, we agree that reporting seizure duration and frequency is important. We have now included this in our revised manuscript (lines 624-626). We have also now added WT control mice to our dataset, and, as expected, we failed to observe any epileptic features in our WT recordings.

      In our EEG experiments, we did not record EMG activity in the mouse to allow for unambiguous determination of sleep vs. quiet wakefulness. For that reason, and because we believe it beyond the scope of this particular study, we did not examine sleep-related EEG phenomena such as spindles or sleep architecture. We have, however, added a line in the discussion (line 771-774) suggesting that future studies focus on a more thorough investigation of the EEG activity in these animals. 

      (3) The in vivo calcium imaging data shows synchronous bursts in A421V animals which is in agreement with the synchronous bursts observed in the EEG. Overall the analysis of the in vivo calcium imaging data appears to be rudimentary and perhaps this is a missed opportunity. What additional insights were gained from this technically demanding experiment that were not obtained from the EEG recordings?

      As noted above, in the revised version of the manuscript, we have conducted additional experiments which allowed us to separately examine PV-IN and non-PV neuron excitability via 2P in vivo calcium imaging. This required an alternative strategy to label individual neuronal somata without contamination by the robust neuropil signal that we observed in the approach undertaken in the original submission. We’ve described the details of this new approach in methods (Lines 230-271) and results section (lines 630-657).

      Our new results (Figure 8 and Supplementary Figure 9) reveal that, during quiet rest, neocortical PV-INs from Kcnc1-A421V/+ mice exhibit a reduction in calcium transient amplitude and that non-PV cells exhibit altered transient frequency and amplitude. Overall, we believe that these results are consistent with the view that PV-IN-mediated perisomatic inhibition is compromised in Kcnc1-A421V/+ mice, which leads to downstream hyperexcitability in excitatory neurons within cortical microcircuits.

      (4) The increased severity of seizure phenotypes observed in the A421V model relative to knockout mice is interesting but also confusing given what is known about this mutation. As the authors point out, a possible explanation is that the mutation is acting in a dominant negative manner, where mutant Kv3.1 channels compete with other Kvs that would otherwise be able to partially compensate for the loss of Kv function. Alternatively, the A421V mutation might act by affecting the trafficking of heterotetrameric Kv3 channels to the membrane. Can the authors clarify why a trafficking deficit would produce a different effect than a loss of function mutation? Are the authors proposing that a hypomorphic mutation involving both a partial trafficking deficit and a dominant negative effect of those channels that are properly localized is more severe than a "clean" loss of function? The roughly 50% loss of potassium current absent a change in gating would be expected to behave like a loss-of-function mutation. This might be addressed by comparing the surface expression of the other Kv channels and/or through the use of Kv3.1-selective pharmacology.

      These are excellent points raised by the Reviewer. As noted above, we have endeavored to clarify our hypothesis as to the basis of this phenomenon, although the mechanistic basis for the more severe phenotype in the Kcnc1-A421V/+ mouse relative to the Kv3.1 knockout is not entirely clear. Our physiology results, and the evidence presented supporting a trafficking impairment, are consistent with dominant negative action of the Kv3.1 A421V variant at the level of channel gating and/or trafficking. To restate, we think the Kcnc1-A421V/+ heterozygous variant is more severe than a Kv3.1 knockout for (at least) three reasons: variant Kv3.1 is incorporated into Kv3.1/Kv3.2 heterotetramers to (1) impair trafficking to the membrane as well as (2) alter the electrophysiological function of those channels that do successfully traffic to the membrane (while Kv3.1 knockout affects Kv3.1 only), and (3) the heterozygous variant may escape the compensatory upregulation of Kv3.2 that is known to occur in Kv3.1 knockout mice.

      For example, our data are consistent with the view that heterotetramers of WT Kv3.1 and Kv3.2 potentially come together with the A421V Kv3.1 subunit in the endoplasmic reticulum and then fail to traffic to the membrane due to the presence of one or more A421V subunit(s), as evidenced by increased Kv3.1 staining in the cytosol in the Kcnc1-A421V/+ mouse relative to WT. This is in contrast to what would occur in the Kv3.1 knockout mice, as there is no subunit produced from the null allele to prevent WT Kv3.2 subunits from forming fully functional Kv3.2 homotetramers that then reach the cell surface and function properly. This is one specific possible mechanism for dominant negative activity.

      A non-mutually-exclusive mechanism is that inclusion of one or more Kv3.1 A421V subunits into Kv3 heterotetramers impairs gating and prevents potassium flux such that, even if the tetramer does reach the membrane, that entire tetramer fails to contribute to the total potassium current. This is another possible mechanism for dominant negative function of the A421V subunit.

      Experimental elucidation of the precise mechanism of the dominant negative activity of the A421V Kcnc1 variant is beyond the scope of this study, although our lab is continuing to work on this. It will likely require dose-response experiments in which various ratios of WT and Kv3.1 A421V subunits are co-expressed in heterologous cells and then recorded for an overall effect on potassium current, similar to Clatot et al. (2017).

      In the revised manuscript, we have updated our discussion of these mechanistic considerations for KCNC1-related epilepsy syndromes in lines 868-883 in the Discussion. 

      References

      Cameron JM et al. (2019) Encephalopathies with KCNC1 variants: genotype-phenotype-functional correlations. Annals of Clinical and Translational Neurology 6:1263–1272.

      Clatot J, Hoshi M, Wan X, Liu H, Jain A, Shinlapawittayatorn K, Marionneau C, Ficker E, Ha T, Deschênes I (2017) Voltage-gated sodium channels assemble and gate as dimers. Nature Communications 8.

      Makinson CD, Tanaka BS, Sorokin JM, Wong JC, Christian CA, Goldin AL, Escayg A, Huguenard JR (2017) Regulation of Thalamic and Cortical Network Synchrony by Scn8a. Neuron 93:1165-1179.e6.

      Oliver KL et al. (2017) Myoclonus epilepsy and ataxia due to KCNC1 mutation: Analysis of 20 cases and K+ channel properties. Annals of Neurology 81.

      Park J et al. (2019) KCNC1-related disorders: new de novo variants expand the phenotypic spectrum. Annals of Clinical and Translational Neurology 6:1319–1326.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) A detailed comparison between this work and the work of Sun et al. on experimental protocols and reagents in the main text will be beneficial for readers to assess critically.

      We have added a Key Reagents Table outlining the key reagents used in our study. In terms of experimental protocols, we replicated those described by Sun et al. in most instances and described any differences when present. With this resubmission, we included additional ZnMP accumulation experiments in liquid media (see point 3 below).

      (2) The GaPP used by Sun et al. (purchased from Frontier Scientific) is more effective in killing the worm than the one used in this study (purchased from Santa Cruz). Is the different outcome due to the differences in reagents? Moreover, Sun et al. examined the lethality after 3-4 days, while this work examined the lethality after 72 hours. Would the extra 24 hours make any difference in the result?

      We now cite product vendor differences as a possible reason for the observed difference in worm death, as the reviewer suggests, on page 8 (see text below) and include these differences in the Key Reagents Table. We also now stress the fact that our experiments included different doses of GaPP and the use of eat-2 mutants as an additional control, which we believe adds rigor and demonstrates the potency of GaPP in our experiments. We decided on assessment at 72 hours, as we deemed it a less nebulous time point than 3-4 days. Most of the observed worm death occurred earlier in this interval, so we believe it is unlikely that large group differences would emerge after an additional 24 hours.

      “Exposing worms to GaPP, a toxic heme analog, we observed that nematodes deficient in HRG-9 and HRG-10 displayed increased survival compared to WT worms, consistent with prior work,[13] though the between-group difference was markedly smaller in our study. We required higher GaPP concentrations to induce lethality, potentially due to product vendor differences, but did observe a clear dose-dependent effect across strains. Although it was previously proposed that the survival benefit seen in worms lacking HRG-9 and HRG-10 resulted from reduced transfer from intestinal cells after GaPP ingestion, our data suggest the reduced lethality is more likely due to decreased environmental GaPP uptake. Supporting this notion, DKO worms exhibited lawn avoidance, reduced pharyngeal pumping, and modestly lower intestinal ZnMP accumulation when exposed to this fluorescent heme analog on agar plates. In liquid media, DKO worms demonstrated higher fluorescence, but only in ZnMP-free conditions, suggesting the presence of gut granule autofluorescence. Furthermore, survival following exposure to GaPP was highest in eat-2 mutants, despite heme trafficking being unaffected in this strain.”

      (3) This work reported the opposite result of Sun et al. for the fluorescent ZnMP accumulation assay. However, the experimental protocols used by the two studies are massively different. Sun et al. did the ZnMP staining by incubating the L4-stage worms in an axenic mCeHR2 medium containing 40 μM ZnMP (purchased from Frontier Scientific) and 4 μM heme at 20 ℃ for 16 h, while this work placed the L4-stage worms on the OP50 E. coli seeded NGM plates treated with 40 μM ZnMP (purchased from Santa Cruz) for 16 h. The liquid axenic mCeHR2 medium is bacteria-free, heme-free, and consistent for ZnMP uptake by worms. This work has mentioned that the hrg-9 hrg-10 double null mutant has bacterial lawn avoidance and reduced pharyngeal pumping phenotypes. Therefore, the ZnMP staining protocol used in this work faces challenges in the environmental control for the wild type vs. the mutant. The authors should adopt the ZnMP staining protocol used by Sun et al. for a proper evaluation of fluorescent ZnMP accumulation.

      We agree with this comment. As such, we performed the ZnMP assay in liquid media conditions, as now described on page 13:

      “For liquid media experiments, three generations of worms were cultured in regular heme (20 µM) axenic media, with the first two generations receiving antibiotic-supplemented media (10 mg/ml tetracycline) and the 3rd generation cultivated without antibiotic. L4 worms from the 3rd generation were placed in media containing 40 µM ZnMP for 16 hours before being prepared and mounted for imaging as above. Worms were imaged on Zeiss Axio Imager 2 at 40x magnification, with image settings kept uniform across all images. Fluorescent intensity was measured within the proximal region of the intestine using ImageJ.”

      In heme-free media, both WT and DKO worms invariably entered L1 arrest; thus, we were not able to replicate the results reported by Sun et al. Using media containing heme, we did see an increase in fluorescence, but only in the ZnMP-free condition, indicating that the increased signal was attributable to autofluorescence. This is a known phenomenon associated with gut granules in C. elegans in the setting of oxidative stress. The results of these experiments are now summarized on page 6:

      “DKO nematodes at the L4 larval stage were previously shown to accumulate the fluorescent heme analog zinc mesoporphyrin IX (ZnMP) in intestinal cells in low-heme (4 µM) liquid media. While attempting to replicate this experiment, we observed that both wildtype and DKO nematodes entered L1 arrest under these conditions. Therefore, to allow for developmental progression, we grew worms on standard OP50 E. coli plates and in media containing physiological levels of heme (20 µM). We then examined whether differences in ZnMP uptake persisted under these basal conditions. DKO worms grown on ZnMP-treated E. coli plates displayed significantly reduced intestinal ZnMP fluorescence compared to N2 (Figure 1B and C). Using basal heme media with ZnMP, there was no significant difference in ZnMP fluorescence between DKO and wildtype nematodes, although DKO worms grown in media without ZnMP exhibited significantly higher autofluorescence (Figure 1D and E). To test whether autofluorescence may have contributed to the higher fluorescent intensities previously reported in heme-deficient DKO worms, we repeated this experiment on agar plates under starved conditions but did not observe a difference between groups (Figure 1B).”

      (4) A striking difference between the two studies is that Sun et al. emphasize the biochemical function of TANGO2 homologs in heme transporting with evidence from some biochemical tests. In contrast, this work emphasizes the physiological function of TANGO2 homologs with evidence from multiple phenotypical observations. In the discussion part, the authors should address whether these observed phenotypes in this study can be due to the loss of heme transporting activities upon eliminating TANGO2 homologs. This action can improve the merit of academic debate and collaboration.

      Thank you for this suggestion. The following text has been added to the Discussion section (page 9):

      “In addition to altered pharyngeal pumping, DKO worms displayed multiple previously unreported phenotypic features, suggesting a broader metabolic impairment and reminiscent of some clinical manifestations observed in patients with TDD. Elucidating the mechanisms underlying this phenotype, and whether they reflect a core bioenergetic defect, is an active area of investigation in our lab. Several C. elegans heme-responsive genes have been characterized, revealing relatively specific defects in heme uptake or utilization rather than broad organismal dysfunction. For example, hrg-1 and hrg-4 mutants exhibit impaired growth only under heme-limited conditions,[23] and hrg-3 loss affects brood size and embryonic viability specifically when maternal heme is scarce.[24] By contrast, hrg-9 and hrg-10 mutants exhibit the most severe organismal phenotypes of the hrg family, to date, including reduced pharyngeal pumping, decreased motility, shortened lifespan, and smaller broods, even when fed a heme-replete diet.”

      Reviewer #2 (Public review):

      (1) The manuscript is written mainly as a criticism of a previously published paper. Although reproducibility in science is an issue that needs to be acknowledged, a manuscript should focus on the new data and the experiments that can better prove and strengthen the new claims.

      Thank you for this suggestion. While the primary intent of this study was to replicate key findings from the 2022 publication by Sun et al., the revised manuscript now emphasizes underlying mechanisms more broadly rather than focusing narrowly on that prior publication.

      (2) The current presentation of the logic of the study and its results does not help the authors deliver their message, although they possess great potential.

      We have attempted to rectify this through substantial revision of the Discussion section and other places throughout the manuscript.

      (3) The study is missing experiments to link hrg-9 and hrg-10 more directly to bioenergetic and oxidative stress pathways.

      The reviewer is correct in this assertion, but it was not our intent to definitively prove this link or, indeed, the primary mechanism of TANGO2 in the present manuscript. That said, we are actively engaged in this endeavor in our lab and anticipate these data will be published in a separate, forthcoming publication.

      We have added additional references pertaining to hrg-9 enrichment as part of the mitochondrial unfolded protein response (page 10) and a comparison of the phenotype observed in hrg-9 and hrg-10 deficient worms versus those lacking other proteins in the hrg family (page 9).

      Reviewer #3 (Public review):

      (1) The authors stress - with evidence provided in this paper or indicated in the literature - that the primary role of TANGO2 and its homologues is unlikely to be related to heme trafficking, arguing that observed effects on heme transport are instead downstream consequences of aberrant cellular metabolism. But in light of a mounting body of evidence (referenced by the authors) connecting more or less directly TANGO2 to heme trafficking and mobilization, it is recommended that the authors comment on how they think TANGO2 could relate to and be essential for heme trafficking, albeit in a secondary, moonlighting capacity. This would highlight a seemingly common theme in emerging key players in intracellular heme trafficking, as it appears to be the case for GAPDH - with accumulating evidence of this glycolytic enzyme being critical for heme delivery to several downstream proteins.

      TANGO2 is essential for mitochondrial health, albeit in a yet unknown capacity. In the absence of TANGO2, defects in heme trafficking may be secondary sequelae of mitochondrial dysfunction. We would point out that prior studies that attempted to show that TANGO2 and its homologs are involved in heme trafficking proposed very different mechanisms (direct binding vs. membrane protein interaction) and relied on artificially low or high heme conditions to produce these effects. We have attempted to address these more clearly in the Discussion section and have added a fifth figure to summarize our current unifying theory for how heme levels and mitochondrial stress may be linked.

      (2) The observation - using eat-2 mutants and lawn avoidance behaviour - that survival patterns can be partially explained by reduced consumption, is fascinating. It would be interesting to quantify the two relative contributions.

      We have completed additional ZnMP experiments in liquid media at the reviewers’ request. This experimental condition eliminates lawn avoidance as a factor in consumption. Fluorescent intensity was significantly higher in the DKO worms in media lacking ZnMP, indicating increased autofluorescence in DKO worms, while signal was not significantly different in media with ZnMP.

      (3) In the legend to Figure 1A it's a bit unclear what the differently coloured dots represent for each condition. Repeated measurements, worms, independent experiments? The authors should clarify this.

      The following sentence has been added to the legend for Figure 1:

      “Each dot represents the number of offspring laid by one adult worm on one GaPP-treated plate after 24 hours.”

      (4) It would help if the entire fluorescence images (raw and processed) for the ZnMP treatments were provided. Fluorescence images would also benefit Figure 1B.

      Fluorescent intensity values pertaining to the ZnMP experiments are included in our Extended Data supplement, and we have added representative images to Figure 1, per the reviewer’s request. We thank the reviewer for this helpful suggestion. We would be happy to upload raw images to an open-access repository if deemed necessary by the editorial team.

      (5) Increasingly, the understanding of heme-dependent roles relies on transient or indirect binding to unsuspected partners, not necessarily relying on a tight affinity and outdating the notion of heme as a static cofactor. Despite impressive recent advancements in the detection of these interactions (for example https://doi.org/10.1021/jacs.2c06104; cited by the authors), a full characterisation of the hemome is still elusive. Sandkuhler et al. deemed it possible but seem to question that heme binding to TANGO2 occurs. However, Sun et al. convincingly showed and characterised TANGO2 binding to heme. It is recommended that the authors comment on this.

      We believe it is plausible that TANGO2 binds heme (as do hundreds of other proteins), especially as it has been shown to bind other hydrophobic molecules. However, we also note that a separate paper examining the role of TANGO2 in heme transport posited that GAPDH is the sole heme binding partner for cytoplasmic transport (https://doi.org/10.1038/s41467-025-62819-2), contradicting the originally posited theory of how TANGO2 functions. This is described in the Discussion section and, as noted above, we have added an additional figure to demonstrate our unifying hypothesis for why TANGO2 may be important in the low-heme state, irrespective of any direct effect on heme trafficking.

      Additional comments and revisions:

      (1) It was suggested that a triple mutant (eat-2; hrg-9; hrg-10) be tested to determine the primary driver of GaPP toxicity. We appreciate this suggestion, but we offer the following rationale for why these experiments were not pursued. The eat-2 mutant, which lacks a nicotinic acetylcholine receptor subunit in pharyngeal muscles, was included solely as a dietary restriction control to illustrate that reduced GaPP toxicity in the hrg-9/10 double mutant could arise from poor feeding rather than defective heme transport. Both eat-2 and hrg-9/10 mutants exhibit markedly reduced feeding but via different mechanisms. In our assays, GaPP survival was inversely correlated with ingestion rate: eat-2 animals, which feed the least, showed the highest survival, while hrg-9/10 mutants showed intermediate feeding and intermediate survival. Consistent with this, eat-2 worms also displayed the lowest ZnMP accumulation.

      (2) GaPP solution was added to NGM plates after seeding with OP50. This is now expressly stated in the Methods section (page 15). We would note that Sun et al. mixed GaPP in with NGM in the liquid phase. We would expect that if there were a difference in GaPP exposure due to these different protocols, worms in our experiment would have received higher GaPP concentrations.

      “Standard NGM plates were treated with 1, 2, 5, or 10 µM gallium protoporphyrin IX (GaPP; Santa Cruz) after seeding with OP50. Plates were swirled to ensure an even distribution of GaPP and allowed to dry completely.”

      (3) The manuscript has been reworked to read as more of an independent study rather than a rebuttal of prior work, though the primary objective of validating prior work remains unchanged.

      (4) Several technical details of experiments have been moved from the main text to the materials and methods section.

      (5) One reviewer noted that the figure numbering should be adjusted. Numbering does not progress sequentially (i.e., 1A…1B…2A…2B) early in the text, because we have opted to consolidate data pertaining to heme analog experiments in Figure 1 and behavioral data in Figure 2.

      (6) “Kingdoms” has been changed to “domains” (page 4).

      (7) Example images are now included for Figure 1B, as noted above.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This study introduces an important approach using selection linked integration (SLI) to generate Plasmodium falciparum lines expressing single, specific surface adhesins PfEMP1 variants, enabling precise study of PfEMP1 trafficking, receptor binding, and cytoadhesion. By moving the system to different parasite strains and introducing an advanced SLI2 system for additional genomic edits, this work provides compelling evidence for an innovative and rigorous platform to explore PfEMP1 biology and identify novel proteins essential for malaria pathogenesis including immune evasion.

      Reviewer #1 (Public review):

      One of the roadblocks in PfEMP1 research has been the challenges in manipulating var genes to incorporate markers to allow the transport of this protein to be tracked and to investigate the interactions taking place within the infected erythrocyte. In addition, the ability of Plasmodium falciparum to switch to different PfEMP1 variants during in vitro culture has complicated studies due to parasite populations drifting from the original (manipulated) var gene expression. Cronshagen et al have provided a useful system with which they demonstrate the ability to integrate a selectable drug marker into several different var genes that allows the PfEMP1 variant expression to be 'fixed'. This on its own represents a useful addition to the molecular toolbox and the range of var genes that have been modified suggests that the system will have broad application. As well as incorporating a selectable marker, the authors have also used selective linked integration (SLI) to introduce markers to track the transport of PfEMP1, investigate the route of transport, and probe interactions with PfEMP1 proteins in the infected host cell.

      What I particularly like about this paper is that the authors have not only put together what appears to be a largely robust system for further functional studies, but they have used it to produce a range of interesting findings including:

      Co-activation of rif and var genes when in a head-to-head orientation.

      The reduced control of expression of var genes in the 3D7-MEED parasite line.

      More support for the PTEX transport route for PfEMP1.

      Identification of new proteins involved in PfEMP1 interactions in the infected erythrocyte, including some required for cytoadherence.

      In most cases the experimental evidence is straightforward, and the data support the conclusions strongly. The authors have been very careful in the depth of their investigation, and where unexpected results have been obtained, they have looked carefully at why these have occurred.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      (1) In terms of incorporating a drug marker to drive mono-variant expression, the authors show that they can manipulate a range of var genes in two parasite lines (3D7 and IT4), producing around 90% expression of the targeted PfEMP1. Removal of drug selection produces the expected 'drift' in variant types being expressed. The exceptions to this are the 3D7-MEED line, which looks to be an interesting starting point to understand why this variant appears to have impaired mutually exclusive var gene expression and the EPCR-binding IT4var19 line. This latter finding was unexpected and the modified construct required several rounds of panning to produce parasites expressing the targeted PfEMP1 and bind to EPCR. The authors identified a PTP3 deficiency as the cause of the lack of PfEMP1 expression, which is an interesting finding in itself but potentially worrying for future studies. What was not clear was whether the selected IT4var19 line retained specific PfEMP1 expression once receptor panning was removed.

      We do not have systematic long-term data for the Var19 line but do have medium-term data. After panning the Var19 line, the binding assays were done within 3 months without additional panning. The first binding assay was 2 months after the panning and the last binding assays three weeks later, totaling about 3 months without panning. While there is inherent variation in these assays that precludes detection of smaller changes, the last assay showed the highest level of binding, giving no indication for rapid loss of the binding phenotype. Hence, we can say that the binding phenotype appears to be stable for many weeks without panning the cells again and there was no indication for a rapid loss of binding in these parasites.

      Systematic long-term experiments to assess how long the Var19 parasites retain binding would be interesting, but given that the binding-phenotype appears to remain stable over many weeks or even months, this would only make sense if done over a much longer time frame. Such data might arise if the line is used over extended times for a specific project in which case it might be advisable to monitor continued binding. We included a statement in the discussion that the binding phenotype was stable over many weeks but that if long-term work with this line is planned, monitoring the binding phenotype might be advisable: “In the course of this work the binding phenotype of the IT4var19 expressor line remained stable over many weeks without further panning. However, given that initial panning had been needed for this particular line, it might be advisable for future studies to monitor the binding phenotype if the line is used for experiments requiring extended periods of cultivation.”

      (2) The transport studies using the mDHFR constructs were quite complicated to understand but were explained very clearly in the text with good logical reasoning.

      We are aware of this being a complex issue and are glad this was nevertheless understandable.

      (3) By introducing a second SLI system, the authors have been able to alter other genes thought to be involved in PfEMP1 biology, particularly transport. An example of this is the inactivation of PTP1, which causes a loss of binding to CD36 and ICAM-1. It would have been helpful to have more insight into the interpretation of the IFAs as the anti-SBP1 staining in Figure 5D (PTP-TGD) looks similar to that shown in Figure 1C, which has PTP intact. The anti-EXP2 results are clearly different.

      We realize the description of the PTP1-TGD IFA data and that of the other TGDs (see also response to Recommendation to authors point 4 and reviewer 2, major points 6 and 7) was rather cursory. The previously reported PTP1 phenotype is a fragmentation of the Maurer’s clefts into what in IFA appear to be many smaller pieces (Rug et al 2014, referenced in the manuscript). The control in Fig. 5D has 13 Maurer’s cleft spots (previous work indicates an average of ~15 MC per parasite, see e.g. the originally co-submitted eLife preprint doi.org/10.7554/eLife.103633.1 and references therein). The control mentioned by the reviewer in Fig. 1C has about 22 Maurer’s clefts foci, at the upper end of the typical range, but not unusual. In contrast, the PTP1-TGD in Fig. 5D, has more than 30 foci with an additional cytoplasmic pool and additional smaller, difficult to count foci. This is consistent with the published phenotype in Rug et al 2014. The EXP1 stained cell has more than 40 Maurer’s cleft foci, again beyond what typically is observed in controls. Therefore, these cells show a difference to the control in Fig. 5 but also to Fig. 1C. Please note that we are looking at two different strains, in Fig. 1 it is 3D7 and in Fig. 5 IT4. While we did not systematically assess this, the Maurer’s clefts number per cell seemed to be largely comparable between these strains (Fig. 10C and D in the other eLife preprint doi.org/10.7554/eLife.103633.1). 

      Overall, as the PTP1 loss phenotype has already been reported, we did not go into more experimental detail. However, we now modified the text to more clearly describe how the phenotype in the PTP1-TGD parasites was different to control: “IFAs showed that in the PTP1-TGD parasites, SBP1 and PfEMP1 were found in many small foci in the host cell that exceeded the average number of ~ 15 Maurer’s clefts typically found per infected RBC [66] (Fig. 5D). This phenotype resembled the previously reported Maurer’s clefts phenotype of the PTP1 knock out in CS2 parasites [39].”

      (4) It is good to see the validation of PfEMP1 expression includes binding to several relevant receptors. The data presented use CHO-GFP as a negative control, which is relevant, but it would have been good to also see the use of receptor mAbs to indicate specific adhesion patterns. The CHO system is fine for expression validation studies, but due to the high levels of receptor expression on these cells, moving to the use of microvascular endothelial cells would be advisable. This may explain the unexpected ICAM-1 binding seen with the panned IT4var19 line.

      We agree with the reviewer that it is desirable to have better binding systems for studying individual binding interactions. As the main purpose of this paper was to introduce the system and provide proof of principle that the cells show binding, we did not move to more complicated binding systems. However, we would like to point out that the CSA binding was done on receptor alone in addition to the CSA-expressing HBEC-5i cells and was competed successfully with soluble CSA. In addition, apart from the additional ICAM-1 binding of the Var19 line, all binding phenotypes conformed to expectations. We therefore hope the tools used for binding studies are acceptable at this stage of introducing the system while future work interested in specific PfEMP1 receptor interactions may use better systems, tailored to the specific question (e.g. endothelial organoid models and engineered human capillaries and inhibitory antibodies or relevant recombinant domains for competition).

      (5) The proxiome work is very interesting and has identified new leads for proteins interacting with PfEMP1, as well as suggesting that KAHRP is not one of these. The reduced expression seen with BirA* in position 3 is a little concerning but there appears to be sufficient expression to allow interactions to be identified with this construct. The quantitative impact of reduced expression for proxiome experiments will clearly require further work to define it.

      This is a valid point. Clearly there seems to be some impact on binding when BirA* is placed in the extracellular domain (either through reduced presentation or direct reduction of binding efficiency of the modified PfEMP1; please see also minor comment 10 reviewer 2). The exact quantitative impact on the proxiome is difficult to assess but we note that the relative enrichment of hits to each other is rather similar to the other two positions (Fig. 6H-J). We therefore believe the BioIDs with the 3 PfEMP1-BirA* constructs are sufficient to provide a general coverage of proteins proximal to PfEMP1 and hope this will aid in the identification of further proteins involved in PfEMP1 transport and surface display as illustrated with two of the hits targeted here.

      The impact of placing a domain on the extracellular region of PfEMP1 will have to be further evaluated if needed in other studies. But the finding that a large folded domain can be placed into this part at all, even if binding was reduced, in our opinion is a success (it was not foreseeable whether any such change would be tolerated at all).

      (6) The reduced receptor binding results from the TryThrA and EMPIC3 knockouts were very interesting, particularly as both still display PfEMP1 on the surface of the infected erythrocyte. While care needs to be taken in cross-referencing adhesion work in P. berghei and whether the machinery truly is functionally orthologous, it is a fair point to make in the discussion. The suggestion that interacting proteins may influence the "correct presentation of PfEMP1" is intriguing and I look forward to further work on this.

      We hope future work will be able to shed light on this.

      Overall, the authors have produced a useful and reasonably robust system to support functional studies on PfEMP1, which may provide a platform for future studies manipulating the domain content in the exon 1 portion of var genes. They have used this system to produce a range of interesting findings and to support its use by the research community. Finally, a small concern. Being able to select specific var gene switches using drug markers could provide some useful starting points to understand how switching happens in P. falciparum. However, our trypanosome colleagues might remind us that forcing switches may show us some mechanisms but perhaps not all.

      Point noted! From non-systematic data with the Var01 line that has been cultured for extended periods of time (several years), it seems other non-targeted vars remain silent in our SLI “activation” lines but how much SLI-based var-expression “fixing” tampers with the integrity of natural switching mechanisms is indeed very difficult to gauge at this stage. We now added a statement to the discussion that even if mutually exclusive expression is maintained, it is not certain the mechanisms controlling var expression all remain intact: “However, it should be noted that it is not known whether all mechanisms controlling mutually exclusive expression and switching remain intact in parasites with SLI-activated var genes.”

      Reviewer #2 (Public review):

      Summary

      Cronshagen et al. develop a range of tools based on selection-linked integration (SLI) to study PfEMP1 function in P. falciparum. PfEMP1 is encoded by a family of ~60 var genes subject to mutually exclusive expression. Switching expression between different family members can modify the binding properties of the infected erythrocyte while avoiding the adaptive immune response. Although critical to parasite survival and malaria disease pathology, PfEMP1 proteins are difficult to study owing to their large size and variable expression between parasites within the same population. The SLI approach previously developed by this group for genetic modification of P. falciparum is employed here to selectively and stably activate the expression of target var genes at the population level. Using this strategy, the binding properties of specific PfEMP1 variants were measured for several distinct var genes with a novel semi-automated pipeline to increase throughput and reduce bias. Activation of similar var genes in both the common lab strain 3D7 and the cytoadhesion competent FCR3/IT4 strain revealed higher binding for several PfEMP1 IT4 variants with distinct receptors, indicating this strain provides a superior background for studying PfEMP1 binding. SLI also enables modifications to target var gene products to study PfEMP1 trafficking and identify interacting partners by proximity-labeling proteomics, revealing two novel exported proteins required for cytoadherence. Overall, the data demonstrate a range of SLI-based approaches for studying PfEMP1 that will be broadly useful for understanding the basis for cytoadhesion and parasite virulence.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      Comments

      (1) While the capability of SLI to actively select var gene expression was initially reported by Omelianczyk et al., the present study greatly expands the utility of this approach. Several distinct var genes are activated in two different P. falciparum strains and shown to modify the binding properties of infected RBCs to distinct endothelial receptors; development of SLI2 enables multiple SLI modifications in the same parasite line; SLI is used to modify target var genes to study PfEMP1 trafficking and determine PfEMP1 interactomes with BioID. Curiously, Omelianczyk et al. activated a single var (Pf3D7_0421300) and observed elevated expression of an adjacent var arranged in a head-to-tail manner, possibly resulting from local chromatin modifications enabling expression of the neighboring gene. In contrast, the present study observed activation of neighboring genes with head-to-head but not head-to-tail arrangement, which may be the result of shared promoter regions. The reason for these differing results is unclear although it should be noted that the two studies examined different var loci.

      The point that we are looking at different loci is very valid and we realize this is not mentioned in the discussion. We now added to the discussion that it is unclear if our results and those cited may be generalized and that different var gene loci may respond differently:

      “However, it is unclear if this can be generalized and it is possible that different var loci respond differently.”

      (2) The IT4var19 panned line that became binding-competent showed increased expression of both paralogs of ptp3 (as well as a phista and gbp), suggesting that overexpression of PTP3 may improve PfEMP1 display and binding. Interestingly, IT4 appears to be the only known P. falciparum strain (only available in PlasmoDB) that encodes more than one ptp3 gene (PfIT_140083100 and PfIT_140084700). PfIT_140084700 is almost identical to the 3D7 PTP3 (except for a ~120 residue insertion in 3D7 beginning at residue 400). In contrast, while the C-terminal region of PfIT_140083100 shows near-perfect conservation with 3D7 PTP3 beginning at residue 450, the N-terminal regions between the PEXEL and residue 450 are quite different. This may indicate the generally stronger receptor binding observed in IT4 relative to 3D7 results from increased PTP3 activity due to multiple isoforms or that specialized trafficking machinery exists for some PfEMP1 proteins.

      We thank the reviewer for pointing this out. The exact differences between the two PTP3s of IT4 and that of other strains should definitely be closely examined if the function of these proteins in PfEMP1 binding is analysed in more detail.

      It is an interesting idea that the PTP3 duplication could be a reason for the superior binding of IT4. We always assumed that IT4 had better binding because it was less culture adapted but this does not preclude that PTP3(s) is(are) a reason for this. However, at least in our 3D7, PTP3 can’t be the reason for the poor binding, as our 3D7 still has PfEMP1 on the surface, while in the unpanned IT4-Var19 line and in the ptp3 KO of Maier et al., Cell 2008 (PMID: 18614010), PfEMP1 is not on the surface anymore.

      Testing the impact of having two PTP3s would be interesting, but given the “mosaic” similarity of the two PTP3 isoforms, a simple add-on experiment might not be informative. Nevertheless, it will be interesting in future work to explore this in more detail.

      Reviewer #3 (Public review):

      Summary:

      The submission from Cronshagen and colleagues describes the application of a previously described method (selection linked integration) to the systematic study of PfEMP1 trafficking in the human malaria parasite Plasmodium falciparum. PfEMP1 is the primary virulence factor and surface antigen of infected red blood cells and is therefore a major focus of research into malaria pathogenesis. Since the discovery of the var gene family that encodes PfEMP1 in the late 1990s, there have been multiple hypotheses for how the protein is trafficked to the infected cell surface, crossing multiple membranes along the way. One difficulty in studying this process is the large size of the var gene family and the propensity of the parasites to switch which var gene is expressed, thus preventing straightforward gene modification-based strategies for tagging the expressed PfEMP1. Here the authors solve this problem by forcing the expression of a targeted var gene by fusing the PfEMP1 coding region with a drug-selectable marker separated by a skip peptide. This enabled them to generate relatively homogenous populations of parasites all expressing tagged (or otherwise modified) forms of PfEMP1 suitable for study. They then applied this method to study various aspects of PfEMP1 trafficking.

      Strengths:

      The study is very thorough, and the data are well presented. The authors used SLI to target multiple var genes, thus demonstrating the robustness of their strategy. They then perform experiments to investigate possible trafficking through PTEX, they knock out proteins thought to be involved in PfEMP1 trafficking and observe defects in cytoadherence, and they perform proximity labeling to further identify proteins potentially involved in PfEMP1 export. These are independent and complementary approaches that together tell a very compelling story.

      We thank the reviewer for the kind assessment and the comments to improve the paper.

      Weaknesses:

      (1)  When the authors targeted IT4var19, they were successful in transcriptionally activating the gene, however, they did not initially obtain cytoadherent parasites. To observe binding to ICAM-1 and EPCR, they had to perform selection using panning. This is an interesting observation and potentially provides insights into PfEMP1 surface display, folding, etc. However, it also raises questions about other instances in which cytoadherence was not observed. Would panning of these other lines have been successfully selected for cytoadherent infected cells? Did the authors attempt panning of their 3D7 lines? Given that these parasites do export PfEMP1 to the infected cell surface (Figure 1D), it is possible that panning would similarly rescue binding. Likewise, the authors knocked out PTP1, TryThrA, and EMPIC3 and detected a loss of cytoadhesion, but they did not attempt panning to see if this could rescue binding. To ensure that the lack of cytoadhesion in these cases is not serendipitous (as it was when they activated IT4var19), they should demonstrate that panning cannot rescue binding.

      These are very important considerations. Indeed, we had repeatedly attempted to pan 3D7 when we failed to get the SLI-generated 3D7 PfEMP1 expressor lines to bind, but this had not been successful. The lack of binding had been a major obstacle that had held up the project and was only solved when we moved to IT4 which readily bound (apart from Var19 which was created later in the project). After that we made no further efforts to understand why 3D7 does not bind but the fact that PfEMP1 is on the surface indicates this is not a PTP3 issue because loss of PTP3 also leads to loss of PfEMP1 surface display. Also, as the parent 3D7 could not be panned, we assumed this issue is not easily fixed in the SLI var lines we made in 3D7.

      Panning the TGD lines: we see the reasoning for conducting panning experiments with the TGD lines. However, on second thought, we are unsure this should be attempted. The outcome might not be easily interpretable as at least two forces will contribute to the selection in panning experiments with TGD lines that do not bind anymore:

      Firstly, panning would work against the SLI of the TGD, resulting in a tug of war between the TGD-SLI and binding. This is because a small number of parasites will loop out the TGD plasmid (revert) and would normally be eliminated during standard culturing due to the SLI drug used for the TGD. These revertant cells would bind and the panning would enrich them. Hence, panning and SLI are opposed forces in the case of a TGD abolishing binding. It is unclear how strong this effect would be, but this would for sure lead to mixed populations that complicate interpretations. 

      The second selecting force is possible compensatory changes to restore binding. These can be due to different causes: (i) reversal of potential independent changes that may have occurred in the TGD parasites and that are in reality causing the binding loss (i.e. such as ptp3 loss or similar, the concern of the reviewer) or (ii) new changes to compensate for the loss of the TGD target (in this case the TGD is the cause of the binding loss but a different change ameliorates it, for instance by increasing PfEMP1 expression or surface display). As both TGDs show some residual binding and have VAR01 on the surface to at least some extent, it is possible that new compensatory changes might indeed occur that indirectly increase binding again.

      In summary, even if more binding occurs after panning of the lines, it is not clear whether this is due to a compensatory change ameliorating the TGD, reversal of an unrelated change, or counter-selection against the SLI. To determine the cause, the panned TGD lines would need to be subjected to a complex and time-consuming analysis (WGS, RNASeq, possibly Maurer’s clefts phenotype) to find out whether they were SLI-revertants, had an unrelated change that was reverted, or had a new compensatory change that helps binding. This might be further muddled if a mix of cells with different changes from the options indicated above comes out of the selection. In that case, it might even require scRNASeq to make sense of the panning experiment. Due to the envisaged difficulty in interpreting the outcome, we did not attempt this panning.

      To exclude loss of ptp3 expression as the reason for binding loss (something we would not have seen in the WGS if it is only due to a transcriptional change), we now carried out RNASeq with the TGD lines that have a binding phenotype. While we did not generate replicas to obtain quantitative data, the results show that both ptp3 copies were expressed in these TGDs comparable to other parasite lines that do bind with the same SLI-activated var gene, indicating that the effect is not due to ptp3 (see response to point 4 on PTP3 expression in the Recommendations for the authors). While we can’t fully exclude other changes in the TGDs that might affect binding, the WGS did not show any obvious alterations that could be responsible for this. 

      (2) The authors perform a series of trafficking experiments to help discern whether PfEMP1 is trafficked through PTEX. While the results were not entirely definitive, they make a strong case for PTEX in PfEMP1 export. The authors then used BioID to obtain a proxiome for PfEMP1 and identified proteins they suggest are involved in PfEMP1 trafficking. However, it seemed that components of PTEX were missing from the list of interacting proteins. Is this surprising and does this observation shed any additional light on the possibility of PfEMP1 trafficking through PTEX? This warrants a comment or discussion.

      This is an interesting point and we agree that this warrants discussion. A likely reason why PTEX components are not picked up as interactors is that BirA* is expected to be unfolded when it passes through the channel and in that state can’t biotinylate. Labelling likely would only be possible if PfEMP1 lingered at the PTEX translocation step before BirA* became unfolded to go through the channel, which we would not expect under physiological conditions. We added the following sentences to the discussion: “While our data indicates PfEMP1 uses PTEX to reach the host cell, this could be expected to have resulted in the identification of PTEX components in the PfEMP1 proxiomes, which was not the case. However, as BirA* must be unfolded to pass through PTEX, it likely is unable to biotinylate translocon components unless PfEMP1 is stalled during translocation. For this reason, a lack of PTEX components in the PfEMP1 proxiomes does not necessarily exclude passage through PTEX.”

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Most of my comments are in the public section. I would just highlight a few things:

      (1) In the binding studies section you talk about "human brain endothelial cells (HBEC-5i)". These cells do indeed express CSA but this is a property of their immortalisation rather than being brain endothelium, which does not express CSA. I think this could be confusing to readers so you might want to reword this sentence to focus on the cell line expressing CSA rather than other features.

      We thank the reviewer for pointing this out, we now modified the sentence to focus on the fact these are CSA expressing cells and provided a reference for it.

      (2) As I said in the public section, CHO cells are great for proof of concept studies, but they are not endothelium. Not a problem for this paper.

      Noted! Please also see our response to the public review.

      (3) I wonder whether your comment about how well tolerated the BirA* insertion is may be a bit too strong. I might say "Nonetheless, overall the BirA* modified PfEMP1 were functional."

      Changed as requested.

      (4) I'm not sure how you explain the IFA staining patterns to the uninitiated, but perhaps you could explain some of the key features you are looking for.

      We apologise for not giving an explanation of the IFA staining patterns in the first place. Please see detailed response to public review of this reviewer (point 3 on PTP1-TGD phenotype) and to reviewer 2 (Recommendations to the authors, points 6 and 7 on better explaining and quantifying the Maurer’s clefts phenotypes). For this we now also generated parasites that episomally express mCherry tagged SBP1 in the TGD parasites with the reduced binding phenotype. This resulted in amendments to Fig. S7, addition of a Fig. S8 and updated results to better explain the phenotypes. 

      This is a great paper - I just wish I'd had this system before.

      Thank you!

      Reviewer #2 (Recommendations for the authors):

      Major Comments

      (1) Does the RNAseq analysis of 3D7var0425800 and 3D7MEEDvar0425800 (Figure 1G, H) reveal any differential gene expression that might suggest a basis for loss of mutually exclusive var expression in the MEED line?

      We now carried out a thorough analysis of these RNASeq experiments to look for an underlying cause for the phenotype. This was added as new Figure 1J and new Table S3. This analysis again illustrated the increased transcript levels of var genes. In addition, it showed that transcripts of a number of other exported proteins, including members of other gene families, were up in the MEED line. 

      One hit that might be causal of the phenotype was sip2, which was down by close to 8-fold (pAdj 0.025). While recent work in P. berghei found this ApiAP2 to be involved in the expression of merozoite genes (Nishi et al., Sci Advances 2025 (PMID: 40117352)), previous work in P. falciparum showed that it binds heterochromatic telomere regions and certain var upstream regions (Flück et al., PlosPath 2010 (PMID: 20195509), now cited in the manuscript). The other notable change was an upregulation of the non-coding RNA ruf6 which had been linked with impaired mono-allelic var expression (Guizetti et al., NAR 2016 (PMID: 27466391), now also cited in the manuscript). While it would go beyond this manuscript to follow this up, it is conceivable that alterations in chromosome end biology due to sip2 downregulation or upregulation of ruf6 are causes of the observed phenotype.

      We now added a paragraph on the more comprehensive analysis of the RNA Seq data of the MEED vs non-MEED lines at the end of the second results section.

      (2) Could the inability of the PfEMP1-mDHFR fusion to block translocation (Fig 2A) reflect unique features of PfEMP1 trafficking, such as the existence of a soluble, chaperoned trafficking state that is not fully folded? Was a PfEMP1-BPTI fusion ever tested as an alternative to mDHFR?

      This is an interesting suggestion. The PfEMP1-BPTI was never tested. However, a chaperoned trafficking state would likely also affect BPTI. Given that both domains (mDHFR and BPTI) in principle do the same when folded and would block when the construct is in the PV, it is not so likely that using a different blocking domain would make a difference. Therefore, the scenario where BPTI would block when mDHFR does not is not that probable. The opposite would be possible (mDHFR blocking while BPTI does not, because only the latter depends on the redox state). However, this would only happen if the block occurred before the construct reaches the PV.

      At present, we believe the lacking block to be due to the organization of the domains in the construct. In the PfEMP1-mDHFR construct in this manuscript the position of the blocking domain is further away from the TMD compared to all other previously tested mDHFR fusions. Increased distance to the TMD has previously been found to be a factor impairing the blocking function of mDHFR (Mesen-Ramirez et al., PlosPath 2016 (PMID: 27168322)). Hence, our suspicion that this is the reason for the lacking block with the PfEMP1-mDHFR rather than the type of blocking domain. However, the latter option can’t be fully excluded and we might test BPTI in future work.

      (3) The late promoter SBP1-mDHFR is 2A fused with the KAHRP reporter. Since 2A skipping efficiency varies between fusion contexts and significant amounts of unskipped protein can be present, it would be helpful to include a WB to determine the efficiency of skipping and provide confidence that the co-blocked KAHRP in the +WR condition (Fig 2D) is not actually fused to the C-terminus of SBP1-mDHFR-GFP.

      Fortunately, this T2A fusion (crt_SBP1-mDHFR-GFP-2A-KAHRP-mScarlet<sup>epi</sup>) was used before in work that included a Western blot showing its efficient skipping (S3 A Fig in Mesen-Ramirez et al., PlosPath 2016). In agreement with this Western blot result, fluorescence microscopy showed very limited overlap of SBP1-mDHFR-GFP and KAHRP-mCherry in absence of WR (Fig. 3B in Mesen-Ramirez et al., PlosPath 2016 and Fig. 2 in this manuscript) which would not be the case if these two constructs were fused together. Please note that KAHRP is known to transiently localize to the Maurer’s clefts before reaching the knobs (Wickham et al., EMBOJ 2001, PMID: 11598007), and therefore occasional overlap with SBP1 at the Maurer’s clefts is expected. However, we would expect much more overlap if a substantial proportion of the construct population were not skipped, and therefore the co-blocked KAHRP-mCherry in the +WR sample is unlikely to be due to inefficient skipping and attachment to SBP1-mDHFR-GFP.

      (4) Does comparison of RNAseq from the various 3D7 and IT4 lines in the study provide any insight into PTP3 expression levels between strains with different binding capacities? Was the expression level of ptp3a/b in the IT4var19 panned line similar to the expression in the parent or other activated IT4 lines? Could the expanded ptp3 gene number in IT4 indicate that specialized trafficking machinery exists for some PfEMP1 proteins (ie, IT4var19 requires the divergent PTP3 paralog for efficient trafficking)?

      PTP3 in the different IT4 lines that bind:

      In those parasite lines that did bind, the intrinsic variation in the binding assays, the different binding properties of different PfEMP1 variants and the variation in RNA Seq experiments to compare different parasite lines precludes a correlation of binding level vs ptp3 expression. For instance, if a PfEMP1 variant has lower binding capacity, ptp3 may still be higher but binding would be lower than if comparing to a parasite line with a better binding PfEMP1 variant. Studying the effect of PTP3 levels on binding could probably be done by overexpressing PTP3 in the same PfEMP1 SLI expressor line and assessing how this affects binding, but this would go beyond this manuscript.

      PTP3 in panned vs unpanned Var19:

      We did some comparisons between the IT4 parent and the IT4-Var19 panned and unpanned lines (see Author response table 1). This did not reveal any clear associations. While the parent had somewhat lower ptp3 transcript levels, they were still clearly higher than in the unpanned Var19 line, and other lines also had ptp3 levels comparable to the panned IT4-Var19 (see Author response table 2).

      PTP3 in the TGDs and possible reason for binding phenotype:

      A key point is whether PTP3 could have influenced the lack of binding in the TGD lines (see also weakness section and point 1 of public review of reviewer 3: ptp3 may be an indirect cause resulting in lacking binding in TGD parasites). We now did RNA Seq to check for ptp3 expression in the relevant TGD lines although we did not do a systematic quantitative comparison (which would require 3 replicates of RNASeq), but we reasoned that loss of expression would also be evident in one replicate. There was no indication that the TGD lines had lost PTP3 expression (see Author response table 2) and this is unlikely to explain the binding loss in a similar fashion to the Var19 parasites. Generally, the IT4 lines showed expression of both ptp3 genes and only in the Var19 parasites before panning were the transcript levels considerably lower:

      Author response table 1.

      Parent vs IT4-Var19 panned and unpanned

      Author response table 2.

      TGD lines with binding phenotype vs parent

      The absence of an influence of PTP3 on the binding phenotype in the cell lines in this manuscript (besides Var19) is further supported by its role in PfEMP1 surface display. Previous work has shown that KO of ptp3 leads to a loss of VAR2CSA surface display (Maier et al., Cell 2008). The unpanned Var19 parasite also lacked PfEMP1 surface display, and panning and the resulting appearance of the binding phenotype were accompanied by surface display of PfEMP1. As both the EMPIC3- and TryThrA-TGD lines still had at least some PfEMP1 on the surface, this also (in addition to the RNA Seq above) speaks against PTP3 being the cause of the binding phenotype. The same applies to 3D7, which despite its poor binding displays PfEMP1 on the host cell surface (Figure 1D). This indicates that the binding phenotype in 3D7 is also not due to loss of PTP3 expression, as this would have abolished PfEMP1 surface display.

      The idea about PTP3 paralogs for specific PfEMP1s is intriguing. In the future it might be interesting to test the frequency of parasites with two PTP3 paralogs in endemic settings and correlate it with the PfEMP1 repertoire, variant expression and potentially disease severity. 

      (5) The IT4var01 line shows substantially lower binding in Figure 5F compared with the data shown in Figure 4E and 6F. Does this reflect changes in the binding capacity of the line over time or is this variability inherent to the assay?

      There is some inherent variability in these assays. While we did not systematically assess this, we had no indication that this was due to the parasite line changing. The Var01 line was cultured for months and was frozen down and thawed more than once without a clear gradual trend for more or less binding. While we can’t exclude some variation from the parasite side, we suspect it is more a factor of the expression of the receptor on the CHO cells the iRBCs bind to. 

      Specifically, the assays in Fig. 6F and 4E mentioned by the reviewer both had an average binding to CD36 of around 1000 iE/mm²; only the experiments in Fig. 5F differ (~500 iE/mm²), but these were done with a different batch of CHO cells and at a different time from the experiments in Fig. 6F and 4E.

      (6) In Figure S7A, TryThrA and EMPIC3 show distinct localization as circles around the PfEMP1 signal while PeMP2 appears to co-localize with PfEMP1 or as immediately adjacent spots (strong colocalization is less apparent than SBP1, and the various PfEMP1 IFAs throughout the study). Does this indicate that TryThrA and EMPIC3 are peripheral MC proteins? Does this have any implications for their function in PfEMP1 binding? Some discussion would help as these differences are not mentioned in the text. For the EMPIC3 TGD IFAs, localization of SBP1 and PfEMP1 is noted to be normal but REX1 is not mentioned (although this also appears normal).

      We apologise for the missing description of the candidate localisations and the cursory description of the Maurer’s clefts phenotypes (next point). Our original intent was to not distract too much from the main flow of the manuscript, as almost every part of the manuscript could be followed up with more details. However, we fully agree that this is unsatisfactory and have now provided more description (this point) and more data (next point).

      Localisation of TryThrA and EMPIC3 compared to PfEMP1 at the Maurer’s clefts: the circular pattern is reminiscent of the results with Maurer’s clefts proteins reported by McMillan et al using 3D-SIM in 3D7 parasites (McMillan et al., Cell Microbiology 2014 (PMID: 23421990)). In that work SBP1 and MAHRP1 (both integral TMD proteins) were found in foci but REX1 (no TMD) in circular structures around these foci similar to what we observed here for TryThrA and EMPIC3 which both also lack a TMD. The SIM data in McMillan et al indicated that also PfEMP1 is “more peripheral”, although it did only partially overlap with REX1. The conclusion from that work was that there are sub-compartments at the Maurer’s clefts. In our IFAs (Fig. S7A) PfEMP1 is also only partially overlapping with the TryThrA and EMPIC3 circles, potentially indicating similar subcompartments to those observed by 3D-SIM. We agree with the reviewer that this might be indicative of peripheral MC proteins, fitting with a lack of TMD in these candidates, but we did not further speculate on this in the manuscript.

      We have now added enlargements of the ring-like structures to better illustrate this observation in Fig. S7A. In addition, we now specifically mention the localization data and the ring-like signal with TryThrA and EMPIC3 in the results and state that this may be similar to the observations by McMillan et al., Cell Microbiology 2014.

      We also thank the reviewer for pointing out that we had forgotten to mention REX1 in the EMPIC3-TGD; this has been amended.

      (7) The atypical localization in TryThrA TGD line claimed for PfEMP1 and SBP1 in Fig S7B is not obvious. While most REX1 is clustered into a few spots in the IFA staining for SBP1 and REX1, SBP1 is only partially located in these spots and appears normal in the above IFA staining for SBP1 and HA. The atypical localization of PfEMP1-HA is also not obvious to me. The authors should clarify what is meant by "atypical" localization and provide support with quantification given the difference between the two SBP1 images shown.

      We apologise for the inadequate description of these IFA phenotypes. The abnormal signal for SBP1, REX1 and PfEMP1 in the TryThrA-TGD included two phenotypes found with all 3 proteins: 

      (1) a dispersed signal for these proteins in the host cell in addition to foci (the control and the other TGD parasites have only dots in the host cell with no or very little detectable dispersed signal). 

      (2) foci of disproportionally high intensity and size, that we assumed might be aggregation or enlargement of the Maurer’s clefts or of the detected proteins.

      The reason for the difference between the REX1 (aggregation) phenotype and the PfEMP1 and SBP1 (dispersed signal, more smaller foci) phenotypes in the images in Fig. S7B is that both phenotypes were seen with all 3 proteins but we chose a REX1 stained cell to illustrate the aggregation phenotype (the SBP1 signal in the same cell is similar to the REX1 signal, illustrating that this phenotype is not REX1 specific; please note that this cell also has a dispersed pool of REX1 and SBP1). 

      Based on the IFAs 66% (n = 106 cells) of the cells in the TryThrA-TGD parasites had one or both of the observed phenotypes. We did not include this into the previous version of the manuscript because a description would have required detouring from the main focus of this results section. In addition, IFAs have some limitations for accurate quantifications, particularly for soluble pools (depending on fixing efficiency and agent, more or less of a soluble pool in the host cell can leak out). 

      To answer the request to better explain and quantify the phenotype and given the limitations of IFA, we now transfected the TryThrA-TGD parasites with a plasmid mediating episomal expression of SBP1-mCherry, permitting live cell imaging and a better classification of the Maurer’s clefts phenotype. Due to the two SLI modifications in these parasites (using up 4 resistance markers) we had to use a new selection marker (mutated lactate transporter PfFNT, providing resistance to BH267.meta (Walloch et al., J. Med. Chem. 2020 (PMID: 32816478))) to transfect these parasites with an additional plasmid. 

      These results are now provided as Fig. S8 and detailed in the last results section. The new data shows that the majority of the TryThrA-TGD parasites contain a dispersed pool of SBP1 in the host cell. About a third of the parasites also showed disproportionally strong SBP1 foci that may be aggregates of the Maurer’s clefts. We also transfected the EMPIC3-TGD parasites with the FNT plasmid mediating episomal SBP1-mCherry expression and observed only few cells with a cytoplasmic pool or aggregates (Fig. S8). Overall these findings agree with the previous IFA results. As the IFA suggests similar results also for REX1 and PfEMP1, this defect is likely not SBP1 specific but more general (Maurer’s clefts morphology; association or transport of multiple proteins to the Maurer’s clefts). This gives a likely explanation for the cytoadherence phenotype in the TryThrA-TGD parasites. The reason for the EMPIC3-TGD phenotype remains to be determined as we did not detect obvious changes of the Maurer’s clefts morphology or in the transport of proteins to these structures in these experiments. 

      Minor comments

      (1) Italicized numbers in parenthesis are present in several places in the manuscript but it is not clear what these refer to (perhaps differently formatted citations from a previous version of the manuscript). Figure 1 legend: (121); Figure S3 legend: (110), (111); Figure S6 legend: (66); etc.

      We thank the reviewer for pointing out this issue with the references, this was amended.

      (2) Figure 5A and legend: "BSD-R: BSD-resistance gene". Blasticidin-S (BS) is the drug while Blasticidin-S deaminase (BSD) is the resistance gene.

      We thank the reviewer for pointing this out, the legend and figure were changed.

      (3) Figure 5E legend: µ-SBP1-N should be α-SBP1-N.

      This was amended.

      (4) Figure S5 legend: "(Full data in Table S1)" should be Table S3.

      This was amended.

      (5) Figure S1G: The pie chart shows PF3D7_0425700 accounts for 43% of rif expression in 3D7var0425800 but the text indicates 62%.

      We apologize for this mistake, the text was corrected. We also improved the citations to Fig. S1G and H in this section.

      (6) "most PfEMP1-trafficking proteins show a similar early expression..." The authors might consider including a table of proteins known to be required for EMP1 trafficking and a graph showing their expression timing. Are any with later expressions known?

      Most exported proteins are expressed early, which is nicely shown in Marti et al 2004 (cited for the statement) in a graph of the expression timing of all PEXEL proteins (Fig. 4B in that paper). PNEPs also have a similar profile (Grüring et al 2011, also cited for that statement), further illustrated by using early expression as a criterion to find more PNEPs (Heiber et al., 2013 (PMID: 23950716)). Together this includes most if not all of the known PfEMP1 trafficking proteins. The originally co-submitted paper (Blancke-Soares & Stäcker et al., eLife preprint doi.org/10.7554/eLife.103633.1) analysed several later expressed exported proteins (Pf332, MSRP6) but their disruption, while influencing Maurer’s clefts morphology and anchoring, did not influence PfEMP1 transport. However, there are some conflicting results for Pf332 (referenced in Blancke-Soares & Stäcker et al). This illustrates that it may not be so easy to decide which proteins are bona fide PfEMP1 trafficking proteins. We therefore did not add a table and hope it is acceptable for the reader to rely on the 3 provided references to back this statement.

      (7)  Figure S1J: The predominate var in the IT4 WT parent is var66 (which appears to be syntenic with Pf3D7_0809100, the predominate var in the 3D7 WT parent). Is there something about this locus or parasite culture conditions that selects for these vars in culture? Is this observed in other labs as well?

      This is a very interesting point (although we are not certain these vars are indeed syntenic, they are on different chromosomes). As far as we know, at least Pf3D7_0809100 is commonly a dominant var transcribed in other labs and was also found expressed in sporozoites (Zanghì et al. Cell Rep. 2018). However, it is unclear how uniform this really is. For IT4 we do not know in full, but here too we have commonly observed centromeric var genes to be the dominating transcripts in unselected parasite cultures. It is possible that transcription drifts to centromeric var genes in cultured parasites. However, given the anecdotal evidence, it is unknown to what extent this is related to an inherent switching and regulation regimen or is a consequence of faulty regulation following prolonged culturing.

      (8) Figure 4B, C: Presumably the asterisks on the DNA gels indicate non-specific bands but this is not described in the legend. Why are non-specific bands not consistent between parent and integrated lanes?

      We apologize for not mentioning this in the legend, this was amended.

      It is not clear why the non-specific bands differ between the lines but in part this might be due to different concentrations and quality of DNA preps. A PCR can also behave differently depending on whether the correct primer target is present or not. If present, the PCR will run efficiently and other spurious products will be outcompeted, but in absence of the correct target, they might become detectable.  

      Overall, we do not think the non-specific bands indicate anything untoward with the lines. For instance, in Fig. 4B the high band in the 5’ integration in the IT4 line (which does not occur anywhere else) can’t be due to a genomic change, as this is the parental line and does not contain the plasmid for integration. In the same gel, the ori locus band of incorrect size (likely due to cross-reaction of the primers with another var gene, which is not always fully avoidable due to the high similarity of the ATS region) is present in both the parent IT4 and the integrant line and is therefore also not of concern. In C there are a couple of bands of incorrect size in the integration line. One of these is very faint and both are too large; they are therefore likely other vars that are inefficiently picked up by these primers. The reason they are not seen in the parent line is that the correct primer binding site is present there and efficiently yields a product that outcompetes products from non-optimally matching primer sites; these spurious products hence only appear in the integration line, where the correct match is no longer present. For these reasons we believe these bands are not of any concern.

      (9) Figure 4C: Is there a reason KAHRP was used as a co-marker for the IFA detecting IT4var19 expression instead of SBP1 which was used throughout the rest of the study?

      This is a coincidence, as this line was tested when other lines were tested for KAHRP. As there were foci in the host cell, we were satisfied that the HA-tagged PfEMP1 is produced, and the localization was deemed plausible.

      (10) Figure 6: Streptavidin labeling for the IT4var01-BirA position 3 line is substantially less than the other two lines in both IFA and WB. Does the position 3 fusion reduce PfEMP1 protein levels or is this a result of the context or surface display of the fusion? Interestingly, the position 3 trypsin cleavage product appears consistently more robust compared with the other two configurations. Does this indicate that positioning BirA upstream of the TM increases RBC membrane insertion and/or makes the surface localized protein more accessible to trypsin?

      It is possible that RBC membrane insertion or trypsin accessibility is increased for the position 3 construct. But there could also be other explanations:

      The reason for the more robustly detected protected fragment for the position 3 construct in the WB might also be its smaller size (in contrast to the other two versions, it does not contain BirA*) which might permit more efficient transfer to the WB membrane. In that case the more robust band might not (only) be due to better membrane insertion or better trypsin accessibility.

      The lower biotinylation signal with the position 3 construct might also be explained by the farther distance of BirA* to the ATS (compared to position 1 and 2), the region where interactors are expected to bind. The position 1 and 2 constructs may therefore generally be more efficient (as closer) to biotinylate ATS proximal proteins. Further, in the final destination (PfEMP1 inserted into the RBC membrane) BirA* would be on the other side of the membrane in the position 3 construct while in the position 1 and 2 constructs BirA* would be on the side of the membrane where the ATS anchors PfEMP1 in the knob structure. In that case, labelling with position 3 would come from interactions/proximities during transport or at the Maurer’s clefts (if there indeed PfEMP1 is not membrane embedded) and might therefore be less.

      Hence, while alterations in trypsin accessibility and RBC membrane insertion are possible explanations, other explanations exist. At present, we do not know which of these explanations apply and therefore did not mention any of them in the manuscript. 

      Reviewer #3 (Recommendations for the authors):

      (1) In the abstract and on page 8, the authors mention that they generate cell lines binding to "all major endothelial receptors" and "all known major receptors". This is a pretty all-encompassing statement that might not be fully accepted by others who have reported binding to other receptors not considered in this paper (e.g. VCAM, TSP, hyaluronic acid, etc). It would be better to change this statement to something like "the most common endothelial receptors" or "the dominant endothelial receptors", or something similar.

      We agree with the reviewer that these statements are too all-encompassing and changed them to “the most common endothelial receptors” (introduction) and “the most common receptors” (results).

      (2) The authors targeted two rif genes for activation and in each case the gene became the most highly expressed member of the family. However, unlike var genes, there were other rif genes also expressed in these lines and the activated copy did not always make up the majority of rif mRNAs. The authors might wish to highlight that this is inconsistent with mutually exclusive expression of this gene family, something that has been discussed in the past but not definitively shown.

      We thank the reviewer for highlighting this, we now added the following statement to this section: “While SLI-activation of rif genes also led to the dominant expression of the targeted rif gene, other rif genes still took up a substantial proportion of all detected rif transcripts, speaking against a mutually exclusive expression in the manner seen with var genes.”

      (3) In Figure 6, H-J, the authors display volcano plots showing proteins that are thought to interact with PfEMP1. These are labeled with names from the literature, however, several are named simply "1, 2, 3, 4, 5, or 6". What do these numbers stand for?

      We apologize for not clarifying this and thank the reviewer for pointing this out. There is a legend for the numbered proteins in what is now Table S4 (previously Table S3). We have now amended the legend of Figure 6 to explain the numbers and to point the reader to Table S4 for the accessions.

    1. A reader may not have experienced similar life circumstances as yours, but that doesn’t mean the reader won’t be able to identify emotionally with what you and your characters go through. Human strife is human strife.

      very important to keep in mind because sometimes we think that our personal experiences aren't relatable.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      Colorectal cancer (CRC) is the third most common cancer globally and the second leading cause of cancer-related deaths. Colonoscopy and fecal immunohistochemical testing are among the early diagnostic tools that have significantly enhanced patient survival rates in CRC. Methylation dysregulation has been identified in the earliest stages of CRC, offering a promising avenue for screening, prediction, and diagnosis. The manuscript entitled "Early Diagnosis and Prognostic Prediction of Colorectal Cancer through Plasma Methylation Regions" by Zhu et al. presents that a panel of genes with methylation pattern derived from cfDNA (27 DMRs), serving as a noninvasive detection method for CRC early diagnosis and prognosis.

      Strengths:

      The authors provided evidence that the 27 DMRs pattern worked well in predicting CRC distant metastasis, and the methylation score remarkably increased in stage III-IV.

      Weaknesses:

      The major concerns are the design of DMR screening, the relatively low sensitivity of this DMR pattern in detecting early-stage CRC, the limited size of the cohorts, and the lack of comparison with the traditional diagnosis test.

      We sincerely thank the reviewer for their thorough evaluation and constructive feedback on our manuscript. We are encouraged that the reviewer found our 27-DMR panel promising for predicting distant metastasis and for its performance in late-stage CRC. We have carefully considered the weaknesses pointed out and have made revisions to address these concerns, which we believe have significantly strengthened our paper.

      We agree with the reviewer that achieving high sensitivity for early-stage disease is the ultimate goal for any noninvasive screening test. Detecting the minute quantities of cfDNA shed from early-stage tumors is a well-recognized challenge in the field. Although the sensitivity of our current panel for early-stage CRC is modest, its core strengths lie in its capability to also detect advanced adenomas and its excellent performance in assessing CRC metastasis and prognosis. Furthermore, we have now added a direct comparative analysis of our 27-DMR panel against the most widely used clinical serum biomarker for CRC, carcinoembryonic antigen (CEA), using samples from the same patient cohorts. Our results demonstrate that the 27-DMR methylation score significantly outperforms CEA in diagnostic accuracy for early-stage CRC (64% vs. 18%) (Table S7). In the Discussion section, we have also acknowledged our limitations and suggest that future studies are warranted to combine the cfDNA methylation model with commonly used clinical markers, such as CEA and CA19-9, with the aim of improving the sensitivity for early diagnosis.

      We acknowledge the reviewer's concern regarding the cohort size and agree that validation in larger, prospective, multi-center cohorts is essential before this panel can be considered for clinical application. We have explicitly stated this as a limitation of our study in the Discussion section and have highlighted the need for future large-scale validation studies (Page 18, Lines 367-373). We once again thank the reviewer for their insightful comments, which have allowed us to substantially improve our manuscript. We hope that the revised version is now suitable for publication.

      Reviewer #2 (Public review):

      This work presents a 27-region DMR model for early diagnosis and prognostic prediction of colorectal cancer using plasma methylation markers. While this non-invasive diagnostic and prognostic tool could interest a broad readership, several critical issues require attention.

      Major Concerns:

      (1) Inconsistencies and clarity issues in data presentation

      (a) Sample size discrepancies

      The abstract mentions screening 119 CRC tissue samples, while Figure 1 shows 136 tissues. Please clarify if this represents 119 CRC and 17 normal samples.

      We sincerely thank the reviewer for this careful observation and for pointing out the inconsistency. We apologize for the error and the confusion it caused. Regarding Figure 1: The reviewer is correct. The number 136 in the original Figure 1 was an error. This was due to an inadvertent double-counting of the tumor samples that were used in the differential analysis against adjacent normal tissues. The actual number of tissue samples used in this analysis is 89. We have now corrected this value in the revised Figure 1.

      Regarding the Abstract: The 119 CRC tissue samples mentioned in the abstract represent the total number of tissue samples analyzed across all stages of our study. This number is composed of two cohorts: the initial 15 pairs of tissues (30 samples) used for preliminary screening, and the subsequent 89 tissue samples used for validation, totaling 119 samples. We have ensured all sample numbers are now consistent throughout the revised manuscript.

      The plasma sample numbers vary across sections: the abstract cites 161 samples, Figure 1 shows 116 samples, and the Supplementary Methods mentions 77 samples (13 Normal, 15 NAA, 12 AA, 37 CRC).

      We sincerely thank the reviewer for their meticulous review and for identifying these inconsistencies in the plasma sample numbers. We apologize for this oversight and the lack of clarity.

      Figure 1 & Supplementary Methods (77 samples): The number 116 in the original Figure 1 was a clerical error. The correct number is 77, which is the cohort used for our differential methylation analysis. This number is now consistent with the Supplementary Methods. This cohort is composed of 13 Normal, 15 NAA, 12 AA, and 37 CRC samples. The figure has been revised accordingly.

      Abstract (161 samples): The total of 161 plasma samples mentioned in the abstract is the sum of two distinct sample sets used for different stages of our analysis: the 77 samples (13 Normal, 15 NAA, 12 AA, 37 CRC) used for the differential analysis, and an additional 84 samples (33 Normal, 51 CRC), which served as the training set for the LASSO regression model. We have now clarified these distinctions in the text and ensured consistency across the abstract, figures, and methods sections.
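
      The plasma sample counts stated above can be checked for internal consistency with a short tally. The following is only a minimal sketch: the counts are copied from the response, while the variable and function names are illustrative assumptions and not part of the original analysis.

```python
# Plasma sample counts as stated in the author response.
# All names below are illustrative, not taken from the authors' code.
differential_cohort = {"Normal": 13, "NAA": 15, "AA": 12, "CRC": 37}
lasso_training_set = {"Normal": 33, "CRC": 51}


def cohort_total(cohort):
    """Return the total number of samples in a cohort."""
    return sum(cohort.values())


differential_total = cohort_total(differential_cohort)
training_total = cohort_total(lasso_training_set)
plasma_total = differential_total + training_total

print("Differential analysis cohort:", differential_total)
print("LASSO training set:", training_total)
print("Total plasma samples:", plasma_total)
```

      The three totals correspond to the 77, 84, and 161 samples quoted for the differential analysis, the LASSO training set, and the abstract, respectively.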

      (b) Methodological inconsistencies

      The Supplementary Material reports 477 hypermethylated sites from TCGA data analysis (Δβ>0.20, FDR<0.05), but Figure 1 indicates 499 sites.

      The manuscript states that analyzing TCGA data across six cancer types identified 499 CRC-specific methylation sites, yet Figure 1 shows 477. Please also explain the rationale for selecting these specific cancer types from TCGA.

      We sincerely thank the reviewer for their sharp observation and for highlighting these inconsistencies. We apologize for this clerical error, which occurred when labeling the figure. The numbers 477 and 499 in Figure 1 were inadvertently swapped and the text in Supplementary Material is correct. We have now corrected this error throughout the manuscript to ensure clarity and consistency. We deeply regret the confusion this has caused.

      Regarding the rationale for selecting the cancer types:

      The selection of colorectal, esophageal, gastric, lung, liver, and breast cancers was based on the following strategic criteria to ensure the stringent identification of CRC-specific markers. Firstly, esophageal, gastric, liver, and colorectal cancers all originate from the gastrointestinal tract and share developmental and functional similarities. Comparing CRC against these closely related cancers allowed us to filter out general GI-tract-related methylation patterns and isolate those that are truly unique to colorectal tissue. Secondly, we included lung and breast cancer as they are two of the most common non-GI malignancies worldwide with distinct tissue origins. This helps ensure our identified markers are not just pan-cancer methylation events but are specific to CRC, even when compared against highly prevalent cancers from different lineages. Finally, these six cancer types have some of the largest and most complete datasets available in the TCGA database, including high-quality methylation data. This provided a robust statistical foundation for a reliable cross-cancer comparison. We hope this explanation clarifies our methodology. Thank you again for your valuable feedback.

      "404 CRC-specific DMRs" mentioned in the main text while "404 MCBs" in Figure 1, the authors need to clarify if these terms are interchangeable or how MCBs are defined.

      We sincerely thank the reviewer for pointing out this important inconsistency in terminology. We apologize for the confusion this has caused and for the error in Figure 1. The two terms are closely related in our study. The final 404 markers are technically DMRs that were identified through an analysis of MCBs. To avoid confusion, we have decided to unify the terminology. The manuscript has now been revised to consistently use "DMRs", which is the most accurate final descriptor. The label in Figure 1 has been corrected accordingly.

      (2) Methodological documentation

      The Results section requires a more detailed description of marker identification procedures and justification of methodological choices.

      Figure 3 panels need reordering for sequential citation.

      We thank the reviewer for this valuable suggestion. We agree that the original Results section lacked sufficient detail regarding the marker identification procedures and the justification for our methodological choices. To address this, we have substantially rewritten the "Methylation markers selection" subsection. The revised section provides a clear, step-by-step narrative of our marker discovery and integrates the specific methodological details and statistical criteria. For instance, we now explicitly describe the three-pronged approach for the initial TCGA data mining, the specific criteria (Δβ, FDR, log2FC) for each prong, and the analysis methods, such as the Wilcoxon test and LASSO regression. We believe this detailed narrative now provides the necessary description and justification for our methodological choices directly within the Results, significantly improving the clarity and logical flow of our manuscript. This revision can be found on Pages 9-11 (Lines 180-195, 202-213). We hope these changes fully address the reviewer's concerns.

      We thank the reviewer for pointing out the citation order of the panels in Figure 3. This was a helpful suggestion for improving the clarity of our manuscript. We have now reordered the panels in Figure 3 to ensure they are cited sequentially within the text. These adjustments have been made in the "Development and validation of the CRC diagnosis model" subsection of the Results (Page 11, lines 224-230). We appreciate the reviewer's attention to detail.

      (3) Quality control and data transparency

      No quality control metrics are presented for the in-house sequencing data (e.g., sequencing quality, alignment rate, BS conversion rate, coverage, PCA plots for each cohort).

      The analysis code should be publicly available through GitHub or Zenodo.

      At a minimum, processed data should be made publicly accessible to ensure reproducibility.

      We sincerely thank the reviewer for their valuable and constructive feedback regarding quality control and data transparency. We fully agree that these elements are crucial for ensuring the robustness and reproducibility of our research. As the reviewer suggested, we have made all processed data and the key quality control metrics for each sample including sequencing quality scores, bisulfite (BS) conversion rates, and sequencing coverage publicly available to ensure the reproducibility of our findings. The analysis was performed using standard algorithms as detailed in the Methods section. While we are unable to host the code in a public repository at this time, all analysis scripts are available from the corresponding author upon reasonable request. The data has been deposited in the National Genomics Data Center (NGDC) and is accessible under the accession number OMIX009128. This information is now clearly stated in the "Data and Code Availability" section of the manuscript. We thank the reviewer again for pushing us to improve our manuscript in this critical aspect.

      Reviewer #3 (Public review):

      Summary:

      This article provides a model for early diagnosis and prognostic prediction of Colorectal Cancer and demonstrates its accuracy and usability. However, there are still some minor issues that need to be revised and paid attention to.

      Strengths:

      A large number of external datasets were used for verification, thus demonstrating robustness and accuracy. Meanwhile, various influencing factors of multiple samples were taken into account, providing usability.

      Weaknesses:

      There are notable language issues that hinder readability, and some key conclusions are not clearly stated.

      We are very grateful to the reviewer for their positive assessment of our study and for the constructive feedback provided. We are particularly encouraged that the reviewer recognized the strengths of our work, especially the robustness demonstrated through extensive external validation and the practical usability of our model. Regarding the weaknesses, we have taken the comments very seriously and have thoroughly revised the manuscript. We sincerely apologize for the language issues that hindered readability in our initial submission. To address this, the entire manuscript has undergone a comprehensive round of professional language polishing and editing. We have carefully reviewed and revised the text to improve clarity, flow, and grammatical accuracy. Besides, we agree that the conclusions could be stated more explicitly. To rectify this, we have substantially revised the final paragraph of the Discussion and the Conclusion section (Page 14-18, lines 279-305, 319-334, 346-348, 358-360, 367-379). We now more clearly summarize the main findings of our study, emphasize the clinical significance and potential applications of our model, and provide clear take-home messages. We thank you again for your time and insightful comments, which have been invaluable in improving the quality of our paper. We hope the revised manuscript now meets the standards for publication.

      Reviewer #1 (Recommendations for the authors):

      Detailed comments are outlined below:

      (1) In this study, the authors have highlighted methylated cfDNA as a noninvasive approach for CRC early diagnosis. However, the small size of the cohorts for plasma screening, particularly the number of NAA and AA samples, may cause bias in the selection of DMRs. This bias may lead to inappropriate DMRs for early diagnosis. Furthermore, a similar issue applies to the training set, which had a high percentage of late-stage CRC and included no AA or NAA samples. This absence may be the key factor in screening for altered methylated cfDNA that can predict the early stages of CRC.

      We are very grateful to the reviewer for this insightful methodological critique. We agree that cohort composition and sample size are critical factors in the development of robust biomarkers, and we appreciate the opportunity to clarify our study design and the interpretation of our results.

      We agree with the reviewer that the number of precancerous lesion samples (NAA and AA) in our initial plasma screening cohort was limited. This is a valid point. However, it is important to contextualize the role of this step within our overall multi-stage marker selection funnel. The markers evaluated in this plasma cohort were not discovered from this small sample set alone. They were the result of a rigorous pre-selection process based on large-scale public TCGA data and our own tissue-level sequencing. This robust, tissue-based validation ensured that only the most promising CRC-specific markers were advanced for plasma testing. Therefore, while the plasma cohort was modest in size, its purpose was to confirm the circulatory detectability of markers already known to have a strong tissue-of-origin signal, thereby mitigating the potential bias from a smaller discovery set.

      Our primary aim was to first build a model that could robustly and accurately identify a definitive cancer-specific methylation signal. By training the model on clear-cut invasive cancer cases versus healthy controls, we could isolate the most powerful and specific markers for established malignancy. Our working hypothesis was that these strong cancer-specific methylation patterns are initiated during the precursor stages and would therefore be detectable, albeit at lower levels, in precancerous lesions.  Unfortunately, the panel could only identify a limited proportion of precancerous lesions (48.4% in the NAA group and 52.2% in the AA group). We fully agree with the reviewer's sentiment that including a larger and more balanced set of precancerous lesions in future training cohorts could potentially optimize a model specifically for adenoma detection. We have now explicitly added this point to our Discussion section, highlighting it as an important direction for future research (Page 18, lines 367-373).

      (2) The sensitivities of the 27 DMRs in the external validation set (48.4%, 52.2%, and 66.7% for NAA, AA, and CRC 0-II, respectively) were much lower than those of previously published studies, such as the ColonES assay (DOI: 10.1016/j.eclinm.2022.101717) and the ColonSecure test (DOI: 10.1186/s12943-023-01866-z). The 27 DMRs from the layered screening process did not show superior performance in a small population of an external validation cohort. Therefore, it is unlikely that this DMR pattern will be applicable to the general population in the future.

      We sincerely thank the reviewer for their insightful comments and for providing a thorough comparison with the highly relevant ColonES and ColonSecure assays. This has given us an important opportunity to clarify the unique contributions and specific clinical applications of our 27-DMR panel.

      We acknowledge the reviewer's point that the sensitivities of our panel for precancerous lesions (NAA: 48.4%, AA: 52.2%), while substantial, are numerically lower than those reported by the excellent ColonES assay (AA: 79.0%). However, it is important to clarify that while the ColonES and ColonSecure tests are outstanding benchmarks designed primarily for early detection and screening, the primary objective and contribution of our study were slightly different. Our model demonstrated an exceptional ability to predict distant metastasis with an AUC of 0.955 and a strong capacity for predicting overall prognosis with an AUC of 0.867. Our goal was to develop a multi-functional, biologically-rooted biomarker panel that not only contributes to early detection but, more importantly, provides crucial information for post-diagnosis patient management, including staging, risk stratification, and prognostication, from a single preoperative sample. We believe this ability to preoperatively identify high-risk patients who may require more aggressive treatment or intensive surveillance is the key contribution of our work. It provides a distinct clinical utility that complements, rather than directly competes with, pure screening assays.

      We agree with the reviewer that our external validation was performed on a limited cohort, and we have acknowledged this as a limitation in our Discussion section. However, the purpose of this validation was to provide a proof-of-concept for the panel's performance across its multiple functions. The promising and exceptionally high-performing results in the prognostic domain strongly warrant further validation in larger, prospective, multi-center cohorts.

      (3) The 27 DMRs pattern worked well in predicting CRC distant metastasis, and the methylation score remarkably increased in stage III-IV. In contrast, the increase of AA and 0-II groups was very mild in the validation cohort. This observation raises concerns regarding the study design, particularly in the context of the layered screening process and sample assigning.

      We sincerely thank the reviewer for this insightful and critical comment. We agree with the reviewer's observation that the methylation score increased more remarkably in late-stage (III-IV) CRC compared to the milder increase in adenoma (AA) and early-stage (0-II) CRC in the validation cohort. However, the observed pattern is biologically plausible and consistent with the nature of colorectal cancer progression. Carcinogenesis is a multi-step process involving the gradual accumulation of genetic and epigenetic alterations. The methylation changes we identified are likely associated with tumor progression and metastasis. Therefore, it is expected that advanced, metastatic cancers (Stage III-IV), which have undergone significant biological changes, would exhibit a much stronger and more robust methylation signal compared to pre-cancerous lesions (adenomas) or early-stage, non-metastatic cancers (Stage 0-II). The "mild" increase in early stages reflects the initial, more subtle epigenetic alterations, while the "remarkable" increase in late stages reflects the extensive changes required for invasion and metastasis. We believe this graduated increase actually strengthens the validity of our methylation signature, as it mirrors the underlying biological progression of the disease. We hope this response and the corresponding revisions address the reviewer's comments.

      (4) The authors did not provide the 27 DMRs prediction efficacy comparison with other noninvasive CRC assays, like a CEA and a FIT test.

      Thank you for this valuable suggestion. We agree that comparing our model with established non-invasive assays is crucial for demonstrating its clinical potential. Following your advice, we have now included a direct comparison of the diagnostic performance between our model and the traditional tumor marker, carcinoembryonic antigen (CEA), using the external validation cohort. The results show that our model has a significantly higher sensitivity for detecting early-stage colorectal cancer and adenomas compared to CEA. This detailed comparison has been added as Table S7 in the supplementary materials, and the corresponding description has been incorporated into the Results section of our manuscript (Page 12, lines 234-236). Regarding the Fecal Immunochemical Test (FIT), we unfortunately could not perform a direct statistical comparison because very few individuals in our cohort had undergone FIT. A comparison based on such a small sample size would lack statistical power and might not yield meaningful conclusions. We have acknowledged this as a limitation of our study in the Discussion section. We believe these additions and clarifications have substantially strengthened our manuscript. Thank you again for your constructive feedback.

      (5) The authors did not explicitly describe how they assigned the plasma samples to the distinct sets, nor did they specify the criteria for the plasma screen set, training set, and validation set. The detailed information for the patient grouping should be listed.

      Response: Thank you for this essential feedback. We agree that a transparent and detailed description of the sample allocation process is crucial for the manuscript. We apologize for the previous lack of clarity and have now revised the Methods section to address this. Our patient cohorts were assigned to the screening, training, and validation sets based on a chronological splitting strategy. Specifically, samples were allocated based on the date of collection in a consecutive manner. This approach was chosen to minimize selection bias and to provide a more realistic, forward-looking assessment of the model's performance, simulating a prospective validation scenario. The screening set comprised 89 tissue samples and 77 plasma samples collected between June and December 2020. The primary purpose of this set was the initial discovery and screening of potential methylation markers. The training set and validation set included 165 plasma samples collected from December 2020 to July 2022. The external validation cohort comprised 166 plasma samples collected from July 2022 to December 2022. The subsection titled "Study design and samples" within the Methods section of the revised manuscript now contains all of this detailed information (Page 6, lines 116-133). We believe this detailed explanation now makes our study design clear and transparent. Thank you again for helping us improve our manuscript.
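
      The chronological allocation described in this response can be sketched as follows. This is only a minimal illustration, not the authors' code: the response gives collection periods by month, so the exact cut-off days, the function names, and the sample identifiers below are all assumptions made for the example.

```python
from datetime import date

# Hypothetical cut-off dates: the response names only months, not exact days.
SCREENING_END = date(2020, 12, 1)
TRAINING_END = date(2022, 7, 1)


def assign_cohort(collected: date) -> str:
    """Assign a sample to a cohort using only its collection date."""
    if collected < SCREENING_END:
        return "screening"
    if collected < TRAINING_END:
        return "training_validation"
    return "external_validation"


def split_chronologically(samples):
    """Group (sample_id, collection_date) pairs into cohorts, oldest first."""
    cohorts = {
        "screening": [],
        "training_validation": [],
        "external_validation": [],
    }
    # Sorting by date keeps each cohort in consecutive collection order.
    for sample_id, collected in sorted(samples, key=lambda s: s[1]):
        cohorts[assign_cohort(collected)].append(sample_id)
    return cohorts
```

      Because assignment depends only on the collection date, no outcome or label information can influence which cohort a sample lands in, which is the property the response appeals to when it describes the split as simulating prospective validation.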

      Reviewer #2 (Recommendations for the authors):

      The manuscript requires significant language editing to improve clarity and readability. We recommend that the authors seek professional editing services for revision.

      Thank you for your constructive comments on the language of our manuscript. We apologize for any lack of clarity in the previous version. To address this, we have performed a thorough revision of the manuscript. The text has been carefully reviewed and edited by a native English-speaking colleague who is an expert in our research field. We have focused on correcting all grammatical errors, improving sentence structure, and refining the phrasing throughout the document to enhance readability. We are confident that these extensive revisions have significantly improved the clarity of the manuscript. We hope you will find the current version much easier to read and understand.

      Reviewer #3 (Recommendations for the authors):

      (1) However, I think the abstract part of the article is too detailed and should be more concise and shortened. It is not necessary to show detailed values but to summarize the results.

      Thank you for this valuable suggestion. We agree that the previous version of the abstract was overly detailed and that a more concise summary would be more effective for the reader. Following your advice, we have substantially revised the abstract. We have removed the specific numerical values (such as detailed statistics) and have instead focused on summarizing the key findings and their broader implications (Page 3, lines 54-60, 64-66, 70-72). The revised abstract is now shorter and provides a clearer, high-level overview of our study's background, methods, main results, and conclusions. We believe these changes have significantly improved its readability and impact. We hope you will find the current version more appropriate.

      (2) Figure 4, the color in the legend and plot are not the same, and should be revised.

      Thank you for your careful attention to detail and for pointing out the color inconsistency in Figure 4. We apologize for this oversight. We have now corrected the figure as you suggested, ensuring that the colors in the legend perfectly match those in the plot. The revised Figure 4 has been updated in the manuscript. We appreciate your help in improving the quality of our figures.

      (3) Please pay attention to the article format, such as the consistency of fonts and punctuation marks (for example, Lines 75 and 230).

      Thank you for your meticulous review and for pointing out the inconsistencies in our manuscript's formatting. We sincerely apologize for these oversights and any inconvenience they may have caused. Following your feedback, we have carefully corrected the specific issues you highlighted. Furthermore, we have conducted a thorough proofread of the entire manuscript to ensure consistency in all fonts, punctuation marks, and overall adherence to the journal's formatting guidelines. We appreciate your help in improving the presentation and professionalism of our paper.

    1. Author response:

      (1) General Statements

      We thank the Reviewers for a fair review of our work and helpful suggestions. We have significantly revised the manuscript in response to these suggestions. We provide a point-by-point response to the Reviewers below but wanted to highlight in our response a recurring concern related to the strong cell cycle arrest observed upon the acute FAM53C knock-down being different than the limited phenotypes in other contexts, including the knockout mice and DepMap data.

      First, we now show that we can recapitulate the strong G1 arrest resulting from the FAM53C knock-down using two independent siRNAs in RPE-1 cells, supporting the specificity of the effects.

      Second, the G1 arrest that results from the FAM53C knock-down is also observed in cells with inactive p53, suggesting it is not due to a non-specific stress response due to “toxic” siRNAs. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype.

      Third, we have performed experiments in other human cells, including cancer cell lines. As would be expected for cancer cells, the G1 arrest is less pronounced but is still significant, indicating that the G1 arrest is not unique to RPE-1 cells.

      Fourth, it is not unexpected that compensatory mechanisms would be activated upon loss of FAM53C during development or in cancer – which may explain the lack of phenotypes in vivo or upon long-term knockout. This has been true for many cell cycle regulators, either because of compensation by other family members that have overlapping functions, or by a larger scale rewiring of signaling pathways. 

      (2) Point-by-point description of the revisions

      Reviewer #1 (Evidence, reproducibility and clarity): 

      Summary: 

      Taylar Hammond and colleagues identified new regulators of the G1/S transition of the cell cycle.

      They did so by screening publicly available data from the Cancer Dependency Map, and identified FAM53C as a positive regulator of the G1/S transition. Using biochemical assays they then show that FAM53C interacts with the DYRK1A kinase to inhibit its function. DYRK1A in turn is known to induce degradation of cyclin D, leading the authors to propose a model in which DYRK1A-dependent cyclin D degradation is inhibited by FAM53C to permit S-phase entry. Finally the authors assess the effect of FAM53C deletion in a cortical organoid model, and in Fam53c knockout mice. Whereas proliferation of the organoids is indeed inhibited, mice show virtually no phenotype.

      Major comments: 

      The authors show convincing evidence that FAM53C loss can reduce S-phase entry in cell cultures, and that it can bind to DYRK1A. However, FAM53C has multiple other binding partners and I am not entirely convinced that negative regulation of DYRK1A is the predominant mechanism to explain its effects on S-phase entry. Some of the claims that are made based on the biochemical assays, and on the physiological effects of FAM53C, are overstated. In addition, some choices made in methodology and data representation need further attention.

      (1) The authors do note that P21 levels increase upon FAM53C knockdown. They show convincing evidence that this is not a P53-dependent response. But the claim that "p21 upregulation alone cannot explain the G1 arrest in FAM53C-deficient cells" (lines 138-139) is misleading. A p53-independent p21 response could still be highly relevant. The authors could test if FAM53C knockdown inhibits proliferation after p21 knockdown or p21 deletion in RPE1 cells.

      The Reviewer raises a great point. Our initial statement needed to be clarified and also needed more experimental support. We have performed experiments where we knocked down FAM53C and p21 individually, as well as in combination, in RPE-1 cells. These experiments show that p21 knock-down is not sufficient to negate the cell cycle arrest resulting from the FAM53C knockdown in RPE-1 cells (Figure 4B,C and Figure S4C,D).

      We have now extended these experiments to conditions where we inhibited DYRK1A, and we also compared these data to experiments in p53-null RPE-1 cells. Altogether, these experiments point to activation of p53 downstream of DYRK1A activation upon FAM53C knock-down, and indicate that p21 is not the only critical p53 target in the cell cycle arrest observed in FAM53C knock-down cells (Figure 4 and Figure S4).

      (2) The authors do not convincingly show that FAM53C acts as a DYRK1A inhibitor in cells. Figures 4B+C and S4B+C show extremely faint P-CycD1 bands, and tiny differences in ratios. The P values are hovering around 0.05, so n=3 is clearly underpowered here. Total CycD1 levels also correlate with FAM53C levels, which seems to affect the ratios more than the tiny pCycD1 bands. Why is there still a pCycD1 band visible in 4B in the GFP + BTZ + DYRK1Ai condition? And if I look at the data points I honestly don't understand how the authors can conclude from S4C that knockdown of FAM53C causes (DYRK1A-dependent) increases in pCycD1 (relative to total CycD1). In Figure 5C, no blot scans are even shown, and again the differences look tiny. So the authors should either find a way to make these assays more robust, or alter their claims appropriately.

      We appreciate these comments from the Reviewer and have significantly revised the manuscript to address them.

      The analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knock-down, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We removed previous panel 4B from the revised manuscript. For panels 4E and S4B (now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      The representative Western blot images for 5C-D (now 5F-G) in the original submission are shown in Figure 5E; we apologize if this was not clear. The differences are small, which we acknowledge in the revised manuscript. Note that several factors can affect Cyclin D levels in cells, including the growth rate and the stage of the cell cycle. Our FACS analysis shows that normal organoids have ~63% of cells in G1 and ~13% in S phase; the overall lower proportion of S-phase cells in organoids may make the immunoblot difference appear smaller, with fewer cycling cells resulting in decreased Cyclin D phosphorylation.

      Nevertheless, the Reviewer brings up a good point, and comments from this Reviewer and the others made us re-think how to best interpret our results. As discussed above, we carefully re-read the Meyer paper and think that FAM53C’s role and DYRK1A activity in cells may be understood when considering levels of both CycD and p21 at the same time in a continuum. While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is likely that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      (3) The experiments to test if DYRK1A inhibition could rescue the G1 arrest observed upon FAM53C knockdown are not entirely convincing either. It would be much more convincing if they also perform cell counting experiments as they have done in Figures 1F and 1G, to complement the flow cytometry assays. I suggest that the authors do these cell counting experiments in RPE1 +/- P53 cells as well as HCT116 cells. In addition, did the authors test if P21 is induced by DYRK1Ai in HCT116 cells? 

      We repeated the experiments with the DYRK1A inhibitor and counted the cells. In p53-null RPE1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide. Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells.

      (4) The data in Figure 5C and 5D are identical, although they are supposed to represent either pCycD1 ratios or p21 levels. This is a problem because at least one of the two cannot be true. Please provide the proper data and show (representative) images of both data types.

      We apologize for these duplicated panels in the original submission. We now replaced the wrong panel with the correct data (Fig. 5F,G). 

      (5) Line 246: "Fam53c knockout mice display developmental and behavioral defects." I don't agree with this claim. The mutant mice are born at almost the expected Mendelian ratios, and body weight development is not consistently altered. But more importantly, no differences in adult survival or microscopic pathology were seen. The authors put strong emphasis on the IMPC behavioral analysis, but they should be more cautious. The IMPC mouse cohorts are tested for many other phenotypes related to behavior and neurological symptoms and apparently none of these other traits were changed in the IMPC Fam53c-/- cohort. Thus, the decreased exploration in a new environment could very well be a chance finding. The authors need to remove claims about developmental and behavioral defects from the abstract, results and discussion sections; the data are just too weak to justify this.

      We agree with the Reviewer that, although we observed significant p-values, this original statement may not be appropriate in the biological sense. We made sure in the revised manuscript to carefully present these data.

      Minor comments: 

      (6) Can the authors provide a rationale for each of the proteins they chose to generate the list of the 38 proteins in the DepMap analysis? I looked at the list and it seems to me that they do not all have described functions in the G1/S transition. The analysis may thus be biased. 

      To address this point, we updated Table S1 (2nd tab) to provide a better rationale for the 38 factors chosen. Our focus was on the canonical RB pathway and we included RB binding proteins whose function had suggested they may also be playing a role in the G1/S transition. We do agree that there is some bias in this selection (e.g., there are more RB binding factors described) but we hope the Reviewer will agree with us that this list and the subsequent analysis identified expected factors, including FAM53C. Future studies using this approach and others will certainly identify new regulators of cell cycle progression.

      (7) Figure 1B is confusing to me. Are these just some (arbitrarily) chosen examples? Consider leaving this heatmap out altogether, or explain in more detail.

      We agree with the Reviewer that this panel was not necessarily useful and possibly in the wrong place, and we removed it from the manuscript. We replaced it with a cartoon of top hits in the screen.

      (8) The y-axes in Figures 2C, 2D, 2E, and 4D are misleading because they do not start at 0. Please let the axis start at 0, or make axis breaks. 

      We re-graphed these panels.

      (9) Line 229: " Consequences ... brain development." This subheader is misleading, because the in vitro cortical organoid system is a rather simplistic model for brain development, and far away from physiological brain development. Please alter the header. 

      We changed the header to “Consequences of FAM53C inactivation in human cortical organoids in culture”.

      (10) Figure S5F: the gating strategy is not clear to me. In particular, how do the authors know the difference between subG1 and G1 DAPI signals? Do they interpret the subG1 as apoptotic cells? If yes, why are there so many? Are the culturing or harvesting conditions of these organoids suboptimal? Perhaps the authors could consider doing IF stainings on EdU or BrdU on paraffin sections of organoids to obtain cleaner data?

      Thank you for your feedback. The subG1 population in the original Figure S5F represents cells that died during the dissociation step of the organoids for FACS analysis. To address this point, we performed live/dead staining to exclude dead cells and provide clearer data. We refined the gating strategy for better clarity in the new S5F panel.

      (11) Figure S6A; the labeling seems incorrect. I would think that red is heterozygous here, and grey mutant. 

      We fixed this mistake, thank you. 

      Reviewer #1 (Significance): 

      The finding that the poorly studied gene FAM53C controls the G1/S transition in cell lines is novel and interesting for the cell cycle field. However, the lack of phenotypes in Fam53c-/- mice makes this finding less interesting for a broader audience. Furthermore, the mechanisms are incompletely dissected. The importance of a p53-independent induction of p21 is not ruled out. And while the direct inhibitory interaction between FAM53C and DYRK1A is convincing (and also reported by others; PMID: 37802655), the authors do not (yet) convincingly show that DYRK1A inhibition can rescue a cell proliferation defect in FAM53C-deficient cells.

      Altogether, this study can be of interest to basic researchers in the cell cycle field. 

      I am a cell biologist studying cell cycle fate decisions, and adaptation of cancer cells & stem cells to (drug-induced) stress. My technical expertise aligns well with the work presented throughout this paper, although I am not familiar with biolayer interferometry. 

      Reviewer #2 (Evidence, reproducibility and clarity): 

      Summary 

      In this study Hammond et al. investigated the role of Dual-specificity Tyrosine Phosphorylation regulated Kinase 1A (DYRK1) in G1/S transition. By exploiting the Dependency Map portal, they identified a previously unexplored protein FAM53C as a potential regulator of G1/S transition. Using RNAi, they confirmed that depletion of FAM53C suppressed proliferation of human RPE1 cells and that this phenotype was dependent on the presence of the RB protein. In addition, they noted increased levels of CDKN1A transcript and p21 protein that could explain G1 arrest of FAM53C-depleted cells but surprisingly, they did not observe activation of other p53 target genes. Proteomic analysis identified DYRK1 as one of the main interactors of FAM53C and the interaction was confirmed in vitro. Further, they showed that purified FAM53C blocked the ability of DYRK1 to phosphorylate cyclin D in vitro although the activity of DYRK1 was likely not inhibited (judging from the modification of FAM53C itself). Instead, it seems more likely that FAM53C competes with cyclin D in this assay. Authors claim that the G1 arrest caused by depletion of FAM53C was rescued by inhibition of DYRK1 but this was true only in cells lacking functional p53. This is quite confusing as DYRK1 inhibition reduced the fraction of G1 cells in p53 wild type cells as well as in p53 knock-outs, suggesting that FAM53C may not be required for regulation of DYRK1 function. Instead of focusing on the impact of FAM53C on cell cycle progression, authors moved towards investigating its potential (and perhaps more complex) roles in differentiation of IPSCs into cortical organoids and in mice. They observed a lower level of proliferating cells in the organoids but whether that reflects an increased activity of DYRK1 or is just an off-target effect of the genetic manipulation remains unclear. Even less clear is the phenotype in FAM53C knock-out mice. Authors did not observe any significant changes in survival nor in organ development but they noted some behavioral differences. Whether and how these are connected to the rate of cellular proliferation was not explored. In summary, the study identified a previously unknown role of FAM53C in proliferation but failed to explain the mechanism and its physiological relevance at the level of tissues and organism. Although some of the data might be of interest, in current form the data is too preliminary to justify publication.

      Major points 

      (1) Whole study is based on one siRNA to Fam53C and its specificity was not validated. Level of the knock down was shown only in the first figure and not in the other experiments. The observed phenotypes in the cell cycle progression may be affected by variable knock-down efficiency and/or potential off target effects. 

      We thank the Reviewer for raising this important point. First, we need to clarify that our experiments were performed with a pool of siRNAs (not one siRNA). Second, commercial antibodies against FAM53C are not of the best quality and it has been challenging to detect FAM53C using these antibodies in our hands – the results are often variable. In addition, to better address the Reviewer’s point and control for the phenotypes we have observed, we performed two additional series of experiments: first, we have confirmed G1 arrest in RPE-1 cells with individual siRNAs, providing more confidence for the specificity of this arrest (Fig. S1B); second, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (Fig. S1E,F and Fig. 4F).

      (2) Experiments focusing on the cell cycle progression were done in a single cell line RPE1 that showed a strong sensitivity to FAM53C depletion. In contrast, phenotypes in IPSCs and in mice were only mild suggesting that there might be large differences across various cell types in the expression and function of FAM53C. Therefore, it is important to reproduce the observations in other cell types. 

      As mentioned above, we have new data indicating that other cell lines arrest in G1 upon FAM53C knock-down (three cancer cell lines) (Fig. S1E,F and Fig. 4F).

      (3) Authors state that FAM53C is a direct inhibitor of DYRK1A kinase activity (Line 203), however this model is not supported by the data in Fig 4A. FAM53C seems to be a good substrate of DYRK1 even at high concentrations when phosphorylation of cyclin D is reduced. It rather suggests that DYRK1 is not inhibited by FAM53C but perhaps FAM53C competes with cyclin D. Further, authors should address if the phosphorylation of cyclin D is responsible for the observed cell cycle phenotype. Is this Cyclin D-Thr286 phosphorylation, or are there other sites involved? 

      We revised the text of the manuscript to include the possibility that FAM53C could act as a competitive substrate and/or an inhibitor.

      We removed most of the Cyclin D phosphorylation/stability data from the revised manuscript. As the Reviewers pointed out, some of these data were statistically significant but the biological effects were small. As discussed above in our response to Reviewer #1, the analysis of Cyclin D phosphorylation and stability is complicated by the upregulation of p21 upon FAM53C knockdown, in particular because p21 can be part of Cyclin D complexes, which may affect its protein levels in cells (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). Instead of focusing on Cyclin D levels and stability, we refocused the manuscript on RB and p53 downstream of FAM53C loss.

      We note, however, that we used specific Thr286 phospho-antibodies, which have been used extensively in the field. Our data in Figure 1 with palbociclib place FAM53C upstream of Cyclin D/CDK4,6. We performed Cyclin D overexpression experiments but RPE-1 cells did not tolerate high expression of Cyclin D1 (T286A mutant) and we have not been able to conduct more ‘genetic’ studies. 

      (4) At many places, information on statistical tests is missing and SDs are not shown in the plots. For instance, what statistics was used in Fig 4C? Impact of FAM53C on cyclin D phosphorylation does not seem to be significant. In the same experiment, does DYRK1 inhibitor prevent modification of cyclin D? 

      As discussed above, we removed some of these data and re-focused the manuscript on p53-p21 as a second pathway activated by loss of FAM53C.

      (5) Validation of SM13797 compound in terms of specificity to DYRK1 was not performed. 

      This is an important point. We had cited an abstract from the company (Biosplice) but we agree that providing data is critical. We have now revised the manuscript with a new analysis of the compound’s specificity using kinase assays. These data are shown in Fig. S3F-H.

      (6) A fraction of cells in G1 is a very easy readout but it does not measure progression through the G1 phase. Extension of the S phase or G2 delay would indirectly also result in reduction of the G1 fraction. Instead, authors could measure the dynamics of entry to S phase in cells released from a G1 block or from mitotic shake off. 

      The Reviewer made a good point. As discussed in our response to Reviewer #1, with p53-null RPE-1 cells, we found that cell numbers do not increase in these conditions where we had observed a cell cycle re-entry (Fig. 4E), which was accompanied by apoptotic cell death (Fig. S4I). Thus, cells re-enter the cell cycle but die as they progress through S-phase and G2/M. We note that inhibition of DYRK1A has been shown to decrease expression of G2/M regulators (PMID: 38839871), which may contribute to the inability of cells treated with DYRK1Ai to divide.

      Because our data in RPE-1 cells showed that p21 knock-down was not sufficient to allow the FAM53C knock-down cells to re-enter the cell cycle, we did not further analyze p21 in HCT-116 cells. These data indicate that G1 entry by flow cytometry will not always translate into proliferation.

      Other points:

      (7) Fig. 2C, 2D, 2E graphs should begin with 0 

      We remade these graphs.

      (8) Fig. 5D shows that the difference in p21 levels is not significant in FAM53C-KO cells but difference is mentioned in the text. 

      We replaced the panel by the correct panel; we apologize for this error.

      (9) Fig. 6D comparison of datasets of extremely different sizes does not seem to be appropriate

      We agree and revised the text. We hope that the Reviewer will agree with us that it is worth showing these data, which are clearly preliminary but provide evidence of a possible role for FAM53C in the brain.

      (10) Could there be alternative splicing in mice generating a partially functional protein without exon 4? Did authors confirm that the animal model does not express FAM53C? 

      We performed RNA sequencing of mouse embryonic fibroblasts derived from control and mutant mice. We clearly identified fewer reads in exon 4 in the knockout cells, and no other obvious change in the transcript (data not shown). However, immunoblot with mouse cells for FAM53C never worked well in our hands. We made sure to add this caveat to the revised manuscript.

      Reviewer #2 (Significance): 

      Main problem of this study is that the advanced experimental models in IPSCs and mice did not confirm the observations in the cell lines and thus the whole manuscript does not hold together. Although I acknowledge the effort the authors invested in these experiments, the data do not contribute to the main conclusion of the paper that FAM53C/DYRK1 regulates G1/S transition. 

      Reviewer #3 (Evidence, reproducibility and clarity): 

      This paper identifies FAM53C as a novel regulator of cell cycle progression, particularly at the G1/S transition, by inhibiting DYRK1A. Using data from the Cancer Dependency Map, the authors suggest that FAM53C acts upstream of the Cyclin D-CDK4/6-RB axis by inhibiting DYRK1A.  Specifically, their experiments suggest that FAM53C Knockdown induces G1 arrest in cells, reducing proliferation without triggering apoptosis. DYRK1A Inhibition rescues G1 arrest in P53KO cells, suggesting FAM53C normally suppresses DYRK1A activity. Mass Spectrometry and biochemical assays confirm that FAM53C directly interacts with and inhibits DYRK1A. FAM53C Knockout in Human Cortical Organoids and Mice leads to cell cycle defects, growth impairments, and behavioral changes, reinforcing its biological importance. 

      Strength of the paper: 

      The study introduces a novel cell cycle control signalling module upstream of CDK4/6 in G1/S regulation which could have significant impact. The identification of FAM53C using a depmap correlation analysis is a nice example of the power of this dataset. The experiments are carried out mostly in a convincing manner and support the conclusions of the manuscript. 

      Critique: 

      (1) The experiments rely heavily on siRNA transfections without the appropriate controls. There are so many cases of off-target effects of siRNA in the literature, and specifically for a strong phenotype on S-phase as described here, I would expect to see solid results by additional experiments. This is especially important since the ko mice do not show any significant developmental cell cycle phenotypes. Moreover, FAM53C does not show a strong fitness effect in the depmap dataset, suggesting that it is largely non-essential in most cancer cell lines. For this paper to reach publication in a high-standard journal, I would expect that the authors show a rescue of the S-phase phenotype using an siRNA-resistant cDNA, and show similar S-phase defects using an acute knock out approach with lentiviral gRNA/Cas9 delivery. 

      We thank the Reviewer for this comment. Please refer to the initial response to the three Reviewers, where we discuss our use of single siRNAs and our results in multiple cell lines. Briefly, we can recapitulate the G1 arrest upon FAM53C knock-down using two independent siRNAs in RPE-1 cells. We also observe the same G1 arrest in p53 knockout cells, suggesting it is not due to a non-specific stress response. In addition, the arrest is dependent on RB, which fits with the genetic and biochemical data placing FAM53C upstream of RB, further supporting a specific phenotype. Human cancer cell lines also arrest in G1 upon FAM53C knock-down, not just RPE-1 cells. Finally, we hope the Reviewer will agree with us that compensatory mechanisms are very common in the cell cycle – which may explain the lack of phenotypes in vivo or upon long-term knockout of FAM53C.

      (2) The S-phase phenotype following FAM53C should be demonstrated in a larger variety of TP53WT and mutant cell lines. Given that this paper introduces a new G1/S control element, I think this is important for credibility. Ideally, this should be done with acute gRNA/Cas9 gene deletion using a lentiviral delivery system; but if the siRNA rescue experiments work and validate an on-target effect, siRNA would be an appropriate alternative. 

      We now show data with three cancer cell lines (U2OS, A549, and HCT-116 – Fig. S1E,F and Fig. 4F), in addition to our results in RPE-1 cells and in human cortical organoids. We note that the knock-down experiments are complemented by overexpression data (Fig. 1G-I), by genetic data (our original DepMap screen), and our biochemical data (showing direct binding of FAM53C to DYRK1A).

      (3) The western blot images shown in the MS appear heavily over-processed and saturated (See for example S4B, 4A, B, and E). Perhaps the authors should provide the original un-processed data of the entire gels? 

      For several of our panels (e.g., 4E and S4B, now panels S3J and S3K), we used a true “immunoassay” (as indicated in the legend – not an immunoblot), which is much more quantitative and avoids error-prone steps in standard immunoblots (“Western blots”). Briefly, this system was developed by ProteinSimple. It uses capillary transfer of proteins and ELISA-like quantification with up to 6 logs of dynamic range (see their web site https://www.proteinsimple.com/wes.html). The “bands” we show are just a representation of the luminescence signals in capillaries. We made sure to further clarify the figure legends in the revised manuscript.

      Data in 4A are also not a western blot but a radiograph.

      For immunoblots, we will provide all the source data with uncropped blots with the final submission.

      (4) A critical experiment for the proposed mechanism is the rescue of the FAM53C S-phase reduction using DYRK1A inhibition shown in Figure 4. The legend here states that the data were extracted from BrdU incorporation assays, but in Figure S4D only the PI histograms are shown, and the S-phase population is not quantified. The authors should show the BrdU scatterplot and quantify the phenotype using the S-phase population in these plots. G1 measurements from PI histograms are not precise enough to allow for conclusions. Also, why are the intensities of the PI peaks so variable in these plots? Compare, for example, the HCT116 upper and lower panels where the siRNA appears to have caused an increase in ploidy. 

      We apologize for the confusion and have fixed these errors. For most of the analyses, we used PI to measure G1 and S-phase entry. We added relevant flow cytometry plots to supplemental figures (Fig. S1G, H, I, as well as Fig. S4E and S4K, and Fig. S5F).

      (5) There's an apparent contradiction in how RB deletion rescues the G1 arrest (Figure 2) while p21 seems to maintain the arrest even when DYRK1A is inhibited. Is p21 not induced when FAM53C is depleted in RB ko cells? This should be measured and discussed. 

      This comment and comments from the two other Reviewers made us reconsider our model. We carefully re-read the Meyer paper and think that DYRK1A activity may be understood when considering levels of both CycD and p21 at the same time in a continuum (as was nicely shown in a previous study from the lab of Tobias Meyer – Chen et al., Mol Cell, 2013). While our genetic and biochemical data support a role for FAM53C in DYRK1A inhibition, it is obvious that the regulation of cell cycle progression by FAM53C is not exclusively due to this inhibition. As discussed above and below, we noted an upregulation of p21 upon FAM53C knock-down, and activation of p53 and its targets likely contributes significantly to the phenotypes observed. We added new experiments to support this more complex model (Figure 4 and Figure S4, with new model in S4L).

      Reviewer #3 (Significance): 

      In conclusion, I believe that this MS could potentially be important for the cell cycle field and also provide a new target pathway that could be relevant for cancer therapy. However, the paper has quite a few gaps and inconsistencies that need to be addressed with further experiments. My main worry is that the acute depletion phenotypes appear so strong, while the gene is nonessential in mice and shows only a minor fitness effect in the depmap screens. More convincing controls are necessary to rule out experimental artefacts that misguide the interpretation of the results.

      We appreciate this comment and hope that the Reviewer will agree it is still important to share our data with the field, even if the phenotypes in mice are modest.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment:

      This valuable study examines how mammals descend effectively and securely along vertical substrates. The conclusions from comparative analyses based on behavioral data and morphological measurements collected from 21 species across a wide range of taxa are convincing, making the work of interest to all biologists studying animal locomotion.

      We would like to greatly thank the two reviewers for their time in reviewing this work, and for their valuable comments and suggestions that will help to improve this manuscript.

      Overall, we agree with the weaknesses raised, which are mainly areas for consideration in future studies: to study more species, and in a natural habitat context.

      We will nevertheless add a few modifications to improve the manuscript, notably by making certain figures more readable, and adding definitions and bibliography in the main text concerning gait characteristics.

      We also provide brief comments on each point of weakness raised by the reviewers below, in blue.

      Reviewer #1 (Public review):

      Summary:

      This unique study reports original and extensive behavioral data collected by the authors on 21 living mammal taxa in zoo conditions (primates, tree shrew, rodents, carnivorans, and marsupials) on how descent along a vertical substrate can be done effectively and securely using gait variables. Ten morphological variables reflecting head size and limb proportions are examined in relationship to vertical descent strategies and then applied to reconstruct modes of vertical descent in fossil mammals.

      Strengths:

      This is a broad and data-rich comparative study, which requires a good understanding of the mammal groups being compared and how they are interrelated, the kinematic variables that underlie the locomotion used by the animals during vertical descent, and the morphological variables that are associated with vertical descent styles. Thankfully, the study presents data in a cogent way with clear hypotheses at the beginning, followed by results and a discussion that addresses each of those hypotheses using the relevant behavioral and morphological variables, always keeping in mind the relationships of the mammal groups under investigation. As pointed out in the study, there is a clear phylogenetic signal associated with vertical descent style. Strepsirrhine primates much prefer descending tail first, platyrrhine primates descend sideways when given a choice, whereas all other mammals (with the exception of the raccoon) descend head first. Not surprisingly, all mammals descending a vertical substrate do so in a more deliberate way, by reducing speed, and by keeping the limbs in contact for a longer period (i.e., higher duty factors).

      Weaknesses:

      The different gait patterns used by mammals during vertical descent are a bit more difficult to interpret. It is somewhat paradoxical that asymmetrical gaits such as bounds, half bounds, and gallops are more common during descent since they are associated with higher speeds and lower duty factors. Also, the arguments about the limb support polygons provided by DSDC vs. LSDC gaits apply for horizontal substrates, but perhaps not as much for vertical substrates.

      We analyzed gait patterns using methods commonly found in the literature and discussed our results accordingly. However, the study of limb support polygons was indeed developed specifically for studying locomotion on horizontal supports, and may not be applicable for studying vertical locomotion, which is in fact a type of locomotion shared by all arboreal species. In the future, it would be interesting to consider new methods for analyzing vertical gaits.

      The importance of body mass cannot be overemphasized as it affects all aspects of an animal's biology. In this case, larger mammals with larger heads avoid descending head-first. Variation in trunk/tail and limb proportions also covaries with different vertical descent strategies. For example, a lower intermembral index is associated with tail-first descent. That said, the authors are quick to acknowledge that the five lemur species of their sample are driving this correlation. There is a wide range of intermembral indices among primates, and this simple measure of forelimb over hindlimb has vital functional implications for locomotion: primates with relatively long hindlimbs tend to emphasize leaping, primates with more even limb proportions are typically pronograde quadrupeds, and primates with relatively long forelimbs tend to emphasize suspensory locomotion and brachiation. Equally important is the fact that the intermembral index has been shown to increase with body mass in many primate families as a way to keep functional equivalence for (ascending) climbing behavior (see Jungers, 1985). Therefore, the manner in which a primate descends a vertical substrate may just be a by-product of limb proportions that evolved for different locomotor purposes. Clearly, more vertical descent data within a wider array of primate intermembral indices would clarify these relationships. Similarly, vertical descent data for other primate groups with longer tails, such as arboreal cercopithecoids, and particularly atelines with very long and prehensile tails, should provide more insights into the relationship between longer tail length and tail-first descent observed in the five lemurs. The relatively longer hallux of lemurs correlates with tail-first descent, whereas the more evenly grasping autopods of platyrrhines allow for all four limbs to be used for sideways descent. In that context, the pygmy loris offers a striking contrast. 
Here is a small primate equipped with four pincer-like, highly grasping autopods and a tail reduced to a short stub. Interestingly, this primate is unique within the sample in showing the strongest preference for head-first descent, just like other non-primate mammals. Again, a wider sample of primates should go a long way in clarifying the morphological and behavioral relationships reported in this study.

      We agree with this statement. In the future, we plan to study other species, particularly large-bodied ones with varied intermembral indexes.

      Reconstruction of the ancient lifestyles, including preferred locomotor behaviors, is a formidable task that requires careful documentation of strong form-function relationships from extant species that can be used as analogs to infer behavior in extinct species. The fossil record offers challenges of its own, as complete and undistorted skulls and postcranial skeletons are rare occurrences. When more complete remains are available, the entire evidence should be considered to reconstruct the adaptive profile of a fossil species rather than a single ("magic") trait.

      We completely agree with this, and we would like to emphasize that our intention here was simply to conduct a modest inference test, the purpose of which is to provide food for thought for future studies, and whose results should be considered in light of a comprehensive evolutionary model.

      Reviewer #2 (Public review):

      Summary:

      This paper contains kinematic analyses of a large comparative sample of small to medium-sized arboreal mammals (n = 21 species) traveling on near-vertical arboreal supports of varying diameter. This data is paired with morphological measures from the extant sample to reconstruct potential behaviors in a selection of fossil euarchontaglires. This research is valuable to anyone working in mammal locomotion and primate evolution.

      Strengths:

      The experimental data collection methods align with best research practices in this field and are presented with enough detail to allow for reproducibility of the study as well as comparison with similar datasets. The four predictions in the introduction are well aligned with the design of the study to allow for hypothesis testing. Behaviors are well described and documented, and Figure 1 does an excellent job in conveying the variety of locomotor behaviors observed in this sample. I think the authors took an interesting and unique angle by considering the influence of encephalization quotient on descent and the experience of forward pitch in animals with very large heads.

      Weaknesses:

      The authors acknowledge the challenges that are inherent with working with captive animals in enclosures and how that might influence observed behaviors compared to these species' wild counterparts. The number of individuals per species in this sample is low; however, this is consistent with the majority of experimental papers in this area of research because of the difficulties in attaining larger sample sizes.

      Yes, that is indeed the main cost/benefit trade-off with this type of study. Working with captive animals allows for large comparative studies, but there is a risk of variations in locomotor behavior among individuals in the natural environment, as well as few individuals per species in the dataset. That is why we plan and encourage colleagues to conduct studies in the natural environment to compare with these results. However, this type of study is very time-consuming and requires focusing on a single species at a time, which limits the comparative aspect.

      Figure 2 is difficult to interpret because of the large amount of information it is trying to convey.

      We agree that this figure is dense. One possible solution would be to combine species by phylogenetic groups to reduce the amount of information, as we did with Fig. 3 on the dataset relating to gaits. However, we believe that this would be unfortunate in the case of speed and duty factor because we would have to provide the complete figure in SI anyway, as the species-level information is valuable. We therefore prefer to keep this comprehensive figure here and we will enlarge the data points to improve their visibility, and provide the figure with a sufficiently high resolution to allow zooming in on the details.

      Reviewer #1 (Recommendations for the authors):

      As indicated in the first section above, this is a strong comparative study that addresses important questions, relative to the evolution of arboreal locomotion in primates and close mammal relatives. My recommendations should be taken in the context of improving a manuscript that is already generally acceptable.

      (1) The terms symmetrical and asymmetrical gaits should be briefly defined in the main text (not just in the Methods section) by citing work done by Hildebrand and other relevant studies. To that effect, the statement on lines 96-97 about the convergence of symmetrical gaits is unclear. What does "Symmetrical gaits have evolved convergently in rodents, scandentians, carnivorans, and marsupials" mean? Symmetrical gaits such as the walk, run, trot, etc., are pretty much the norm in most mammals and were likely found in metatherians and basal eutherians. This needs clarification. On line 239, the term "ambling" is used in the context of related asymmetrical gaits. To be clear, the amble is a type of running gait involving no whole-body aerial phase and is therefore a symmetrical gait (see Schmitt et al., 2006).

      We have added a definition of the terms symmetrical and asymmetrical gaits and added references in the introduction such as: “Symmetrical gaits are defined as locomotor patterns in which the footfalls of a girdle (a pair of fore- or hindlimbs) are evenly spaced in time, with the right and left limbs of a pair of limbs being approximately 50% out of phase with each other (Hildebrand, 1966, 1967). Symmetrical gaits can be further divided into two types: diagonal-sequence gaits, in which a hindlimb footfall is followed by that of the contralateral forelimb, and lateral-sequence gaits, in which a hindlimb footfall is followed by that of the ipsilateral forelimb (Hildebrand, 1967; Shapiro and Raichlen, 2005; Cartmill et al., 2007b). In contrast, asymmetrical gaits are characterized by unevenly spaced footfalls within a girdle, with the right and left limbs moving in near synchrony (Hildebrand, 1977).” Now found in lines 87-94.
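      The definitions quoted above can be sketched in code. The following is a minimal illustration only, not the authors' analysis pipeline: the function name `classify_gait`, the input format (touchdown times expressed as fractions of one stride cycle), and the 0.1 tolerance around the 50% phase are all assumptions made for the example.

```python
def classify_gait(touchdowns, tolerance=0.1):
    """Classify one stride from limb touchdown times.

    touchdowns: dict with keys "LH", "RH", "LF", "RF" (left/right hind- and
    forelimb), each a touchdown time given as a fraction of the stride (0 to 1).
    tolerance: hypothetical allowed deviation from a 50% left-right phase.
    """
    def lag(later, earlier):
        # Delay from one touchdown to another, wrapped into the stride cycle.
        return (touchdowns[later] - touchdowns[earlier]) % 1.0

    hind_phase = lag("RH", "LH")
    fore_phase = lag("RF", "LF")

    # Symmetrical: left and right limbs of each pair are about 50% out of phase.
    symmetrical = (abs(hind_phase - 0.5) <= tolerance
                   and abs(fore_phase - 0.5) <= tolerance)
    if not symmetrical:
        return "asymmetrical"

    # Which forelimb lands first after the left hindlimb touches down?
    ipsilateral_lag = lag("LF", "LH")
    contralateral_lag = lag("RF", "LH")
    if ipsilateral_lag < contralateral_lag:
        return "lateral-sequence"
    return "diagonal-sequence"


# Hypothetical strides, for illustration only.
lateral_walk = {"LH": 0.0, "LF": 0.25, "RH": 0.5, "RF": 0.75}
diagonal_walk = {"LH": 0.0, "RF": 0.25, "RH": 0.5, "LF": 0.75}
bound = {"LH": 0.0, "RH": 0.02, "LF": 0.5, "RF": 0.52}

for name, stride in [("lateral walk", lateral_walk),
                     ("diagonal walk", diagonal_walk),
                     ("bound", bound)]:
    print(name, "->", classify_gait(stride))
```

      Gaits with simultaneous fore- and hindlimb footfalls (for example a trot or a pace) are boundary cases that this sketch does not treat separately.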

      We corrected the sentence such as “Symmetrical gaits are also common in rodents, scandentians, etc..” Now found in line 107.

      Thank you for pointing this out. We indeed did not use the right term to mention related asymmetrical gaits with increased duty factors. We removed the term “ambling” and the associated reference here. Now found in line 256.

      (2) Correlations are used in the paper to examine how brain mass scales with body mass. It is correct to assume that a correlation significantly different from 0 is indicative of allometry (in this case, positive). That said, lines are used in Figure S2 that go through the bivariate scatter plot. The vast majority of scaling studies rely on regression techniques to calculate and compare slopes, which are different statistically from correlations. In this case, a slope not significantly different from 1.0 would support the hypothesis of isometry based on geometric similarity (as brain mass and body mass are two volumes). The authors could refer to the work of Bob Martin and the 1985 edited book by Jungers and contributions therein. These studies should also be cited in the paper.

      Thank you for recommending this better-suited method. We replaced the correlations with major axis orthogonal regressions, as recommended by Martin and Barbour 1989. We found a positive slope for all species significantly different from 1 (0.36), indicating a negative allometry (we realized we were mistaken about the allometry terminology, initially reporting a “positive allometry” instead of a positive correlation).

      We corrected in the manuscript in the Results and Methods sections, and cited Martin and Barbour 1989 such as:

      “To ensure that the EQs of the different species studied are comparable and meaningful, we tested the allometry between the brain and body masses in our dataset following [84] and found a significant and positive slope for all species (major axis orthogonal regression on log transformed values: slope = 0.36, r² = 0.92, p = 5.0 × 10⁻¹²), indicating a negative allometry (r = 0.97, df = 19, p = 2.0 × 10⁻¹³), and similar allometric coefficients when restricting the analysis to phylogenetic groups (Fig. S2).” Now found in lines 289-298.

      - “To control that brain allometry is homogeneous among all phylogenetic groups, to be able to compare EQ between species, we computed major axis orthogonal regressions, following the recommendation of Martin and Barbour [84], between the Log transformed brain and body masses, over all species and by phylogenetic group using the sma package in R (Fig. S2).” Now found in lines 336-338.

      We also changed Figure S2 in Supplementary Information accordingly.
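      For readers unfamiliar with the approach, a major axis regression on log-transformed masses can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of the authors' R analysis: the function name `major_axis_fit` and the mass values are hypothetical, and the sketch returns only the slope and intercept. Deciding whether the slope differs significantly from 1 (isometry) additionally requires a confidence interval, which is omitted here.

```python
import math


def major_axis_fit(x, y):
    """Return (slope, intercept) of the major axis (orthogonal) regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample variances and covariance.
    s_xx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    s_yy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    if s_xy == 0:
        raise ValueError("major axis slope is undefined when the covariance is zero")
    # Slope of the first principal axis of the bivariate scatter.
    slope = (s_yy - s_xx + math.sqrt((s_yy - s_xx) ** 2 + 4 * s_xy ** 2)) / (2 * s_xy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical body and brain masses in grams, for illustration only
# (not values from the study).
body_mass_g = [25, 120, 480, 900, 2300, 5200]
brain_mass_g = [0.9, 1.7, 2.9, 3.4, 5.1, 6.6]

log_body = [math.log10(m) for m in body_mass_g]
log_brain = [math.log10(m) for m in brain_mass_g]

slope, intercept = major_axis_fit(log_body, log_brain)
print(f"major axis slope = {slope:.2f}, intercept = {intercept:.2f}")
# A slope below 1 on log-log axes corresponds to negative allometry.
```

      Unlike ordinary least squares, the major axis minimizes perpendicular distances to the line, so it does not treat body mass as error-free.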

      (3) Trunk length is used as the denominator for many of the indices used in the study. In this way, trunk length is considered to be a proxy for body size. There should be a demonstration that trunk length scales isometrically with body mass in all of the mammals compared. If not the case, some of the indices may not be directly comparable.

      We did not use trunk length as a proxy for body mass, but to compute geometric body proportions in order to test whether intrinsic body proportions could be related to vertical descent behaviors, namely the length of the tail and of the fore- and hindlimbs relative to the animal. We chose those indices to quantify the capability of limbs to act as levers or counterweights to rotate the animals for this specific question of vertical descent behavior. We therefore do not think that body mass allometry with respect to trunk length is relevant to compare these indices across species here. Also, we don’t expect that trunk length (which is a single dimension) would scale isometrically with body mass, which scales more as a volume.

      (4) Given the numerous comparisons done in this study, a Bonferroni correction method should be considered to mitigate type I error (accepting a false positive).

      We had already corrected all our statistical tests using the Benjamini-Hochberg method to control for false positives; see the SuppTables Excel file for the complete results of the statistical analyses. We chose this method over the Bonferroni correction because the more modern and balanced Benjamini-Hochberg procedure is better suited for analyses involving a large number of hypotheses.
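      The difference between the two corrections can be illustrated with a short sketch. It is a minimal example under stated assumptions: the p-values are hypothetical and the helper names `bonferroni` and `benjamini_hochberg` are introduced here for illustration; they are not the code used for the analyses in the SuppTables file.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each p by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure controlling the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, for illustration only (not taken from the study).
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
alpha = 0.05

bonf_hits = sum(p <= alpha for p in bonferroni(raw))
bh_hits = sum(p <= alpha for p in benjamini_hochberg(raw))
print("significant after Bonferroni:", bonf_hits)
print("significant after Benjamini-Hochberg:", bh_hits)
```

      Bonferroni controls the probability of any false positive across the whole family of tests, whereas Benjamini-Hochberg controls the expected proportion of false positives among the results declared significant, which is why it is less conservative when many hypotheses are tested.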

      (5) The terms "arm" and "leg" used in the main text and Table 1 are anatomically incorrect. Instead, the terms "forelimb" and "hindlimb" should be used as they include the length sum of the stylopod, zeugopod, and autopod.

      Indeed, thank you for pointing that out. We have corrected this error within the manuscript as well as in Figures 4 and S3.

      (6) On p. 14, the authors make the statement that the postcranial anatomy of Adapis and Notharctus remains undescribed. The authors should consult the work of Dagosto, Covert, Godinot and others.

      We did not state that the postcranial remains of Adapis and Notharctus have not been described. However, we were unfortunately unable to find published illustrations of the known postcranial elements that could be reliably used in this study. To avoid any misunderstanding, we revised the sentence as follows: “However, we could not find suitable illustrations of the known postcranial elements of these species in the literature that could be reliably incorporated into this study. Thus, we only included their reconstructed body mass and EQ,..”. Now found in lines 393-397.

      Reviewer #2 (Recommendations for the authors):

      (1) Line 65/69 - Perchalski et al. 2021 is a single-author publication, so no et al. or w/ colleagues.

      Indeed. This has been corrected in the manuscript, now found in lines 65 and 70.

      (2) Lines 96-98 - Is it appropriate to say that the use of symmetrical gaits are examples of convergent evolution? There's less burden of evidence to state that these are shared behaviors, rather than suggesting they independently evolved across all those groups.

      We agree with this and corrected the sentence as follows: “Symmetrical gaits are also common in rodents, scandentians, etc..” Now found in line 107.

      (3) Line 198 - I am confused by how to interpret (-16,36 %) compared to how other numbers are presented in the rest of the paragraph.

      To avoid confusion, we rephrased this sentence as follows: “In contrast, primates did not significantly reduce their speed compared to ascents when descending sideways or tail-first (Fig. 2A, SuppTables B).” Now found in lines 207-209.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      In this study, the authors aim to understand how Rhino, a chromatin protein essential for small RNA production in fruit flies, is initially recruited to specific regions of the genome. They propose that asymmetric arginine methylation of histones, particularly mediated by the enzyme DART4, plays a key role in defining the first genomic sites of Rhino localization. Using a combination of inducible expression systems, chromatin immunoprecipitation, and genetic knockdowns, the authors identify a new class of Rhino-bound loci, termed DART4 clusters, that may represent nascent or transitional piRNA clusters.

      Strengths:

      One of the main strengths of this work lies in its comprehensive use of genomic data to reveal a correlation between ADMA histones and Rhino enrichment at the border of known piRNA clusters. The use of both cultured cells and ovaries adds robustness to this observation. The knockdown of DART4 supports a role for H3R17me2a in shaping Rhino binding at a subset of genomic regions.

      Weaknesses:

      However, Rhino binding at, and piRNA production from, canonical piRNA clusters appears largely unaffected by DART4 depletion, and spreading of Rhino from ADMA-rich boundaries was not directly demonstrated. Therefore, while the correlation is clearly documented, further investigation would be needed to determine the functional requirement of these histone marks in piRNA cluster specification.

      The study identifies piRNA cluster-like regions called DART4 clusters. While the model proposes that DART4 clusters represent evolutionary precursors of mature piRNA clusters, the functional output of these clusters remains limited. Additional experiments could help clarify whether low-level piRNA production from these loci is sufficient to guide Piwi-dependent silencing.

      In summary, the authors present a well-executed study that raises intriguing hypotheses about the early chromatin context of piRNA cluster formation. The work will be of interest to researchers studying genome regulation, small RNA pathways, and the chromatin mechanisms of transposon control. It provides useful resources and new candidate loci for follow-up studies, while also highlighting the need for further functional validation to fully support the proposed model.

      We sincerely thank Reviewer #1 for the thoughtful and constructive summary of our work. We appreciate the reviewer’s recognition that our study provides a comprehensive analysis of the relationship between ADMA-histones and Rhino localization, and that it raises intriguing hypotheses about the early chromatin context of piRNA cluster formation.

      We fully agree with the reviewer that our data primarily demonstrate correlation between ADMA-histones and Rhino localization, rather than direct causation. In response, we have carefully revised the text throughout the manuscript to avoid overstatements implying causality (details provided below).

      We also acknowledge the reviewer’s important point that the functional requirement of ADMA-histones for piRNA cluster specification remains to be further established. We have now added a discussion of our experimental limitations (page 18).

      Overall, we have revised the manuscript to present our findings more cautiously and transparently, emphasizing that our data reveal a correlation between ADMA-histone marks and the initial localization of Rhino, rather than proving a direct mechanistic requirement. We thank the reviewer again for highlighting these important distinctions.

      Reviewer #2 (Public review):

      This study seeks to understand how the Rhino factor knows how to localize to specific transposon loci and to specific piRNA clusters to direct the correct formation of specialized heterochromatin that promotes piRNA biogenesis in the fly germline. In particular, these dual-strand piRNA clusters with names like 42AB, 38C, 80F, and 102F generate the bulk of ovarian piRNAs in the nurse cells of the fly ovary, but the evolutionary significance of these dual-strand piRNA clusters remains mysterious since triple null mutants of these dual-strand piRNA clusters still allow fly ovaries to develop and remain fertile. Nevertheless, mutants of Rhino and its interactors Deadlock, Cutoff, Kipferl and Moonshiner, etc., cause more piRNA loss beyond these dual-strand clusters and exhibit the phenotype of major female infertility, so the impact of proper assembly of Rhino, the RDC, Kipferl, etc. onto proper piRNA chromatin is an important and interesting biological question that is not fully understood.

      This study tries to first test ectopic expression of Rhino via engineering a Dox-inducible Rhino transgene in the OSC line that only expresses the primary Piwi pathway that reflects the natural single pathway expression of the follicle cells and is quite distinct from the nurse cell germline piRNA pathway that is promoted by Rhino, Moonshiner, etc. The authors present some compelling evidence that this ectopic Rhino expression in OSCs may reveal how Rhino can initiate de novo binding via ADMA histone marks, a feat that would be much more challenging to demonstrate in the germline where this epigenetically naïve state cannot be modeled since germ cell collapse would likely ensue. In the OSC, the authors have tested the knockdown of four of the 11 known Drosophila PRMTs (DARTs), and comparing to ectopic Rhino foci that they observe in HP1a knockdown (KD), they conclude DART1 and DART4 are the prime factors to study further in looking for disruption of ADMA histone marks. The authors also test KD of DART8 and CG17726 in OSCs, but in the fly, the authors test germline KD (GLKD) of DART4 only. They do not explain why the other DARTs are not tested by GLKD; the UAS-RNAi resources in Drosophila strain repositories should be very complete and should make reagents for these knockdowns accessible.

      The authors only characterize some particular ADMA marks of H3R17me2a as showing strong decrease after DART4 GLKD, and then they see some small subset of piRNA clusters go down in piRNA production as shown in Figure 6B and Figure 6F and Supplementary Figure 7. This small subset of DART4-dependent piRNA clusters does lose Rhino and Kipferl recruitment, which is an interesting result.

      However, the biggest issue with this study is the mystery that the set of the most prominent dual-strand piRNA clusters, 42AB, 38C, 80F, and 102F, are the prime genomic loci subjected to Rhino regulation, and they do not show any change in piRNA production in the GLKD of DART4. The authors bury this surprising negative result in Supplementary Figure 5E, but this is also evident in no decrease (actually an n.s. increase) in Rhino association in Figure 5D. Since these main piRNA clusters involve the RDC, Kipferl, Moonshiner, etc., and they do not change in ADMA status or lose piRNAs after DART4 GLKD, this poses a problem with the model in Figure 7C. In this study, there is only a GLKD of DART4 and no GLKD of the other DARTs in fly ovaries.

      One way the authors rationalize this peculiar exception is the argument that DART4 is only acting on evolutionarily "young" piRNA clusters like the bx, CG14629, and CG31612, but the lack of any change on the majority of other piRNA clusters in Figure 6F leaves open the unsatisfying concern that there is much functional redundancy remaining with other DARTs not being tested by GLKD in the fly that would have a bigger impact on the other main dual-strand piRNA clusters being regulated by Rhino and ADMA-histone marks.

      Also, the current data do not provide convincing enough support for the model in Figure 7C and the paper title of ADMA-histones being the key determinant in the fly ovary for Rhino recognition of the dual-strand piRNA clusters. Although much of this study's data is well constructed and presented, there remains a large gap in that no other DARTs were tested in GLKD that would show a big loss of piRNAs from the main dual-strand piRNA clusters of 42AB, 38C, 80F, and 102F, where Rhino has prominent spreading in these regions.

      As the manuscript currently stands, I do not think the authors present enough data to conclude that "ADMA-histones [As a Major new histone mark class] does play a crucial role in the initial recognition of dual-strand piRNA cluster regions by Rhino" because the data here mainly just show that a small subset of evolutionarily young piRNA clusters are strongly affected by GLKD of DART4. The authors could extensively revise the study to be much more specific in the title and conclusion that they have uncovered this very unique niche of a small subset of DART4-dependent piRNA clusters, but this niche finding may dampen the impact and significance of this study since other major dual-strand piRNA clusters do not change during DART4 GLKD, and the authors do not show data from GLKD of any other DARTs. The niche finding of just a small subset of DART4-dependent piRNA clusters might make another specialized genetics forum a more appropriate venue.

      We are deeply grateful to Reviewer #2 for the detailed and insightful review that carefully situates our study in the broader context of Rhino-mediated piRNA cluster regulation. We appreciate the reviewer’s recognition that our inducible Rhino expression system in OSCs provides a valuable model to explore de novo Rhino recruitment under a simplified chromatin environment.

      At the same time, we agree that the current data mainly support a role for DART4 in regulating a subset of evolutionarily young piRNA clusters, and do not demonstrate a requirement for ADMA-histones at the major dual-strand piRNA clusters such as 42AB or 38C. We have therefore revised the title and main conclusions to more accurately reflect the scope of our findings.

      We agree with the reviewer that functional redundancy among DARTs may explain why major dual-strand piRNA clusters are unaffected by DART4 GLKD. Indeed, we attempted GLKD of DART1, whose knockdown causes collapse of Rhino foci in OSCs. For DART1 GLKD, two approaches were possible:

      (1) Crossing the BDSC UAS-RNAi line (ID: 36891) with nos-GAL4.

      (2) Crossing the VDRC UAS-RNAi line (ID: 110391) with nos-GAL4 and UAS-Dcr2.

      The first approach was not feasible because the UAS-RNAi line always arrived dead (DOA) and could not be maintained in our laboratory. The second approach did not yield effective and stable knockdown (as follows).

      DART8 and CG17726 did not alter Rhino foci in OSC knockdown experiments; therefore, we did not attempt germline knockdown (GLKD) of these DARTs in the ovary.  We agree with the reviewer’s opinion that there are piRNA source loci where Rhino localization depends on DART1, and that simultaneous depletion of multiple DARTs may indeed reveal additional positive results because ADMA-histones such as H3R8me2a may be completely eliminated by the knockdown of multiple DARTs. At the same time, we note that many evolutionarily conserved piRNA clusters show a loss of ADMA accumulation compared with evolutionarily young piRNA clusters, with levels that are comparable to the background input in ChIP-seq reads. Therefore, conserved clusters such as 42AB and 38C may no longer be regulated by ADMA. Even if multiple DARTs function redundantly to regulate ADMA, it may be difficult to disrupt Rhino localization at such conserved piRNA clusters by depletion of DARTs. While disruption of Rhino localization at conserved clusters like 42AB and 38C may be challenging, we cannot exclude the possibility that DART depletion affects Rhino binding at less conserved piRNA clusters, where ADMA modification remains detectable. We added clarifications in the Discussion to acknowledge the potential redundancy with other DARTs and to note that further knockdown experiments in the germline will be necessary to test this model comprehensively (page 18).

      We appreciate the reviewer’s critical feedback, which has helped us refine the message and strengthen the interpretative balance of the paper.

      Reviewer #1 (Recommendations for the authors):

      In multiple places, the link between ADMA histones and Rhino recruitment is presented in terms that imply causality. Please revise these statements to reflect that, in most cases, the evidence supports correlation rather than direct functional necessity. Similarly, statements suggesting that ADMA histones promote Rhino spreading should be revised unless supported by direct evidence.

      We sincerely thank the reviewer for the insightful comments. We recognize that these suggestions are crucial for improving the manuscript, and we have revised it accordingly to address the concerns. The specific revisions we made are detailed below.

      (1) Page 1, line 14: The original sentence “in establishing the sites” was changed to “may establish the potential sites.”

      (2) Page 4, lines 11-12: The original sentence “genomic regions where Rhino binds at the ends and propagates in the areas in a DART4-dependent manner, but not stably anchored” was changed to “genomic regions that have ADMA-histones at their ends and exhibit broad Rhino spreading across their internal regions in a DART4-dependent manner”

      (3) Page 4, lines 12-15: The original sentence “Kipferl is present at the regions but not sufficient to stabilize Rhino-genomic binding after Rhino propagates.” was changed to “In contrast to authentic piRNA clusters, Kipferl was lost together with Rhino upon DART4 depletion in these regions, suggesting that Kipferl by itself is not sufficient to stabilize Rhino binding; rather, their localization depends on DART4.”

      (4) Page 4, lines 17-18: The original sentence “are considered to be primitive clusters” was changed to “might be nascent dual-strand piRNA source loci”.

      (5) Page 8, line 7: The original sentence “Involvement of ADMA-histones in the genomic localization of Rhino was implicated.” was changed to “Correlation of ADMA-histones in the genomic localization of Rhino was implicated.”

      (6) Page 8, lines 19-21: The original sentence “These results suggest that ADMA-histones, together with H3K9me3, contribute significantly and specifically to the recruitment of Rhino to the ends of dual-strand clusters in OSCs.” was changed to “These results raise the possibility that ADMA-histones, together with H3K9me3, may contribute specifically to the recruitment of Rhino to the ends of dual-strand clusters in OSCs.”

      (7) Page 10, lines 11-13: The original sentence “These results suggest that DART1 and DART4 are involved in Rhino recruitment at distinct genomic sites through the decreases in ADMA-histones in each of their KD conditions (H4R3me2a and H3R17me2a, respectively).” was changed to “These results suggest that DART1 and DART4 could contribute to Rhino recruitment at distinct genomic sites through the decreases in ADMA-histones in each of their KD conditions (H4R3me2a and H3R17me2a, respectively).”

      (8) Page 13, line 2: The original sentence “Genomic regions where Rhino spreads in a DART4-dependent manner, but not stably anchored, produce some piRNAs” was changed to “Genomic regions where Rhino binds broadly in a DART4-dependent manner, but not stably anchored, produce some piRNAs”

      (9) Page 13, lines 21-22: The original sentence “These results support the hypothesis that ADMA-histones are involved in the genomic binding of Rhino both before and after Rhino spreading, resulting in stable genome binding.” was changed to “These results raise the possibility that a subset of Rhino localized to genomic regions correlating with ADMA-histones may serve as origins of spreading.”

      (10) Page 16, lines 6-8: The original sentence “In this study, we took advantage of cultured OSCs for our analysis and found that chromatin marks (i.e., ADMA-histones) play a crucial role in the loading of Rhino onto the genome.” was changed to “In this study, we took advantage of cultured OSCs for our analysis and found that chromatin marks (i.e., bivalent nucleosomes containing H3K9me3 and ADMA-histones) appear to contribute to the initial loading of Rhino onto the genome.”

      (11) Page 16, line 12: The original sentence “We propose that the process of piRNA cluster formation begins with the initial loading of Rhino onto bivalent nucleosomes containing H3K9me3 and ADMA-histones (Fig. 7C). In OSCs, the absence of Kipferl and other necessary factors means that Rhino loading into the genome does not proceed to the next step.” was removed.

      Major points

      (1)  Clarify the limited colocalization between Rhino and H3K9me3 in OSCs. The observation that FLAG-Rhino foci show minimal overlap with H3K9me3 in OSCs appears inconsistent with the proposed model by the authors in the discussion, in which Rhino is initially recruited to bivalent nucleosomes bearing both H3K9me3 and ADMA marks. This discrepancy should be addressed. 

      We thank the reviewer for the insightful comments. Indeed, ChIP-seq shows that Rhino partially overlaps with H3K9me3 (Fig. 1F), but immunofluorescence did not reveal any detectable overlap (Fig. 1A). We interpret this discrepancy as arising from the fact that immunofluorescence primarily visualizes H3K9me3 foci that are localized as broad domains in the genome, such as those at centromeres, pericentromeres, or telomeres (named chromocenters), whereas the sharp and interspersed H3K9me3 signals along chromosome arms are difficult to detect by immunofluorescence. We have now included this explanation in the revised text (page 6).

      (2) Please indicate whether the FLAG-Rhino used in OSCs has been tested for functionality in vivo, for example, by rescuing Rhino mutant phenotypes. This is particularly relevant given that no spreading is observed with this construct.

      We thank the reviewer for raising this important point. We have not directly tested the functionality of the FLAG-Rhino construct used in OSCs in living flies; i.e., it has not been used to rescue Rhino mutant phenotypes in flies. We acknowledge that FLAG-Rhino has not previously been expressed in OSCs, and that its localization pattern in OSCs differs from that observed in ovaries, where Rhino is endogenously expressed. However, several lines of evidence suggest that the addition of the N-terminal FLAG tag is unlikely to compromise Rhino function:

      (1) In previous studies, N-terminally tagged Rhino (e.g., 3xFLAG-V5-Precision-GFP-Rhino) was expressed in a living Drosophila ovary and was shown to localize properly to piRNA clusters, indicating that the tag does not prevent Rhino from binding its genomic targets (Baumgartner et al., 2022; eLife. Fig. 3 supplement 1G).

      (2) In Drosophila S2 cells, a FLAG-tagged tandem Rhino chromodomain construct was shown to bind H3K9me3/H3K27me3 bivalent chromatin, demonstrating that the FLAG tag does not impair this fundamental chromatin interaction (Akkouche et al., 2025; Nat Struct Mol Biol. Fig. 4b).

      (3) GFP-tagged Rhino has been demonstrated to rescue the transposon derepression phenotype of Rhino mutant flies, further supporting that the addition of tags does not abolish its in vivo function (Parhad et al., 2017; Dev Cell. Fig. 1D).

      Therefore, we interpret the partial localization of FLAG-Rhino in OSCs as reflecting the specific chromatin environment and regulatory context of OSCs rather than functional impairment due to the FLAG tag.

      (3) Given the low levels of piRNA production and the absence of measurable effects on transposon expression or fertility upon DART4 knockdown, the rationale for classifying these regions as piRNA clusters should be clearly stated. Additional experiments could help clarify whether low-level piRNA production from these loci is sufficient to guide Piwi-dependent silencing. The authors should also consider and discuss the possibility that some of these differences may reflect background-specific genomic variation rather than DART4-dependent regulation per se.

      We thank the reviewer for the insightful comments. As noted, DART4 knockdown did not measurably affect transposon expression or fertility. piRNAs generated from DART4-associated clusters associate with Piwi but are insufficient for target repression. Although loss of DART4 largely eliminated piRNAs from these clusters, the cluster-derived transcripts themselves were unchanged. To clarify this point, we now refer to these regions as DART4-dependent piRNA-source loci (DART4 piSLs) in the revised text. We also acknowledge that some observed differences may reflect strain-specific genomic variation and have added this caveat on page 16.

      (4)  The authors should describe the genomic context of DART4 clusters in more detail. Specifically, it would be helpful to indicate whether these regions overlap with known transposable elements, gene bodies, or intergenic regions, and to report the typical size range of the clusters. Are any of the piRNAs produced from these clusters predicted to target known transcripts? 

      We thank the reviewer for the insightful comments. The overlap of DART4 piSL with transposable elements, gene bodies, and intergenic regions is shown in the right panel of Supplementary Fig. 6E (denoted as “Rhino reduced regions in DART4 GLKD” in the figure). The typical size range of these clusters is presented in Supplementary Fig. 6G. The annotation of piRNA reads derived from these piSL is shown in the right panel of Supplementary Fig. 6F, indicating that most of them appear to target host genes. The specific genes and transposons matched by the piRNAs produced from DART4 piSL are listed in Supplementary Table 8.

      (5)  While correlations between Rhino and ADMA histone marks (especially H3R8me2a, H3R17me2a, H4R3me2a) are robust, many ADMA-enriched regions do not recruit Rhino. Please discuss this observation and consider the possible involvement of additional factors.

      We thank the reviewer for the insightful comments. As pointed out, not all ADMA-enriched regions recruit Rhino; rather, Rhino is recruited only at sites where ADMAs overlap with H3K9me3. Furthermore, the combination of H3K9me3 and ADMAs alone does not fully account for the specificity of Rhino recruitment, suggesting the involvement of additional co-factors (for example, other ADMA marks such as H3R42me2a, or chromatin-interacting proteins). In addition, since histone modifications, including arginine methylation, may be secondary consequences of modifications on other proteins rather than primary regulatory events, it is possible that DART1/4 contribute to Rhino recruitment not only through histone methylation but also via arginine methylation of non-histone chromatin-interacting factors. However, methylation of HP1a does not appear to be involved (Supplementary Fig. 3G). We have added new sentences about these points in the Discussion section (page 18).

      (6) The manuscript states that Kipferl is present at DART4 clusters but does not stabilize Rhino binding. Please specify which experimental results support this conclusion and explain.

      We apologize for the lack of clarity regarding Kipferl data. Supplementary Fig. 7A and 7B show that Kipferl localizes at major DART4 piSL. This Kipferl localization is lost together with Rhino upon DART4 GLKD, indicating that Rhino localization at DART4 piSL depends on DART4 rather than on Kipferl. From these results, we infer that, unlike at authentic piRNA clusters, Kipferl may not be sufficient to stabilize the association of Rhino with the genome at DART4 piSL. We have added this interpretation on page 14.

      Minor points

      (1) Figure 1D: Please specify which piRNA clusters are included in the metaplot - all clusters, or only the major producers? 

      We thank the reviewer for the question. The metaplot was not generated from a predefined list of “all” piRNA clusters or only the “major producers.” Instead, it was constructed from Rhino ChIP–seq peaks (“Rhino domains”) that are ≥1.5 kb in length. These Rhino domains mainly correspond to the subregions within major dual-strand clusters (e.g., 42AB, 38C) as well as additional clusters such as 80F, 102F, and eyeless, among others. We have provided the full list of domains and their corresponding piRNA clusters (with genomic coordinates) in Supplementary Table 9 and added an explanation to the Fig. 1D legend.

      (2) Supplemental Figure 5E is referred to as 5D in the main text.

      We corrected the figure citations on pages 11-12: the reference to Supplementary Fig. 5E has been changed to 5D, and the reference to Supplementary Fig. 5F has been changed to 5E.

      (3) Supplemental Figure 7C: The color legend does not match the pie chart, which may confuse readers.

      We thank the reviewer for the helpful comment. We are afraid we were not entirely sure what specific aspect of the legend was confusing, but to avoid any possible misunderstanding, we revised Supplemental Fig. 7C so that the color boxes in the legend now exactly match the corresponding colors in the pie chart. We hope this modification improves clarity.

      (4) Since the manuscript focuses on the roles of DART1 and DART4, including their expression profiles in OSCs and ovaries would help contextualize the observed phenotypes. Please consider adding this information if available.

      We thank the reviewer for the suggestion. We have now included a scatter plot comparing RNA-seq expression in OSCs and ovaries (Supplementary Fig. 3H). In these datasets, DART1 is strongly expressed in both tissues, whereas DART4 shows no detectable reads. Notably, ref. 28 reports strong expression of both DART1 and DART4 in ovaries by western blot and northern blot. In our own qPCR analysis in OSCs, DART4 expression is about 3% of DART1, which, although low, may still be sufficient for functional roles such as modification of H3R17me2a (Fig. 3C, Supplementary Fig. 3F and 3I). We have added these new data and additional explanation in the revised manuscript (page 11).

      (5) Several of the genome browser snapshots, particularly scale and genome coordinates, are difficult to read. 

      We apologize for the difficulty in reading several of the genome browser snapshots in the original submission. We have re-generated the relevant figures using IGV, which provides clearer visualization of scale and genome coordinates. The previous images have been replaced with the improved versions in the revised manuscript.

      Reviewer #2 (Recommendations for the authors):

      (1) The authors need to elaborate on what this sentence means, as it is very unclear what they are describing about Rhino residency: "The results show that Rhino in OSCs tends to reside in the genome where Rhino binds locally in the ovary (Fig. 1C)." 

      We apologize for the lack of clarity in the original sentence. The text has been revised as follows:

      “Rhino expressed in OSCs bound predominantly to genomic sites exhibiting sharp and interspersed Rhino localization patterns in the ovary, while showing little localization within broad Rhino domains, including major piRNA clusters.”

      In addition, to clarify the behavior of Rhino at broad domains, we have added the phrase “the terminal regions of broad domains, such as major piRNA clusters” to the subsequent sentence.

      (2) The red correlation line is very confusing in Figure 5F. What sort of line does this mean in this scatter plot? 

      We apologize for the lack of clarity regarding the red line in Fig. 5F. The red line represents the least-squares linear regression fit to the data points, calculated using the lm() function in R, and was added with abline() to illustrate the correlation between ctrl GLKD and DART4 GLKD values. In the revised figure, we have clarified this in the legend by specifying that it is a regression line.
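
      For readers unfamiliar with how such a line is obtained, the following is a minimal Python sketch of an ordinary least-squares fit. It is an illustration under stated assumptions, not the R code (lm() and abline()) used to produce Fig. 5F. The function name fit_line and the example values are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x and sum of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical paired values standing in for ctrl GLKD (x) and DART4 GLKD (y).
ctrl = [0.0, 1.0, 2.0, 3.0]
dart4 = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(ctrl, dart4)
```

      The fitted line minimizes the sum of squared vertical distances between the points and the line, which is the same criterion a simple linear model applies when regressing one variable on another.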

      (3) There is no confirmation of the successful knockdown of the various DARTs in the OSCs.

      We thank the reviewer for the comment. The knockdown efficiency of the various DARTs in OSCs was confirmed by RT–qPCR. The data are now shown in Supplementary Fig. 3J. 

      (4) What is the purpose of an unnumbered "Method Figure" in the supplementary data file? Why not just give it a number and mention it properly in the text? 

      We thank the reviewer for the suggestion. We have now assigned a number to the previously unnumbered "Method Figure" and have included it as Supplementary Fig. 9.

      The figure is now properly cited in the Methods section.

      (5) For Figure 5A, those fly strain numbers in the labels are better reserved in the Methods, and a more appropriate label is to describe the GAL4 driver and the UAS-RNAi construct by their conventional names.

      We thank the reviewer for the suggestion. The labels in Fig. 5A have been updated to use the conventional names of the GAL4 drivers and UAS-RNAi constructs. Specifically, they now read Ctrl GLKD (nos-GAL4 > UAS-emp) and DART4 GLKD (nos-GAL4 > UAS-DART4). The original fly strain numbers are listed in the Methods section.

    1. Author response:

      The following is the authors’ response to the original reviews.

      eLife Assessment

      This useful study presents the potentially interesting concept that LRRK2 regulates cellular BMP levels and their release via extracellular vesicles, with GCase activity further modulating this process in mutant LRRK2-expressing cells. However, the evidence supporting the conclusions remains incomplete, and certain statistical analyses are inadequate. This work would be of interest to cell biologists working on Parkinson's disease.

      Reviewer #1 (Public review):

      Summary:

      Even though mutations in LRRK2 and GBA1 (which encodes the protein GCase) increase the risk of developing Parkinson's disease (PD), the specific mechanisms driving neurodegeneration remain unclear. Given their known roles in lysosomal function, the authors investigate how LRRK2 and GCase activity influence the exocytosis of the lysosomal lipid BMP via extracellular vesicles (EVs). They use fibroblasts carrying the PD-associated LRRK2-R1441G mutation and pharmacologically modulate LRRK2 and GCase activity.

      Strengths:

      The authors examine both proteins at endogenous levels, using MEFs instead of cancer cells. The study's scope is potentially interesting and could yield relevant insights into PD disease mechanisms.

      Weaknesses:

      Many of the authors' conclusions are overstated and not sufficiently supported by the data. Several statistical errors undermine their claims. Pharmacological treatment is very long, leading to potential off-target effects. Additionally, the authors should be more rigorous when using EV markers.

      We thank the reviewer for these valuable observations. In the revised manuscript, we have addressed each of these points as follows:

      (1) Conclusions and data support – We carefully revised our text throughout the manuscript to ensure that all conclusions are better supported by the presented data. For instance, we now explicitly state that while pharmacological modulation supports the regulatory role of LRRK2 activity in EV-mediated BMP release, we have softened our conclusions concerning the contribution of GCase in this model (see revised Results and Discussion sections).

      (2) Statistical analyses – We reanalyzed experiments involving more than two groups and replaced simple t-tests with non-parametric Kruskal-Wallis tests followed by Dunn’s post hoc comparisons. This approach, described in the updated figure legends (e.g., Figure 2D-F and H-J), provides a more rigorous statistical framework that accounts for small sample sizes and variability typical of EV quantifications.

      (3) Pharmacological treatment duration – Prolonged MLi-2 treatments have been extensively used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have applied long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202). In our study, 48-hour incubations were necessary to sustain full LRRK2 inhibition throughout the extracellular vesicle (EV) collection period. EV biogenesis, BMP biosynthesis, and packaging into EVs are time-dependent processes; therefore, extended incubation and collection periods (48 h) were required to allow downstream effects of LRRK2 inhibition on BMP production and release to manifest, and to obtain sufficient EV material for biochemical and lipidomic analyses. This experimental design also reflects our and others’ previous observations in humans and non-human primates, where urinary BMP changes are associated with chronic or subchronic LRRK2 inhibitor treatment (Baptista MAS, Merchant K, et al. Sci Transl Med. 2020, 12:eaav0820; Jennings D, et al. Sci Transl Med. 2022, 14:eabj2658; Maloney MT, et al. Mol Neurodegener. 2025, 20:89). Importantly, under these conditions, we did not observe significant changes in cell viability or morphology, supporting that the treatment was well tolerated. We have clarified this rationale in the revised Methods section to emphasize that the prolonged incubation reflects the experimental design for EV isolation rather than a requirement for achieving LRRK2 inhibition.

      (4) EV markers – We and others have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022). Moreover, LAMP proteins have been reported to be more enriched in EVs of endolysosomal origin (Mathieu et al., 2021). To further strengthen this point, we performed new experiments using a CD63-pHluorin sensor combined with TIRF microscopy, which allowed real-time visualization of CD63-positive exosome release. These new data (now presented in Figure 7, Panels G-I; Videos 1 and 2) confirm increased CD63-positive EV release in LRRK2 mutant fibroblasts, which was reversed by LRRK2 inhibition with MLi-2. The CD63-positive compartment was also largely BMP-positive (new Figure 7D, F, G), reinforcing our conclusions and providing additional rigor in EV marker validation.

      Reviewer #2 (Public review):

      Summary:

      In this paper, the authors used MEFs expressing the R1441G mutant of leucine-rich repeat kinase 2 (LRRK2), a mutant associated with the early onset of Parkinson's disease. They report that in these cells LAMP2 fluorescence is higher but BMP fluorescence is lower, MVE size is reduced, and that MVEs contain fewer ILVs. They also report that LAMP2-positive EVs are increased in mutant cells in a process sensitive to LRRK2 kinase inhibition but are further increased by glucocerebrosidase (GCase) inhibition, and that total di-22:6-BMP and total di-18:1-BMP are increased in mutant LRRK2 MEFs compared to WT cells by mass spectrometry. They also report that LRRK2 kinase inhibition partially restores cellular BMP levels, and that GCase inhibition further increases BMP levels, and that in EVs from the LRRK2 mutant, LRRK2 inhibition decreases BMP while GCase inhibition has the opposite effect. Moreover, they report that the BMP increase is not due to increased BMP synthesis, although the authors observe that CLN5 is increased in LRRK2 mutant cells. Finally, they report that GW4869 decreases EV release and exosomal BMP, while bafilomycin A1 increases EV release. They conclude that LRRK2 regulates BMP levels (in cells) and release (via EVs). They also conclude that the process is modulated by GCase in LRRK2 mutant cells, and that these studies may contribute to the use of BMP-positive EVs as a biomarker for Parkinson's disease and associated treatments.

      Strengths:

      This is an interesting paper, which provides novel insights into the biogenesis of exosomes with exciting biomedical potential. However, I have comments that authors need to address to clarify some aspects of their study.

      Weaknesses:

      (1) The intensity of LAMP2 staining is increased significantly in cells expressing the R1441G mutant of LRRK2 when compared to WT cells (Figure 1C). Yet mutant cells contain significantly smaller MVEs with fewer ILVs, and the MVE surface area is reduced (Figure 1D-F). This is quite surprising since LAMP2 is a major component of the limiting membrane of late endosomes. Are other proteins of endo-lysosomes (eg, LAMP1, CD63, RAB7) or markers (lysotracker) also decreased (see also below)?

      As referenced in our original manuscript, several previous studies have reported endolysosomal morphological and homeostatic defects in cells harboring pathogenic LRRK2 mutations. LAMP2 can be upregulated as part of a lysosomal biogenesis or stress response (e.g., via MiT/TFE transcription factors such as TFEB; Sardiello et al., Science 2009, 325:473-477), whereas ILV biogenesis is primarily controlled by ESCRT- and SMPD3-dependent pathways that are regulated independently of MiT/TFE-driven transcriptional programs. Indeed, Stuffers et al. (Traffic 2009, 10:925-937) demonstrated that depletion of key ESCRT subunits markedly inhibited ILV formation while concomitantly increasing LAMP2 expression, highlighting the mechanistic dissociation between LAMP2 abundance and ILV number. In our study, we observed a similar pattern in R1441G LRRK2 MEFs, in which elevated LAMP2 staining and protein levels occurred despite a reduction in MVE size and ILV number. We interpret this as a compensatory lysosomal biogenesis response.

      Our revised manuscript now includes new immunofluorescence data for BMP, LAMP1 and CD63 (New Figure 7, Panels A-F) together with biochemical analysis of CD63 protein levels (New Supplemental Figure 4, Panel B) in human skin fibroblasts derived from healthy donors and LRRK2 G2019S PD patients. Quantitative analysis of these experiments revealed no statistically significant differences in total cellular levels of either LAMP1 or CD63 between groups. However, we observed a consistent decrease in BMP immunostaining intensity (New Figure 7, Panel A and B), in agreement with our findings in mouse fibroblasts. We therefore propose that the elevated LAMP2 expression observed in the engineered MEF clone expressing R1441G may reflect a cell type-specific effect, potentially linked to differential penetrance of LRRK2 signaling on the lysosomal biogenesis response. We have updated the Results and Discussion section of the manuscript to incorporate and clarify these findings.

      (2) LRRK2 has been reported to interact with endolysosomal membranes. Does the R1441G mutant bind LAMP2- and/or BMP-positive membranes? 

      We agree that LRRK2 has been reported to associate dynamically with endolysosomal membranes, particularly under conditions of endolysosomal stress or damage (Eguchi T, et al. PNAS 2018, 115:E9115-E9124; Bonet-Ponce L, et al. Sci Adv. 2020, 6:eabb2454; Wang X, et al. Elife. 2023, 12:e87255).

      Nevertheless, to explore whether LRRK2 associates with BMP-positive endolysosomes, we performed subcellular fractionation followed by biochemical analysis of endolysosomal fractions, since our available LRRK2 antibodies did not provide reliable immunofluorescence signals. These experiments were carried out using human skin fibroblasts derived from both healthy controls and Parkinson’s disease patients carrying the LRRK2-G2019S mutation. In both control and mutant fibroblasts, a pool of LRRK2 was detected in fractions positive for the BMP synthase CLN5 and the endolysosomal marker CD63 (New Supplementary Figure 4, Panel A), supporting the localization of LRRK2 to endolysosomal membranes that are likely BMP-enriched. Our manuscript’s Results and Methods sections have been updated accordingly.

      Does the mutant affect endolysosomes?

      As referenced in our original manuscript, several studies have reported that pathogenic LRRK2 mutations can lead to endolysosomal defects. Consistent with these reports, we also observed morphological alterations in endolysosomes of cells expressing mutant LRRK2, including reduced MVE size and fewer ILVs, as shown in Figure 1D–F. These observations are in agreement with previously described phenotypes associated with pathogenic LRRK2 variants. Furthermore, in mutant LRRK2 MEFs, and now in human-derived fibroblasts (see new Figure 7, Panel A and B), we observed a decrease in BMP immunostaining signal.

      (3) Immunofluorescence data indicate that BMP is decreased in mutant LRRK2-expressing cells compared to WT (Figure 1A-B), but mass spec data indicate that di-22:6-BMP and di-18:1-BMP are increased (Figure 3). Authors conclude that the BMP pool detected by mass spec in mutant cells is less antibody-accessible than that present in wt cells, or that the anti-BMP antibody is less specific and that it detects other analytes. This is an awkward conclusion, since the IF signal with the antibody is lower (not higher): why would the antibody be less specific? Could it be that the antibody does not see all BMP isoforms equally well? Moreover, the observations that mutant cells contain smaller MVEs (Figure 1D-F) with fewer ILVs are consistent with the IF data and reduced BMP amounts. This needs to be clarified.

      As previously reported by us (Lu et al., J Cell Biol 2022;221:e202105060) and others (Berg AL, et al. Cancer Lett. 2023, 557:216090), discrepancies can occur between BMP levels detected by immunofluorescence and those quantified by mass spectrometry. This is because immunostaining reflects the pool of antibody-accessible BMP, whereas lipidomics measures the total cellular content of all BMP molecular species, irrespective of their distribution or accessibility.

      We agree that the anti-BMP antibody may not detect all BMP isoforms equally well. Differences in acyl chain composition (such as the degree of saturation or chain length) can alter the stereochemistry of BMP and, consequently, epitope accessibility to antibody binding.

      In addition, in a personal communication with Monther Abu-Remaileh (Stanford University), we were informed that the antibody may also cross-react with other lipid species in endolysosomes. Nevertheless, since there is no formal evidence supporting this, we have removed the sentence in the Discussion section stating “Alternatively, the antibody may also detect non-BMP analytes” to avoid any potential misinterpretations. In its place, we have added a short statement noting that “not all BMP isoforms may be detected equally well”.

      Mass spectrometry data are only shown for two BMP species (di-22:6, di-18:1). What are the major BMP isoforms in WT cells? The authors should show the complete analysis for all BMP species if they wish to draw quantitative conclusions about the amounts of BMP in wt and mutant cells. Finally, BMP and PG are isobaric lipids. Fragmentation of BMPs or PGs results in characteristic fingerprints, but the presence of each daughter ion is not absolutely specific for either lipid. This should be clarified, e.g., were BMP and PG separated before mass spec analysis? Was PG affected? The authors should also compare the BMP data with mass spec data obtained with a control lipid, e.g., PC.

      Regarding BMP isoforms, our targeted UPLC-MS/MS analyses revealed that 2,2′-di-22:6-BMP (sn2/sn2′) and 2,2′-di-18:1-BMP (sn2/sn2′) are the predominant BMP isoforms in MEF cells, consistent with previous reports showing docosahexaenoyl (22:6; DHA) and oleoyl (18:1) BMP as the most abundant isoforms. Across diverse mammalian cells and tissues, BMP typically exhibits a fatty acid composition dominated by oleoyl, with polyunsaturated fatty acids (particularly DHA) also contributing substantially. Enrichment of DHA-containing BMP species has been observed in multiple systems, including rat uterine stromal cells, PC12 cells, THP-1 and RAW macrophages, as well as in rat and human liver. This consistent presence of oleoyl- and docosahexaenoyl-containing BMP species across tissues indicates that these acyl chains are conserved features influencing the lipid’s structural and functional characteristics (Kobayashi et al. J Biol Chem, 2002; Hullin-Matsuda et al. Prostaglandins Leukotrienes Essent Fatty Acids, 2009; Thompson et al. Int J Toxicol. 2012; Delton-Vandenbroucke et al. J Lipid Res, 2019).

      Nevertheless, we have included a Table (Panel H in updated Supplemental Figure 1) showing other BMP species that were also detected in our lipidomics analysis. Overall, dioleoyl (18:1)- and di-docosahexaenoyl (22:6)-BMP species were the most abundant in MEF cells, whereas di-arachidonoyl (20:4)- and di-linoleoyl (18:2)-BMP isoforms were present at lower levels. Consistently, R1441G LRRK2 MEFs displayed higher levels of dioleoyl- and di-docosahexaenoyl-BMP compared with WT cells, and these elevations were reduced following LRRK2 kinase inhibition with MLi-2. Data from three independent representative experiments are shown, and the manuscript has been revised accordingly to include these results.

      Regarding the separation of BMP and PG species, we confirm that BMP and PG were chromatographically resolved prior to MS/MS detection using a validated UPLC-MS/MS method developed by Nextcea, Inc. PG exhibits a substantially longer LC retention time than BMP, ensuring complete baseline separation. This approach (established by Nextcea nearly two decades ago and later validated through a multi-year collaboration with the U.S. FDA to clinically qualify di-22:6-BMP as a biomarker) prevents any ambiguity arising from the isobaric nature of BMP and PG species. No changes in PG levels were detected under any experimental conditions.

      Finally, we employed isotope-labeled BMP as an internal standard to ensure robust normalization across samples. These additional details and references cited above have been included in the revised Methods and References sections to further clarify the analytical rigor of our lipidomics workflow.

      (4) It is quite surprising that the amounts of labeled BMP continue to increase for up to 24h after a short 25min pulse with heavy BMP precursors (Figure 4B).

      In these isotope-labeling experiments, it is important to note (as described in our original manuscript) that two distinct pools of metabolically labeled BMP species were detected: semi-labeled BMP (with only one heavy isotope-labeled fatty acyl chain) and fully-labeled BMP (with both fatty acyl chains labeled). We consider the fully-labeled BMP pool to provide the most reliable readout for BMP turnover, as it showed a rapid decline after a 1h chase (decreasing by more than 50% within 8 h in all conditions), reaching its lowest levels at the end of the 48-h chase period.

      The apparent increase in semi-labeled BMP species over time may be explained by continued incorporation of labeled precursors following the initial pulse. Specifically, once existing semi-labeled and fully-labeled BMP molecules are degraded by PLA2G15 (Nyame K, et al. Nature 2025, 642:474-483), the resulting isotope-labeled lysophosphatidylglycerol (LPG) and fatty acids could be recycled and re-enter a new round of BMP biosynthesis, leading to a gradual accumulation of semi-labeled BMP such as di-18:1-BMP. Why would this reasoning not also apply to the fully-labeled species? Once the pulse is completed, newly incorporated non-labeled fatty acyl chains present in the cellular pool can compete with labeled ones during subsequent rounds of lipid remodeling or synthesis. As a result, the probability of generating semi-labeled BMP molecules becomes higher than that of forming fully-labeled species. Consistent with this, our data show an increase in only semi-labeled BMP species (but not in fully-labeled ones) up to 24 hours after the pulse. We have added a clarification regarding this point in the revised manuscript.
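
      The competition argument above can be made concrete with a deliberately simplified sketch. It assumes, purely for illustration, that each of the two acyl chains of a newly made BMP molecule is drawn independently from a precursor pool in which a fraction p of chains carries the heavy label; this independence assumption, the function names, and the example values of p are not taken from the manuscript.

```python
# Illustrative sketch under a stated assumption: the two acyl chains of BMP
# are labeled independently, each with probability p (hypothetical parameter).
def labeling_probabilities(p):
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    unlabeled = (1.0 - p) ** 2      # neither chain labeled
    semi = 2.0 * p * (1.0 - p)      # exactly one chain labeled
    full = p ** 2                   # both chains labeled
    return unlabeled, semi, full

def semi_exceeds_full(p):
    # Under this toy model, semi-labeled species dominate whenever p < 2/3.
    _, semi, full = labeling_probabilities(p)
    return semi > full

# After the pulse, unlabeled chains dilute the pool, so p is expected to fall.
diluted = labeling_probabilities(0.2)
enriched = labeling_probabilities(0.8)
```

      In this toy model, once the labeled fraction drops below two thirds, a newly synthesized molecule is more likely to be semi-labeled than fully-labeled, which is in line with the qualitative reasoning given in the response.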

      (5) It is argued that upregulation of CLN5 may be due to an overall upregulation of lysosomal enzymes, as LAMP2 levels were also increased (Figure 2A, C, E). Again, this is not consistent with the observed decrease in MVE size and number (Figure 1D-F). As mentioned above, other independent markers of endo-lysosomes should be analyzed (e.g., LAMP1, CD63, RAB7), and/or other lysosomal enzymes (e.g., cathepsin D).

      Our revised manuscript now includes new immunofluorescence data for BMP, LAMP1 and CD63 (New Figure 7, Panels A-F) together with biochemical analysis of CD63 protein levels (New Supplemental Figure 4, Panel B) in human skin fibroblasts derived from healthy controls and LRRK2 G2019S PD patients. Quantitative analysis of these experiments revealed no statistically significant differences in total cellular levels of either LAMP1 or CD63 between groups. However, our results consistently show increased CLN5 protein levels in both mouse and human fibroblast cell lines harboring pathogenic LRRK2 mutations. Upregulation of CLN5 may reflect a compensatory effect from loss of BMP via EV exocytosis. As discussed above, the elevated LAMP2 signal observed in the engineered MEF clone expressing R1441G could represent a cell type-specific effect, potentially linked to differential penetrance of LRRK2 signaling on the lysosomal biogenesis response. Our Results and Discussion sections have been updated accordingly.

      (6) The authors report that the increase in BMP is not due to an increase in BMP synthesis (Figure 4), although they observe a significant increase in CLN5 (Figure 5A) in LRRK2 mutant cells. Some clarification is needed.

      In our original manuscript, we proposed that although CLN5 protein levels are increased in R1441G LRRK2 MEFs, the absence of significant changes in BMP synthesis rates (Figure 4B, C) may reflect either limited substrate availability or that CLN5 is already operating near its maximal enzymatic capacity. Our new subcellular fractionation data (new Figure 7, Panel A) further indicate that, despite a relative increase in total CLN5 levels in G2019S LRRK2 human fibroblasts, the amount of CLN5 associated with endolysosomes remains comparable between mutant LRRK2 and control cells. This suggests that a considerable fraction of upregulated CLN5 may not localize to endolysosomes, potentially accumulating in the endoplasmic reticulum due to enhanced translation or impaired trafficking. Unfortunately, the available anti-CLN5 antibody did not yield reliable immunofluorescence signals, preventing us from directly confirming this possibility. Nevertheless, in light of our new data (new Supplemental Figure 4A), we have included a clarification in the revised manuscript discussing this possibility as well.

      (7) Authors observe that both LAMP2 and BMP are decreased in EVs by GW4869 and increased by bafilomycin (Figure 6). Given my comments above on Figure 1, it would also be nice to illustrate/quantify the effects of these compounds on cells by immunofluorescence.

      We appreciate the reviewer’s suggestion. We have previously published immunofluorescence data showing increased BMP accumulation in endolysosomes following treatment with bafilomycin A1 (Lu A, et al. J Cell Biol. 2009, 184:863-879). However, in the present study, our lipidomics analyses revealed a decrease in both di-22:6-BMP and di-18:1-BMP species in cells treated with this compound. As discussed above, this apparent discrepancy likely reflects methodological differences between immunofluorescence, which detects only antibody-accessible BMP pools, and lipidomics, which quantifies total cellular BMP content.

      Moreover, in a recent study (Andreu Z, et al. Nanotheranostics 2023, 7:1-21), BMP levels were analyzed by immunofluorescence in cells treated with spiroepoxide, a potent and selective irreversible inhibitor of nSMase (different from GW4869) known to block EV release. Spiroepoxide-treated cells showed decreased BMP immunostaining; a result that, again, does not align with mass spectrometry data revealing increased cellular BMP levels upon GW4869 treatment. Notably, in that study, spiroepoxide was used instead of GW4869 because the intrinsic autofluorescence of GW4869 could potentially interfere with the immunofluorescence BMP signal.

      We therefore consider lipidomics measurements to provide a more reliable and quantitative representation of BMP dynamics under these conditions.

      Reviewer #1 (Recommendations for the authors):

      Major concerns:

      (1) 48 h for MLi2 treatment seems too long. LRRK2 kinase activity is inhibited with much shorter incubation times. The longer the incubation, the more likely off-target effects are. The authors should repeat these experiments with 1-2 h of MLi2.

      We thank the reviewer for this valuable comment. We acknowledge that MLi-2 is a potent and selective LRRK2 kinase inhibitor that achieves near-complete target engagement within a few hours of treatment. However, prolonged exposure has been widely used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have employed long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202).

      In our study, 48-hour incubations were necessary to sustain full LRRK2 inhibition throughout the extracellular vesicle (EV) collection period. EV biogenesis, BMP biosynthesis, and packaging into EVs are time-dependent processes; therefore, extended incubation and collection periods (48 h) were required to allow downstream effects of LRRK2 inhibition on BMP production and release to manifest, and to obtain sufficient EV material for biochemical and lipidomic analyses. This experimental design also reflects our and others’ previous observations in humans and non-human primates, where urinary BMP changes are associated with chronic or subchronic LRRK2 inhibitor treatment (Baptista MAS, Merchant K, et al. Sci Transl Med. 2020, 12:eaav0820; Jennings D, et al. Sci Transl Med. 2022, 14:eabj2658; Maloney MT, et al. Mol Neurodegener. 2025, 20:89). Importantly, under these conditions, we did not observe significant changes in cell viability or morphology, supporting that the treatment was well tolerated.

      We have clarified this rationale in the revised Methods section to emphasize that the prolonged incubation reflects the experimental design for EV isolation rather than a requirement for achieving LRRK2 inhibition.

      (2) Is there a reason why the authors don't include CD81, CD63, and Syntenin-1 in their study as EV markers? Using solely Flotillin-1 does not seem to be enough to justify their claims.

      We actually used not only Flotillin-1 but also LAMP2 as EV markers in our study. While both Flotillin-1 and LAMP2 detection on EVs may vary depending on the cell type, we and others have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022). In particular, one of these studies reported that “LAMP1-positive subpopulations of EVs represent MVB/lysosome-derived exosomes, which also contain syntenin-1.” Therefore, our choice of EV markers (LAMP2 and Flotillin-1) is consistent with those previously and reliably used to characterize small EVs.

      Nevertheless, to further address the reviewer’s concern, we performed additional experiments using a CD63-based fluorescence sensor (CD63-pHluorin), which, combined with TIRF microscopy, enables real-time visualization of CD63-positive exosome release. These experiments were conducted in control and LRRK2-mutant fibroblasts, and the data are presented in new Figure 7 (Panels G-I; Videos 1 and 2). We have also included all relevant references and clarified this point in the revised manuscript.

      (3) Indeed, to quantify the amount of certain proteins in EVs, the authors should normalize them by CD63 or CD81.

      Protein normalization in isolated EV fractions is indeed challenging. Although tetraspanins such as CD63 and CD81 are commonly enriched in EVs, their abundance can vary considerably across EV subpopulations, cell types, and experimental conditions, making them unreliable as universal normalization markers (Théry et al., J Extracell Vesicles, 2018; Margolis & Sadovsky, Nat Rev Mol Cell Biol, 2019). Current guidelines from the International Society for Extracellular Vesicles (ISEV), as described in the Minimal Information for Studies of Extracellular Vesicles 2018 (MISEV2018; Théry C, et al. J Extracell Vesicles. 2018, 7:1535750) and updated in MISEV2024 (Welsh JA, et al. J Extracell Vesicles. 2024, 13:e12404), recommend reporting multiple EV markers rather than relying on a single protein for normalization. They also suggest ensuring comparable experimental conditions by using the same number of cells at the start of the experiment and normalizing EV data to cell number or whole-cell lysate protein content at the end of the experiment, among other approaches.

      In our study, we normalized EV data to whole-cell lysate (WCL) protein content, as this approach accounts for differences in EV production due to variations in cell number or treatment conditions and is commonly used in the field (Kowal et al., PNAS, 2016; Mathieu et al., Nat Commun, 2021). We also included Flotillin-1 and LAMP2 as EV markers, both of which have been validated as molecular markers of small EV subpopulations.

      (4) Hyper normalization in WB quantification in Figure 2E-G is statistically incorrect, as it assumes that one group (in this case, R1441G ctrl) has no variability at all, which is not biologically possible. The authors should repeat the quantification without hypernormalizing one of their groups. This issue is prevalent across the whole manuscript.

      We understand the concern regarding “hyper-normalization” (i.e., expressing all values relative to one condition set to 1), which may mask variability in the reference group. However, it is standard practice in immunoblotting analysis to express data relative to a control condition for comparison, as variations in membrane transfer, exposure time, and signal development can differ across blots. In our case, the data are expressed as relative levels (arbitrary units) rather than absolute quantitative values. To facilitate comparison between datasets and account for inter-experimental variation, we continued to express values relative to the mutant LRRK2 MEF condition.

      On the other hand, in lipidomics experiments, despite using the same number of seeded cells and identical extraction and analysis protocols, minor biological and technical variability was observed across independent replicates. This variability is inherent to the experimental system and is now explicitly represented in the new table included in Supplemental Figure 1F, which compiles three independent representative lipidomics experiments showing quantitative BMP levels across different conditions.
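
      The reviewer's concern in point (4) can be illustrated with a small sketch. The band intensities below are invented, and the second scheme (dividing each band by the summed signal of its blot) is only one possible alternative picked for demonstration; it is not the authors' method and is not claimed to be the preferred one.

```python
from statistics import stdev

# Hypothetical raw band intensities (arbitrary units) from three blots.
blots = [
    {"reference": 100.0, "treated": 60.0},
    {"reference": 150.0, "treated": 75.0},
    {"reference": 80.0, "treated": 56.0},
]

def normalize_to_reference(blots):
    # Each blot is divided by its own reference band, so the reference
    # becomes exactly 1.0 in every replicate.
    return [{k: v / b["reference"] for k, v in b.items()} for b in blots]

def normalize_to_blot_sum(blots):
    # Alternative for illustration: divide by the total signal on the blot,
    # which leaves replicate-to-replicate spread in every group.
    return [{k: v / sum(b.values()) for k, v in b.items()} for b in blots]

ref_norm = normalize_to_reference(blots)
sum_norm = normalize_to_blot_sum(blots)

# Spread of the reference group under each scheme.
ref_sd_fixed = stdev([b["reference"] for b in ref_norm])
ref_sd_sum = stdev([b["reference"] for b in sum_norm])
```

      With reference normalization the reference group has zero standard deviation by construction, which is why tests that assume variance in both groups are questionable for such data; the treated group still carries replicate variability under either scheme.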

      (5) The authors perform a t-test in Figure 2E-G when comparing more than 2 groups, which is wrong. The authors should use a two-way ANOVA as they are comparing genotype and treatment.

      We appreciate the reviewer’s comment and agree with this observation. The MLi-2 and CBE experiments were performed independently and in separate experimental runs; therefore, we have reanalyzed these datasets separately rather than combining them in a two-way ANOVA. To properly compare more than two groups within each dataset, we have now applied a Kruskal-Wallis test followed by an uncorrected Dunn’s post hoc test (Figure 2 D-F and H-J). This non-parametric approach is more appropriate for our data structure, as EV experiments are usually subject to high variability and immunoblot quantifications involving small sample sizes (n≈6) do not always meet the assumptions of normality or equal variance. The Kruskal-Wallis test does not assume normality or equal variances, making it more robust for small, variable biological datasets. The statistical analyses and figure legend have been updated in the revised manuscript accordingly.

      In addition, since our CBE treatments yielded statistically non-significant data, we have softened our conclusions throughout the manuscript concerning the contribution of GCase activity to EV-mediated BMP release modulation.
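
      For readers unfamiliar with the test named in the response to point (5), the Kruskal-Wallis H statistic can be sketched in a few lines. This is a minimal, dependency-free illustration on made-up numbers; in practice a library routine such as SciPy's scipy.stats.kruskal would normally be used, and Dunn's post hoc comparisons are not implemented here. The p-value helper is valid only for exactly three groups (chi-square approximation with 2 degrees of freedom), which is itself rough for small samples.

```python
import math

def average_ranks(values):
    # Assign 1-based ranks, giving tied values the mean of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    if n < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need non-empty groups with at least 2 values in total")
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Standard correction for ties.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    return h / correction

def p_value_three_groups(h):
    # Chi-square survival function for 2 degrees of freedom.
    return math.exp(-h / 2)

# Hypothetical measurements for three conditions.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(groups)
p_approx = p_value_three_groups(h_stat)
```

      Because the statistic depends only on ranks, it makes no assumption of normality, which is the property the response appeals to; the trade-off is lower power than a parametric test when its assumptions do hold.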

      (6) There is a very strong reduction in flotillin-1 in R1441G cells vs WT (Figure 2G) in the EV fraction. That reduction is further exacerbated with MLi2, which likely means it is not kinase activity dependent. Can the authors comment on that?

      We agree with the reviewer that Flotillin-1 showed a different behavior compared with LAMP2 in these experiments. As recommended by the MISEV guidelines (Théry C, et al. J Extracell Vesicles. 2018, 7:1535750; Welsh JA, et al. J Extracell Vesicles. 2024, 13:e12404), it is important to analyze more than one EV-associated protein marker. We examined LAMP2, which, together with LAMP1, has been reported to be specifically enriched in EVs of endolysosomal origin (exosomes; Mathieu et al., Nat Commun. 2021, 12:4389). In contrast, Flotillin-1 is also associated with small EVs but may represent a distinct EV subpopulation from those positive for LAMP proteins (Kowal J, et al. PNAS 2016, 113:E968-E977).

      Nevertheless, the biochemical analysis of isolated EV fractions was complemented by our lipidomics data and, in the revised version, by TIRF microscopy analysis of exosome release in control and G2019S LRRK2 human fibroblasts (new Figure 7, Panels G-I; Videos 1 and 2). In this analysis, we confirmed increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G). Collectively, these findings further support the regulatory role of LRRK2 activity in EV-mediated BMP secretion.

      (7) In Figure 2C, the authors should express that the LAMP2-EV and flotillin-1 EV fractions from the WB are highly exposed. As presently presented, it is slightly misleading.

      We thank the reviewer for this comment. In EV preparations, the amount of protein recovered is typically very low. Therefore, although we loaded all the EV protein obtained from each sample, the immunoblots for LAMP2 and Flotillin-1 in EV fractions required longer exposure times to visualize clear signals across all conditions. We have now indicated in the corresponding figure legend that these EV blots are long-exposure blots to facilitate signal detection and avoid any potential misunderstanding.

      (8) If Figure 2C and D are from two different experiments, they should not be plotted together in Figure 2E-G. You cannot compare the effect of MLi2 vs CBE if done in completely different experiments.

      We appreciate the reviewer’s comment and agree with this observation. The MLi-2 and CBE experiments were performed independently and in separate experimental runs; therefore, we have reanalyzed these datasets separately rather than combining them in a two-way ANOVA. To properly compare more than two groups within each dataset, we have now applied a Kruskal-Wallis test followed by an uncorrected Dunn’s post hoc test (Figure 2 D-F and H-J). This non-parametric approach is more appropriate for our data structure, as EV experiments are usually subject to high variability and immunoblot quantifications involving small sample sizes (n≈6) do not always meet the assumptions of normality or equal variance. The Kruskal-Wallis test does not assume normality or equal variances, making it more robust for small, variable biological datasets. The revised statistical analyses and figure legends have been updated accordingly in the manuscript.
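
      For readers unfamiliar with the test, the H statistic behind a Kruskal-Wallis comparison can be sketched as follows. This is a minimal illustration, not the authors' analysis pipeline: the function names and the band-intensity values are hypothetical, Dunn's post hoc test is not shown, and the p-value uses the asymptotic chi-square approximation, which can be rough for samples this small.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..N, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(c ** 3 - c for c in Counter(pooled).values())
    return h / (1 - ties / (n ** 3 - n))  # undefined if all values are equal


# Hypothetical normalized band intensities (n = 6 per group), for illustration only.
untreated = [1.00, 1.20, 0.90, 1.40, 1.10, 1.30]
inhibitor = [0.50, 0.70, 0.60, 0.80, 0.40, 0.65]
vehicle = [1.05, 0.95, 1.25, 1.15, 1.35, 0.85]

h = kruskal_wallis_h([untreated, inhibitor, vehicle])
# With three groups, H is referred to a chi-square with 2 df, whose upper tail is exp(-H/2).
p_approx = math.exp(-h / 2)
print(f"H = {h:.2f}, asymptotic p = {p_approx:.4f}")
```

      Because the statistic depends only on ranks, a single extreme densitometry value cannot dominate the result, which is the robustness property the response appeals to.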

      (9) The authors state that "For the R1441G MEF cells, MLi-2 decreased EV concentration while CBE increased EV particles per ml, in agreement with the effects observed in our biochemical analysis." As Figure S1D shows no statistical significance, the authors don't have sufficient evidence to make this claim.

      We apologize for this overstatement. We have revised the text to clarify that, although the differences did not reach statistical significance, a consistent trend toward decreased EV concentration upon MLi-2 treatment and increased EV release following CBE treatment was observed in R1441G MEF cells.

      (10) "Altogether, given that BMP is specifically enriched in ILVs (which become exosomes upon release), the data presented above support our biochemical analysis (Figure 2C, D, F) and suggest a role for LRRK2 and GCase in modulating BMP release in association with LAMP2-positive exosomes from MEF cells." As Figure 3E shows no statistical difference of BMP on EVs upon CBE treatment, this sentence is not accurate and should be reframed. Furthermore, the authors claim an increase in EV-LAMP2 in R1441G cells compared to WT, however, the amount of BMP in EVs of R1441G cells vs WT is unchanged with a non-significant reduction. This contradiction does not support the authors' conclusions and really puts into question their whole model.

      We thank the reviewer for this observation. After reanalyzing our biochemical data from isolated EV fractions (see new Panels D-F and H-J) using an improved statistical approach, we found that although EV-associated LAMP2 levels were consistently elevated in untreated R1441G LRRK2 MEFs compared to WT cells, CBE treatment only produced a non-significant trend toward increased EV-associated LAMP2 compared to untreated R1441G LRRK2 cells. Accordingly, we have revised the sentence to read as follows:

      “Altogether, given that BMP is specifically enriched in ILVs (which become exosomes upon release), the data presented above support our biochemical analysis (Figure 2C, E, G, I) and suggest that LRRK2 activity regulates BMP release in association with LAMP2-positive exosomes, whereas GCase activity appears to have a more variable effect under the tested conditions.”

      We also agree with the reviewer that, in our MEF model, the amount of BMP in EVs of R1441G cells vs WT is unchanged with a non-significant reduction. However, pharmacological modulation supports our conclusion that BMP release is modulated by LRRK2 activity. Specifically, treatment with the LRRK2 inhibitor MLi-2 decreased EV-associated BMP and LAMP2 levels in R1441G LRRK2 MEFs, and our new data (new Figure 7, Panels G-I; Videos 1 and 2) show increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G).

      In light of the reviewer’s comment about CBE treatment, we have softened our conclusions throughout the manuscript concerning the contribution of GCase activity in this model.

      (11) In Figure 5, 16 h of MLi2 treatment is too long and can lead to off-target effects. I would advise reducing it to 1-4 h.

      Prolonged MLi-2 treatments have been extensively used in the field without evidence of significant off-target effects. Several studies, including Fell et al. (2015, J Pharmacol Exp Ther 355:397-409), De Wit et al. (2019, Mol Neurobiol 56:5273-5286), Ho et al. (2022, NPJ Parkinson’s Dis 8:115), Tengberg et al. (2024, Neurobiol Dis 202:106728), and Jaimon et al. (2025, Sci Signal 18:eads5761), have applied long-term (24-48 h) MLi-2 treatments at comparable concentrations without detecting toxicity or off-target alterations, including in MEFs (Ho et al., 2022; Dhekne et al., 2018, eLife 7:e40202). Moreover, the data presented in Figure 5 demonstrate a reduction in CLN5 protein levels in both MEFs and human fibroblasts following MLi-2 treatment, confirming the specificity of the observed effects in LRRK2 mutant cells.

      (12) "Our data suggest that BMP is exocytosed in association with EVs and that LRRK2 and GCase activities modulate BMP secretion." Again, cells carrying the R1441G mutation have the same amount of BMP in EVs as WT. This sentence is not factually accurate. Likewise, CBE did not change the amount of BMP in EVs.

      We thank the reviewer for this observation and agree that, in our MEF model, the amount of BMP in EVs from R1441G LRRK2 cells is comparable to that observed in WT cells. However, pharmacological modulation supports our conclusion that BMP release is modulated by LRRK2 activity. Specifically, treatment with the LRRK2 inhibitor MLi-2 decreased EV-associated BMP levels in R1441G LRRK2 MEFs, and our new data (new Figure 7G-I; Videos 1 and 2) show increased exocytosis of CD63-pHluorin–positive endolysosomes in G2019S LRRK2 human fibroblasts compared to controls, an effect that was reversed by MLi-2 treatment. The CD63-pHluorin–positive compartment of these cells was also largely positive for BMP (new Figure 7G). These findings further support the regulatory role of LRRK2 activity in EV-mediated BMP secretion. In addition, in light of the reviewer’s comment about CBE treatment, we have softened our conclusions throughout the paper concerning the contribution of GCase activity in this model.

      (13) Figure 6; EV release should have been monitored by more accurate markers such as CD63 and CD81.

      We thank the reviewer for this comment. We and others (Kowal et al., 2016; Lu et al., 2018; Mathieu et al., 2021; Ferreira et al., 2022) have reported enrichment of Flotillin-1 and LAMP proteins in isolated small EV fractions. In particular, one of these studies (Mathieu et al., Nat Commun. 2021), in which bafilomycin A1 was also used (to boost exosome release), reported that “LAMP1-positive subpopulations of EVs represent MVB/lysosome-derived exosomes, which also contain syntenin-1.” Altogether, our choice of EV markers (LAMP2 and Flotillin-1) is consistent with those previously and accurately used to characterize EVs. We have now included all relevant references in the revised manuscript to further clarify this point.

      (14) Figure 6 suggests that exosomal BMP is controlled by EV release. I would think that is rather obvious.

      We agree that the finding that exosomal BMP release is influenced by EV secretion may appear “obvious.” However, our intention in Figure 6 was to provide direct experimental evidence confirming this relationship using pharmacological modulators of EV release. Specifically, inhibition of EV secretion with GW4869 reduced exosomal BMP levels, whereas stimulation with bafilomycin A1 increased them. These data were important to establish a causal link between EV trafficking and BMP export, thereby validating our model and supporting the interpretation that LRRK2 regulates BMP homeostasis through EV-mediated exocytosis, which is further modulated, to some extent, by GCase activity. 

      Minor concerns:

      (1) Figure 1: Change colors to be color blind friendly.

      We thank the reviewer for this helpful suggestion. We have adjusted the colors in Figure 1 to be color-blind friendly. In addition, we have applied the same color-blind friendly palette to the new immunofluorescence data presented in new Figure 7, Panels A and D.

      (2) More consistency on "Xmin" vs "X min" would be appreciated.

      We thank the reviewer for this observation. We have revised the manuscript to ensure consistent formatting of time indications throughout the text and figures, using the standardized format “X min.”

      Reviewer #2 (Recommendations for the authors):

      (1) Figure 2C-D. Were equal amounts of protein loaded in each lane?

      Equal protein amounts were loaded in lanes corresponding to whole-cell lysate (WCL) fractions and normalized based on α-Tubulin levels.

      For the extracellular vesicle (EV) fractions, all protein recovered from EV pellets after isolation was loaded. In all EV-related experiments, we seeded the same number of EV-producing cells per condition, and the resulting EV-derived data (from both immunoblotting and lipidomics analyses) were normalized to the corresponding WCL protein content to ensure comparability across conditions.

      All these technical details have been included in the Materials section of our revised manuscript.

      (2) The authors refer to the papers of Medoh et al (ref 43) and Singh et al. (44) for the key role of CLN5 in the BMP biosynthetic pathway. However, Medoh et al reported that CLN5 is the lysosomal BMP synthase. In contrast, Singh et al. reported that PLD3 and PLD4 mediate the synthesis of SS-BMP, and did not find any role for CLN5. 

      To avoid any confusion or misinterpretation of our findings regarding CLN5 and given that we do not analyze PLD3 or PLD4 in our study, we have decided to replace the reference to Singh et al. with Bulfon D. et al. (Nat. Commun. 2024, 15:9937) instead. This last work, conducted by an independent group distinct from the one that originally described CLN5, also validated CLN5 as the sole BMP synthase in cells.

      Also, authors mention that bafilomycin A1 (B-A1) dramatically boosts EV exocytosis, referring to Kowal et al., 2016 (ref 35) and Lu et al., 2018 (ref 45). However, this is not shown in Kowal et al.

      We thank the reviewer for pointing out this mistake. We apologize for the incorrect citation and have now corrected the reference. The statement regarding the effect of bafilomycin A1 on EV exocytosis now appropriately refers to Mathieu et al., 2021 and Lu et al., 2018.

      (3) Page 7, it is stated that "No statistically significant differences in intracellular BMP levels were observed in WT LRRK2 MEFs upon LRRK2 or GCase inhibition (Supplemental Figure 1D, E)". The authors probably mean "Supplemental Figure 1F, G".

      We thank the reviewer for noting this error. We have corrected the text to refer to panels F and G of Supplemental Figure 1, which correspond to the relevant data. We have also revised the reference to panel I of Supplemental Figure 1 accordingly.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) I have to admit that it took a few hours of intense work to understand this paper and to even figure out where the authors were coming from. The problem setting, nomenclature, and simulation methods presented in this paper do not conform to the notation common in the field, are often contradictory, and are usually hard to understand. Most importantly, the problem that the paper is trying to solve seems to me to be quite specific to the particular memory study in question, and is very different from the normal setting of model-comparative RSA that I (and I think other readers) may be more familiar with.

      We have revised the paper for clarity at all levels: motivation, application, and parameterization. We clarify that there is a large unmet need for using RSA in a trial-wise manner, and that this approach offers benefits to any team interested in decoding trial-wise representational information linked to behavioral responses. As such, it is not a problem specific to a single memory study.

      (2) The definition of "classical RSA" that the authors are using is very narrow. The group around Niko Kriegeskorte has developed RSA over the last 10 years, addressing many of the perceived limitations of the technique. For example, cross-validated distance measures (Walther et al. 2016; Nili et al. 2014; Diedrichsen et al. 2021) effectively deal with an uneven number of trials per condition and unequal amounts of measurement noise across trials. Different RDM comparators (Diedrichsen et al. 2021) and statistical methods for generalization across stimuli (Schütt et al. 2023) have been developed, addressing shortcomings in sensitivity. Finally, both a Bayesian variant of RSA (Pattern component modelling, (Diedrichsen, Yokoi, and Arbuckle 2018) and an encoding model (Naselaris et al. 2011) can effectively deal with continuous variables or features across time points or trials in a framework that is very related to RSA (Diedrichsen and Kriegeskorte 2017). The author may not consider these newer developments to be classical, but they are in common use and certainly provide the solution to the problems raised in this paper in the setting of model-comparative RSA in which there is more than one repetition per stimulus.

      We appreciate the summary of relevant literature and have revised the Introduction to address this bounty of relevant work. While much is owed to these authors, new developments from a diverse array of researchers outside of a single group can aid in new research questions, and should always have a place in our research landscape. We owe much to the work of Kriegeskorte’s group, and in fact, Schütt et al., 2023 served as a very relevant touchpoint in the Discussion and helped to highlight specific needs not addressed by the assessment of the “representational geometry” of an entire presented stimulus set. Principal amongst these needs is the application of trial-wise representational information that can be related to trial-wise behavioral responses and thus used to address specific questions on brain-behavior relationships. We invite the Reviewer to consider the utility of this shift with the following revisions to the Introduction.

      Page 3. “Recently, methodological advancements have addressed many known limitations in cRSA. For example, cross-validated distance measures (e.g., Euclidean distance) have improved the reliability of representational dissimilarities in the presence of noise and trial imbalance (Walther et al., 2016; Nili et al., 2014; Diedrichsen et al., 2021). Bayesian approaches such as pattern component modeling (Diedrichsen, Yokoi, & Arbuckle, 2018) have extended representational approaches to accommodate continuous stimulus features or temporal variation. Further, model comparison RSA strategies (Diedrichsen et al., 2021) and generalization techniques across stimuli (Schütt et al., 2023) have improved sensitivity and inference. Nevertheless, a common feature shared across most of these improvements is that they require stimulus repetition to examine the representational structure. This requirement limits their ability to probe brain-behavior questions at the level of individual events”.

      Page 8. “While several extensions of RSA have addressed key limitations in noise sensitivity, stimulus variance, and modeling (e.g., Diedrichsen et al., 2021; Schütt et al., 2023), our tRSA approach introduces a new methodological step by estimating representational strength at the trial level. This accounts for the multi-level variance structure in the data, affords generalizability beyond the fixed stimulus set, and allows one to test stimulus- or trial-level modulations of neural representations in a straightforward way”.

      Page 44. “Despite such prevalent appreciation for the neurocognitive relevance of stimulus properties, cRSA often does not account for the fact that the same stimulus (e.g., “basketball”) is seen by multiple subjects and produces statistically dependent data, an issue addressed by Schütt et al., 2023, who developed cross validation and bootstrap methods that explicitly model dependence across both subjects and stimulus conditions”.

      (3) The stated problem of the paper is to estimate "representational strength" in different regions or conditions. By this, the authors mean the correlation of the brain RDM with a model RDM. This metric conflates a number of factors, namely the variances of the stimulus-specific patterns, the variance of the noise, the true differences between different dissimilarities, and the match between the assumed model and the data-generating model. It took me a long time to figure out that the authors are trying to solve a quite different problem in a quite different setting from the model-comparative approach to RSA that I would consider "classical" (Diedrichsen et al. 2021; Diedrichsen and Kriegeskorte 2017). In this approach, one is trying to test whether local activity patterns are better explained by representation model A or model B, and to estimate the degree to which the representation can be fully explained. In this framework, it is common practice to measure each stimulus at least 2 times, to be able to estimate the variance of noise patterns and the variance of signal patterns directly. Using this setting, I would define "representational strength" very differently from the authors. Assume (using LaTeX notation) that the activity patterns $y_{j,n}$ for stimulus j, measurement n, are composed of a true stimulus-related pattern ($u_j$) and a trial-specific noise pattern ($e_{j,n}$). As a measure of the strength of representation (or pattern), I would use an unbiased estimate of the variance of the true stimulus-specific patterns across voxels and stimuli ($\sigma^2_{u}$). This estimator can be obtained by correlating patterns of the same stimuli across repeated measures, or equivalently, by averaging the cross-validated Euclidean distances (or with spatial prewhitening, Mahalanobis distances) across all stimulus pairs.
      In contrast, the current paper addresses a specific problem in a quite specific experimental design in which there is only one repetition per stimulus. This means that the authors have no direct way of distinguishing true stimulus patterns from noise processes. The trick that the authors apply here is to assume that the brain data comes from the assumed model RDM (a somewhat sketchy assumption IMO) and that everything that reduces this correlation must be measurement noise. I can now see why tRSA does make some sense for this particular question in this memory study. However, in the more common model-comparative RSA setting, having only one repetition per stimulus in the experiment would be quite a fatal design flaw. Thus, the paper would do better if the authors could spell out the specific problem addressed by their method right at the beginning, rather than trying to set up tRSA as a general alternative to "classical RSA".
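
      The estimator described in this comment can be written out explicitly. The following is a sketch under added assumptions that are not stated in the review: two measurements per stimulus, $J$ stimuli, $P$ voxels, zero-mean patterns, noise independent of the signal and across measurements, and independent signal patterns across stimuli for the last identity.

```latex
% Generative assumption: measured pattern = true pattern + trial-specific noise
y_{j,n} = u_j + e_{j,n}, \qquad n \in \{1, 2\}

% Cross-measurement product: the noise terms drop out in expectation
\hat{\sigma}^2_u = \frac{1}{JP} \sum_{j=1}^{J} y_{j,1}^{\top} y_{j,2},
\qquad
\mathbb{E}\!\left[\hat{\sigma}^2_u\right]
  = \frac{1}{JP} \sum_{j=1}^{J} \mathbb{E}\!\left[u_j^{\top} u_j\right]
  = \sigma^2_u

% Cross-validated squared Euclidean distance between stimuli j and k
\hat{d}^{\,2}_{jk} = \frac{1}{P} \left(y_{j,1} - y_{k,1}\right)^{\top} \left(y_{j,2} - y_{k,2}\right),
\qquad
\mathbb{E}\!\left[\hat{d}^{\,2}_{jk}\right] = 2\,\sigma^2_u
```

      Under these assumptions the averaged cross-validated distance carries the same information as the cross-measurement product, up to the factor of 2. With a single measurement per stimulus neither quantity can be formed, which is the design limitation the reviewer points to.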

      At a general level, our approach rests on the premise that there is meaningful information present in a single presentation of a given stimulus. This assumption may have less utility when the research goals are more focused on estimating the fidelity of signal patterns for RSA, as in designs with multiple repetitions. But it is an exaggeration to state that such a trial-wise approach cannot address the difference between “true” stimulus patterns and noise. This trial-wise approach has explicit utility in relating trial-wise brain information to trial-wise behavior, across multiple cognitive domains (not only memory studies, as applied here). We have added substantial text to the Introduction distinguishing cRSA, which is widely employed, often in cases with a single repetition per stimulus, from model-comparative methods that employ multiple repetitions. We clarify that we do not consider tRSA an alternative to the model-comparative approach, and discuss that operational definitions of representational strength are constrained by the study design.

      Page 3. “In this paper, we present an advancement termed trial-level RSA, or tRSA, which addresses these limitations in cRSA (not model comparison approaches) and may be utilized in paradigms with or without repeated stimuli”.

      Page 4. “Representational geometry usually refers to the structure of similarities among repeated presentations of the same stimulus in the neural data (as captured in the brain RSM) and is often estimated utilizing a model comparison approach, whereas representational strength is a derived measure that quantifies how strongly this geometry aligns with a hypothesized model RSM. In other words, geometry characterizes the pattern space itself, while representational strength reflects the degree of correspondence between that space and the theoretical model under test”.

      Finally, we clarified that in our simulation methods we assume a true underlying activity pattern and a random error pattern. The model RSM is computed based on the true pattern, whereas the brain RSM comes from the noisy pattern, not the model RSM itself.

      Page 9. “Then, we generated two sets of noise patterns, which were controlled by parameters $\sigma_A$ and $\sigma_B$, respectively, one for each condition”.

      (4) The notation in the paper is often conflicting and should be clarified. The actual true and measured activity patterns should receive a unique notation that is distinct from the variances of these patterns across voxels. I assume that $\sigma_{ijk}$ denotes the noise variances (not standard deviations)? Normally, variances are denoted with $\sigma^2$. Also, if these are variances, they cannot come from a normal distribution as indicated on page 10. Finally, multi-level models are usually defined at the level of means (i.e., patterns) rather than at the level of variances (as seems to be done here).

      We have added notation for true and measured activity patterns to differentiate them from our notation for variance. We agree that multilevel models are usually defined at the level of means rather than at the level of variances, and we include a figure (Fig 1D) that describes the model in terms of the means. We clarify that the σ ($\sigma$) terms used in the manuscript were not variances or standard deviations themselves; rather, they were meant to denote components of the actual (multilevel) variance parameter. Each component was sampled from a normal distribution, and together they summed to the final variance parameter for each trial. We have changed our notation for each component to the lowercase letter s to minimize confusion. We have also made our R code publicly available on our lab GitHub, which should provide more clarity on the exact simulation process.

      (5) In the first set of simulations, the authors sampled both model and brain RSM by drawing each cell (similarity) of the matrix from an independent bivariate normal distribution. As the authors note themselves, this way of producing RSMs violates the constraint that correlation matrices need to be positive semi-definite. Likely more seriously, it also ignores the fact that the different elements of the upper triangular part of a correlation matrix are not independent from each other (Diedrichsen et al. 2021). Therefore, it is not clear that this simulation is close enough to reality to provide any valuable insight and should be removed from the paper, along with the extensive discussion about why this simulation setting is plainly wrong (page 21). This would shorten and clarify the paper.

      We have added justification of the mixed-effects model given the potential assumption violations. We caution readers to investigate the robustness of their models, and to employ permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the Appendix. Finally, we agree that the first simulation setting does not possess several properties of realistic RDMs/RSMs; however, we believe that there is utility in understanding the mathematical properties of correlations (an essential component of RSA) in a straightforward simulation where the ground truth is known. We have therefore moved this simulation to Appendix 1.
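
      The reviewer's point that independently drawn similarities need not form a valid correlation matrix can be seen in a three-stimulus case. The sketch below is illustrative only (the values and function names are hypothetical and not taken from the manuscript's simulations). It relies on two standard facts: a symmetric matrix with a negative determinant has at least one negative eigenvalue, and for a 3x3 correlation matrix the third correlation is bounded once the other two are fixed.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def feasible_range(r12, r13):
    """Interval of r23 values that keeps a 3x3 correlation matrix positive semi-definite."""
    half_width = math.sqrt((1 - r12 ** 2) * (1 - r13 ** 2))
    return r12 * r13 - half_width, r12 * r13 + half_width


r12, r13, r23 = 0.9, 0.9, -0.9  # each value is a valid correlation on its own
matrix = [[1.0, r12, r13], [r12, 1.0, r23], [r13, r23, 1.0]]
det = det3(matrix)
low, high = feasible_range(r12, r13)
print(f"determinant = {det:.3f}")  # negative, so the matrix is not positive semi-definite
print(f"given r12 and r13, r23 must lie in [{low:.2f}, {high:.2f}]")
```

      The feasible interval makes the dependence among cells concrete: once two stimuli are each highly similar to a third, they cannot be strongly dissimilar to each other.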

      (6) If I understand the second simulation setting correctly, the true pattern for each stimulus was generated as an NxP matrix of i.i.d. standard normal variables. Thus, there is no condition-specific pattern at all, only condition-specific noise/signal variances. It is not clear how the tRSA would be biased if there were a condition-specific pattern (which, in reality, there usually is). Because of the i.i.d. assumption of the true signal, the correlations between all stimulus pairs within conditions are close to zero (and only differ from it by the fact that you are using a finite number of voxels). If you added a condition-specific pattern, the across-condition RSA would lead to much higher "representational strength" estimates than a within-condition RSA, with obvious problems and biases.

      The Reviewer is correct that the voxel values in the true pattern are drawn from i.i.d. standard normal distributions. We take the Reviewer’s suggestion of a “condition-specific pattern” to mean that there could be a condition-voxel interaction in two non-mutually exclusive ways. The first is additive: some common underlying multi-voxel pattern, such as [6, 34, -52, …, 8], for all condition A trials, a different such pattern for condition B trials, and so on. The second is multiplicative: a vector of scaling factors, such as [×1.5, ×0.5, ×0.8, …, ×2.7], for all condition A trials, a different such vector for condition B trials, and so on. Both possibilities could indeed affect tRSA as much as they would cRSA.

      Importantly, if such a strong condition-specific pattern is expected, one can build a condition-specific model RDM using one-hot coding of conditions (see example figure; src: https://www.newbi4fmri.com/tutorial-9-mvpa-rsa), either to capture this interesting phenomenon or to remove it as a confounding factor. This practice has been applied in multiple regression cRSA approaches (e.g., Cichy et al., 2013) and can also be applied to tRSA.
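
      A condition-specific model RDM of this kind can be sketched in a few lines. This is a minimal illustration, assuming each trial is coded by an indicator (one-hot) vector for its condition; the function names and trial labels are hypothetical and do not come from the manuscript's code.

```python
def one_hot(labels):
    """Encode each trial's condition as a 0/1 indicator vector."""
    conditions = sorted(set(labels))
    return [[1 if label == c else 0 for c in conditions] for label in labels]


def condition_model_rdm(labels):
    """Dissimilarity is 0 for same-condition trial pairs and 1 otherwise."""
    codes = one_hot(labels)
    n = len(codes)
    # The dot product of two one-hot vectors is 1 only when the conditions match.
    return [[1 - sum(a * b for a, b in zip(codes[i], codes[j])) for j in range(n)]
            for i in range(n)]


trial_conditions = ["A", "A", "B", "B", "A"]  # hypothetical trial labels
for row in condition_model_rdm(trial_conditions):
    print(row)
```

      Entered as an additional predictor alongside the model of interest, such a matrix absorbs similarity that is explained by condition membership alone.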

      (7) The trial-level brain RDM to model Spearman correlations was analyzed using a mixed-effects model. However, given the symmetry of the RDM, the correlations coming from different rows of the matrix are not independent, and independence is an assumption of the mixed-effects model. This does not seem to induce an increase in Type I errors in the conditions studied, but no clear justification is given for this procedure, and one is needed.

      We appreciate this important warning, and now caution readers to investigate the robustness of their models, and consider employing permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the supplement.

      Page 46. “While linear mixed-effects modeling offers a powerful framework for analyzing representational similarity data, it is critical that researchers carefully construct and validate their models. The multilevel structure of RSA data introduces potential dependencies across subjects, stimuli, and trials, which can violate assumptions of independence if not properly modeled. In the present study, we used a model that included random intercepts for both subjects and stimuli, which accounts for variance at these levels and improves the generalizability of fixed-effect estimates. Still, there is a potential for systematic dependence across trials within a subject. To ensure that the model assumptions were satisfied, we conducted a series of diagnostic checks on an exemplar ROI (right LOC; middle occipital gyrus) in the Object Perception dataset, including visual inspection of residual distributions and autocorrelation (Appendix 3, Figure 13). These diagnostics supported the assumptions of normality, homoscedasticity, and conditional independence of residuals. In addition, we conducted permutation-based inference, similar to prior improvements to cRSA (Nili et al. 2014), using a nested model comparison to test whether the mean similarity in this ROI was significantly greater than zero. The observed likelihood ratio test statistic fell in the extreme tail of the null distribution (Appendix 3, Figure 14), providing strong nonparametric evidence for the reliability of the observed effect. We emphasize that this type of model checking and permutation testing is not merely confirmatory but can help validate key assumptions in RSA modeling, especially when applying mixed-effects models to neural similarity data. Researchers are encouraged to adopt similar procedures to ensure the robustness and interpretability of their findings”.

      Exemplar Permutation Testing

      To test whether the mean representational strength in the ROI right LOC (middle occipital gyrus) was significantly greater than zero, we used a permutation-based likelihood ratio test implemented via the permlmer function. This test compares two nested linear mixed-effects models fit using the lmer function from the lme4 package, both including random intercepts for Participant and Stimulus ID to account for between-subject and between-item variability.

      The null model excluded a fixed intercept term, effectively constraining the mean similarity to zero after accounting for random effects:

      ROI ~ 0 + (1 | Participant) + (1 | Stimulus)

      The full model included the same random effects structure but allowed the intercept to be freely estimated:

      ROI ~ 1 + (1 | Participant) + (1 | Stimulus)

      By comparing the fit of these two models, we directly tested whether the average similarity in this ROI was significantly different from zero. Permutation testing (1,000 permutations) was used to generate a nonparametric p-value, providing inference without relying on normality assumptions. The full model, which estimated a nonzero mean similarity in the right LOC (middle occipital gyrus), showed a significantly better fit to the data than the null model that fixed the mean at zero (χ²(1) = 17.60, p = 2.72 × 10⁻⁵). The permutation-based p-value obtained from permlmer confirmed this effect as statistically significant (p = 0.0099), indicating that the mean similarity in this ROI was reliably greater than zero. These results support the conclusion that the right LOC contains representational structure consistent with the HMAXc2 RSM. The density of the permuted likelihood ratio statistics is plotted together with the observed statistic in Appendix 3, Figure 14.
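
      The logic of permutation-based inference on a mean can be illustrated with a much simpler analogue. The sketch below is not the permlmer procedure and contains no random effects: it is a one-sided sign-flip test on hypothetical per-subject similarity values, in which the function name, the data, and the number of permutations are all assumptions made for illustration.

```python
import random


def sign_flip_p_value(values, n_perm=999, seed=0):
    """One-sided permutation p-value for the hypothesis that the mean exceeds zero."""
    rng = random.Random(seed)
    observed = sum(values) / len(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        # Under a null that is symmetric around zero, each value's sign is exchangeable.
        flipped = [v if rng.random() < 0.5 else -v for v in values]
        if sum(flipped) / len(flipped) >= observed:
            at_least_as_large += 1
    # Count the observed statistic as one permutation so that p is never exactly zero.
    return (at_least_as_large + 1) / (n_perm + 1)


# Hypothetical per-subject mean similarity values, for illustration only.
similarity = [0.20, 0.30, 0.25, 0.40, 0.10, 0.35, 0.30, 0.20, 0.15, 0.28]
print(sign_flip_p_value(similarity))
```

      As in the analysis described above, the observed statistic is compared with a null distribution built from the data themselves, so no normality assumption is needed for the p-value.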

      (8) For the empirical data, it is not clear to me to what degree the "representational strength" of cRSA and tRSA is actually comparable. In cRSA, the Spearman correlation assesses whether the distances in the data RSM are ranked in the same order as in the model. For tRSA, the comparison is made for every row of the RSM, which introduces a larger degree of flexibility (possibly explaining the higher correlations in the first simulation). Thus, could the gains presented in Figure 7D not simply arise from the fact that you are testing different questions? A clearer theoretical analysis of the difference between the average row-wise Spearman correlation and the matrix-wise Spearman correlation is urgently needed. The behavior will likely vary with the structure of the true model RDM/RSM.

      We agree that a comparison between the mean row-wise Spearman correlation and the matrix-wise Spearman correlation is needed. We believe that simulations are the best approach for this comparison, since they are much more robust than the empirical dataset and have the advantage that the true pattern and noise levels are known. We expand on our comparison of mean tRSA values and matrix-wise Spearman correlations on page 42.

      Page 42. “Although tRSA and cRSA both aim to quantify representational strength, they differ in how they operationalize this concept. cRSA summarizes the correspondence between RSMs as a single measure, such as the matrix-wise Spearman correlation. In contrast, tRSA computes such correspondence for each trial, enabling estimates at the level of individual observations. This flexibility allows trial-level variability to be modeled directly, but also introduces subtle differences in what is being measured. Nonetheless, our simulations showed that, although numerical differences occasionally emerged—particularly when comparing between-condition tRSA estimates to within-condition cRSA estimates—the magnitude of divergence was small and did not affect the outcome of downstream statistical tests”.
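
      To make the distinction concrete, the sketch below computes both quantities on toy matrices in Python: one matrix-wise Spearman correlation over the upper triangle, and one Spearman correlation per row. It is an illustration under stated assumptions, not the authors' implementation: the diagonal is excluded from each row, ties receive average ranks, and the `brain` and `model` matrices are hypothetical.

```python
from statistics import fmean


def ranks(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    mx, my = fmean(x), fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def matrix_wise(brain, model):
    """Single summary: Spearman correlation over the upper triangle."""
    n = len(brain)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return spearman([brain[i][j] for i, j in pairs],
                    [model[i][j] for i, j in pairs])


def row_wise(brain, model):
    """One Spearman correlation per row, excluding the diagonal entry."""
    n = len(brain)
    result = []
    for i in range(n):
        cols = [j for j in range(n) if j != i]
        result.append(spearman([brain[i][j] for j in cols],
                               [model[i][j] for j in cols]))
    return result


# Hypothetical 4 x 4 similarity matrices.
model = [[1.0, 0.8, 0.2, 0.1], [0.8, 1.0, 0.3, 0.2],
         [0.2, 0.3, 1.0, 0.7], [0.1, 0.2, 0.7, 1.0]]
brain = [[1.0, 0.5, 0.1, 0.3], [0.5, 1.0, 0.2, 0.0],
         [0.1, 0.2, 1.0, 0.4], [0.3, 0.0, 0.4, 1.0]]
print(matrix_wise(brain, model), fmean(row_wise(brain, model)))
```

      Because ranks are computed within each row in the row-wise case, the mean of the row-wise values need not equal the matrix-wise value, which is the numerical difference the reviewer asks about.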

      (9) For the real data, there are a number of additional sources of bias that need to be considered for the analysis. What if there are not only condition-specific differences in noise variance, but also a condition-specific pattern? Given that the stimuli were measured in 3 different imaging runs, you cannot assume that all measurement noise is i.i.d. - stimuli from the same run will likely have a higher correlation with each other.

      We recognize the potential of condition-specific patterns and chose to constrain the analyses to those most comparable with cRSA. However, depending on their hypotheses, researchers may consider testing condition RSMs and utilizing a model comparison approach or employ the z-scored approach, as employed in the simulations above. Regarding the potential run confounds, this is always the case in RSA and why we exclude within-run comparisons. We have also added to the Discussion the suggestion to include run as a covariate in their mixed-effects models. However, we do not employ this covariate here as we preferred the most parsimonious model to compare with cRSA.

      Page 46 - 47. “Further, while analyses here were largely employed to be comparable with cRSA, researchers should consider taking advantage of the flexibility of the mixed-effects models and include covariates of non-interest (run, trial order, etc.)”.

      (10) The discussion should be rewritten in light of the fact that the setting considered here is very different from the model-comparative RSA in which one usually has multiple measurements per stimulus per subject. In this setting, existing approaches such as RSA or PCM do indeed allow for the full modelling of differences in the "representational strength" - i.e., pattern variance across subjects, conditions, and stimuli.

      We agree that studies advancing designs with multiple repetitions of a given stimulus image are useful in estimating the reliability of concept representations. We would argue however that model comparison in RSA is not restricted to such data. Many extant studies do not in fact have multiple repetitions per stimulus per subject (Wang et al., 2018 https://doi.org/10.1088/1741-2552/abecc3, Gao et al, 2022 https://doi.org/10.1093/cercor/bhac058, Li et al, 2022 https://doi.org/10.1002/hbm.26195, Staples & Graves, 2020 https://doi.org/10.1162/nol_a_00018) that allow for that type of model-comparative approach. While beneficial in terms of noise estimation, having multiple presentations was not a requirement for implementing cRSA (Kriegeskorte, 2008 https://doi.org/10.3389/neuro.06.004.2008). The aim of this manuscript is to introduce the tRSA approach to the broad community of researchers whose research questions and datasets could vary vastly, including but not limited to the number of repeated presentations and the balance of trial counts across conditions.

      (11) Cross-validated distances provide a powerful tool to control for differences in measurement noise variances and possible covariances in measurement noise across trials, which has many distinct advantages and is conceptually very different from the approach taken here.

      We have added language on the value of cross-validation approaches to RSA in the Discussion:

      Page 47. “Additionally, we note that while our proposed tRSA framework provides a flexible and statistically principled approach for modeling trial-level representational strength, we acknowledge that there are alternative methods for addressing trial-level variability in RSA. In particular, the use of cross-validated distance metrics (e.g., crossnobis distance) has become increasingly popular for controlling differences in measurement noise variance and accounting for possible covariance structures across trials (Walther et al., 2016). These metrics offer several advantages, including unbiased estimation of representational dissimilarities under Gaussian noise assumptions and improved generalization to unseen data. However, cross-validated distances are conceptually distinct from the approach taken here: whereas cross-validation aims to correct for noise-related biases in representational dissimilarity matrices, our trial-level RSA method focuses on estimating and modeling the variability in representation strength across individual trials using mixed-effects modeling. Rather than proposing a replacement for cross-validated RSA, tRSA adds a complementary tool to the methodological toolkit—one that supports hypothesis-driven inference about condition effects and trial-level covariates, while leveraging the full structure of the data”.

      (12) One of the main limitations of tRSA is the assumption that the model RDM is actually the true brain RDM, which may not be the case. Thus, in theory, there could be a different model RDM, in which representational strength measures would be very different. These differences should be explained more fully, hopefully leading to a more accessible paper.

      Indeed, the chosen model RSM may not be the true RSM, but as the noise level increases the correlation between RSMs practically becomes zero. In our simulations we assume this to be true as a straightforward way to manipulate the correspondence between the brain data and the model. However, just like cRSA, tRSA is constrained by the model selections the researchers employ. We encourage researchers to have carefully considered theoretically-motivated models and, if their research questions require, consider multiple and potentially competing models. Furthermore, the trial-wise estimates produced by tRSA encourage testing competing models within the multiple regression framework. We have added this language to the Discussion.

      Page 46. “…choose their model RSMs carefully. In our simulations, we designed our model RSM to be the “true” RSM for demonstration purposes. However, researchers should consider if their models and model alternatives”.

      Pages 45-46. “While a number of studies have addressed the validity of measuring representational geometry using designs with multiple repetitions, a conceptual benefit of the tRSA approach is the reliance on a regression framework that engenders the testing of competing conceptual models of stimulus representation (e.g., taxonomic vs. encyclopedic semantic features, as in Davis et al., 2021)”.

      Reviewer #2 (Public review):

      (1) While I generally welcome the contribution, I take some issue with the accusatory tone of the manuscript in the Introduction. The text there (using words such as 'ignored variances', 'erroneous inferences', 'one must', 'not well-suited', 'misleading') appears aimed at turning cRSA into a 'straw man' with many limitations that other researchers have not recognized but that the new proposed method supposedly resolves. This can be written in a more nuanced, constructive manner without accusing the numerous users of this popular method of ignorance.

      We apologize for the unintended accusatory tone. We have clarified the many robust approaches to RSA and have made our Introduction and Discussion more nuanced throughout (see also 3, 11 and 16).

      (2) The described limitations are also not entirely correct, in my view: for example, statistical inference in cRSA is not always done using classic parametric statistics such as t-tests (cf Figure 1): the rsatoolbox paper by Nili et al. (2014) outlines non-parametric alternatives based on permutation tests, bootstrapping and sign tests, which are commonly used in the field. Nor has RSA ever been conducted at the row/column level (here referred to by the authors as 'trial level'; cf King et al., 2018).

      We agree there are numerous methods that go beyond cRSA in addressing these limitations, and we have added discussion of them to our manuscript, as well as an example analysis implementing permutation tests on tRSA data (see response to 7). We thank the reviewer for bringing King et al., 2014 and their temporal generalization method to our attention; we have added a reference acknowledging their decoding-based temporal generalization approach.

      Page 8. “It is also important to note that some prior work has examined similarly fine-grained representations in time-resolved neuroimaging data, such as the temporal generalization method introduced by King et al. (see King & Dehaene, 2014). Their approach trains classifiers at each time point and tests them across all others, resulting in a temporal generalization matrix that reflects decoding accuracy over time. While such matrices share some structural similarity with RSMs, they do not involve correlating trial-level pattern vectors with model RSMs nor do their second-level models include trial-wise, subject-wise, and item-wise variability simultaneously”.

      (3) One of the advantages of cRSA is its simplicity. Adding linear mixed effects modeling to RSA introduces a host of additional 'analysis parameters' pertaining to the choice of the model setup (random effects, fixed effects, interactions, what error terms to use) - how should future users of tRSA navigate this?

      We appreciate the opportunity to offer more specific prescriptions for those employing a tRSA technique, and have added them to the Discussion:

      Page 46. “While linear mixed-effects modeling offers a powerful framework for analyzing representational similarity data, it is critical that researchers carefully construct and validate their models and choose their model RSMs carefully. In our simulations, we designed our model RSM to be the “true” RSM for demonstration purposes. However, researchers should consider if their models and model alternatives. However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question”.

      (4) Here, only a single real fMRI dataset is used with a quite complicated experimental design for the memory part; it's not clear if there is any benefit of using tRSA on a simpler real dataset. What's the benefit of tRSA in classic RSA datasets (e.g., Kriegeskorte et al., 2008), with fixed stimulus conditions and no behavior?

      To clarify, our empirical approach uses two different tasks: an Object Perception task more akin to the classic RSA datasets employing passive viewing, and a Conceptual Retrieval task that more directly addresses the benefits of the trial-wise approach. We consider our Object Perception dataset a simpler empirical fMRI dataset without explicit task conditions or a dichotomous behavioral outcome, whereas the Retrieval dataset is more involved (though old/new recognition is the most common form of memory retrieval testing) and dependent on behavioral outcomes. However, we recognize the utility of replication from other research groups and do invite researchers to utilize tRSA on their datasets.

      (5) The cells of an RDM/RSM reflect pairwise comparisons between response patterns (typically a brain but can be any system; cf Sucholutsky et al., 2023). Because the response patterns are repeatedly compared, the cells of this matrix are not independent of one another. Does this raise issues with the validity of the linear mixed effects model? Does it assume the observations are linearly independent?

      We recognize the potential danger of not meeting model assumptions. Though our simulation results and model checks suggest this is not a fatal flaw in the model design, we caution readers to investigate the robustness of their models, and consider employing permutation testing that does not make independence assumptions. We have also added checks of the model residuals and an example of permutation testing in the Appendix. See response to R1.

      (6) The manuscript assumes the reader is familiar with technical statistical terms such as Type I/II error, sensitivity, specificity, homoscedasticity assumptions, as well as linear mixed models (fixed effects, random effects, etc). I am concerned that this jargon makes the paper difficult to understand for a broad readership or even researchers currently using cRSA that might be interested in trying tRSA.

      We agree this jargon may make the paper difficult to understand. We have expanded or added definitions of these terms throughout the Methods and Results sections.

      Page 12. “Given data generated with 𝑠<sub>𝑐𝑜𝑛𝑑,𝐴</sub> = 𝑠<sub>𝑐𝑜𝑛𝑑,B</sub>, the correct inference should be a failure to reject the null hypothesis of b<sub>B-𝐴</sub> = 0; any significant (𝑝 < 0.05) result in either direction was considered a false positive (spurious effect, or Type I error). Given data generated with 𝑠<sub>𝑐𝑜𝑛𝑑,𝐴</sub> ≠ 𝑠<sub>𝑐𝑜𝑛𝑑,B</sub>, the inference was considered correct if it rejected the null hypothesis of b<sub>B-𝐴</sub> = 0 and yielded the expected sign of the estimated contrast (b<sub>B-𝐴</sub> < 0). A significant result with the reverse sign of the estimated contrast (b<sub>B-𝐴</sub> > 0) was considered a Type I error, and a nonsignificant (𝑝 ≥ 0.05) result was considered a false negative (failure to detect a true effect, or Type II error)”.

      Page 2. “Compared to cRSA, the multi-level framework of tRSA was both more theoretically appropriate and significantly more sensitive to (better able to detect) true effects”.

      Page 25. “The performance of cRSA and tRSA were quantified with their specificity (better avoids false positives, 1 - Type I error rate) and sensitivity (better avoids false negatives, 1 - Type II error rate)”.
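
      The scoring rules quoted above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' simulation code: the significance threshold of 0.05 and the negative expected sign of the B-minus-A contrast are taken from the quoted text, while the function names, the example (p, b) pairs, and the choice to compute the Type I rate on null simulations and the Type II rate on true-effect simulations are assumptions.

```python
ALPHA = 0.05  # significance threshold used in the quoted definitions


def classify(p, b, effect_present):
    """Label one simulated test; b is the estimated B-minus-A contrast."""
    significant = p < ALPHA
    if not effect_present:
        # Any significant result under the null is a false positive.
        return "type_I" if significant else "correct"
    if not significant:
        # A true effect was missed.
        return "type_II"
    # With a true effect, the expected sign of the contrast is negative;
    # a significant result with the reverse sign is scored as Type I.
    return "correct" if b < 0 else "type_I"


def rate(labels, kind):
    """Proportion of simulations carrying the given label."""
    return sum(label == kind for label in labels) / len(labels)


# Hypothetical (p, b) pairs from four null and four true-effect simulations.
null_sims = [(0.40, 0.02), (0.03, -0.10), (0.75, -0.01), (0.20, 0.05)]
effect_sims = [(0.001, -0.30), (0.20, -0.05), (0.01, -0.22), (0.04, -0.15)]

null_labels = [classify(p, b, False) for p, b in null_sims]
effect_labels = [classify(p, b, True) for p, b in effect_sims]

specificity = 1 - rate(null_labels, "type_I")     # 1 - Type I error rate
sensitivity = 1 - rate(effect_labels, "type_II")  # 1 - Type II error rate
print(specificity, sensitivity)
```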

      Page 6. “One of the fundamental assumptions of general linear models (step 4 of cRSA; see Figure 1D) is homoscedasticity or homogeneity of variance — that is, all residuals should have equal variance”.

      Page 11. “Specifically, a linear mixed-effects model with a fixed effect of condition (which estimates the average effect across the entire sample, capturing the overall effect of interest) and random effects of both subjects and stimuli (which model variation in responses due to differences between individual subjects and items, allowing generalization beyond the sample) was fitted to tRSA estimates via the `lme4 1.1-35.3` package in R (Bates et al., 2015), and p-values were estimated using Satterthwaite’s method via the `lmerTest 3.1-3` package (Kuznetsova et al., 2017)”.

      (7) I could not find any statement on data availability or code availability. Given that the manuscript reuses prior data and proposes a new method, making data and code/tutorials openly available would greatly enhance the potential impact and utility for the community.

      We thank the reviewer for raising our oversight here. We have added our code and data availability statements.

      Page 9. “Data are available upon request to the corresponding author and our simulations and example tRSA code are available at https://github.com/electricdinolab”.

      Reviewer #1 (Recommendations for the authors):

      (13) Page 4: The limitations of cRSA seem to be based on the assumption that within each different experimental condition, there are different stimuli, which get combined into the condition. The framework of RSA, however, does not dictate whether you calculate a condition x condition RDM or a larger and more complete stimulus x stimulus RDM. Indeed, in practice we often do the latter? Or are you assuming that each stimulus is only shown once overall? It would be useful at this point to spell out these implicit assumptions.

      We agree that stimulus x stimulus RDMs can be constructed and are often used. However, as we mentioned in the Introduction, researchers are often interested in the difference between two (or more) conditions, such as “remembered” vs. “forgotten” (Davis et al., https://doi.org/10.1093/cercor/bhaa269) or “high cognitive load” vs. “low cognitive load” (Beynel et al., https://doi.org/10.1523/JNEUROSCI.0531-20.2020). In those cases, the most common practice with cRSA is to construct condition-specific RDMs, compute cRSA scores separately for each condition, and then compare the scores at the group level. The number of times each stimulus gets presented does not prevent one from creating a model RDM that has the same rows and columns as the brain RDM, either in the same condition (“high load”) or across different conditions.

      (14) Page 5: The difference between condition-level and stimulus-level is not clear. Indeed, this definition seems to be a function of the exact experimental design and is certainly up for interpretation. For example, if I conduct a study looking at the activity patterns for 4 different hand actions, each repeated multiple times, are these actions considered stimuli or conditions?

      We have added clarifying language about what is considered stimuli vs conditions. Indeed, this will depend on the specific research questions being employed and will affect how researchers construct their models. In this specific example, one would most likely consider each different hand action a condition, treating them as fixed effects rather than random effects, given their very limited number and the lack of need to generalize findings to the broader “hand actions” category.

      Page 5. “Critically, the distinction between condition-level and stimulus-level is not always clear as researchers may manipulate stimulus-level features themselves. In these cases, what researchers ultimately consider condition-level and stimulus-level will depend on their specific research questions. For example, researchers intending to study generalized object representation may consider object category a stimulus-level feature, while researchers interested in if/how object representation varies by category may consider the same category variable condition-level”.

      (15) Page 5: The fact that different numbers of trials / different levels of measurement noise / noise-covariance of different conditions biases non-cross-validated distances is well known and repeatedly expressed in the literature. We have shown that cross-validation of distances effectively removes such biases - of course, it does not remove the increased estimation variability of these distances (for a formal analysis of estimation noise on condition patterns and variance of the cross-nobis estimator, see (Diedrichsen et al. 2021)).

      We thank the reviewer for drawing our attention to this literature and have added discussions of these methods.

      (16) Page 5: "Most studies present subjects with a fixed set of stimuli, which are supposedly samples representative of some broader category". This may be the case for a certain type of RSA experiments in the visual domain, but it would be unfair to say that this is a feature of RSA studies in general. In most studies I have been involved in, we use a "stimulus" x "stimulus" RDM.

      We have edited this sentence to avoid the “most” characterization. We also added substantial text to the Introduction and Discussion distinguishing cRSA, which is nonetheless widely employed, especially in cases with a single repetition per stimulus (Macklin et al., 2023; Liu et al., 2024), from the model-comparative method, and explicitly stating that we do not consider tRSA an alternative to the model-comparative approach.

      (17) Page 5: I agree that "stimuli" should ideally be considered a random effect if "stimuli" can be thought of as sampled from a larger population and one wants to make inferences about that larger population. Sometimes stimuli/conditions are more appropriately considered a fixed effect (for example, when studying the response to stimulation of the 5 fingers of the right hand). Techniques to consider stimuli/conditions as a random effect have been published by the group of Niko Kriegeskorte (Schütt et al. 2023).

      Indeed, in some cases what may be thought of as “stimuli” would be more appropriately entered into the model as a fixed effect; such questions are increasingly relevant given the focus on item-wise stimulus properties (Bainbridge et al., Westfall & Yarkoni). We have added text on this issue to the Discussion and caution researchers to employ models that most directly answer their research questions.

      Page 46. “However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question. An effect is fixed when the levels represent the specific conditions of theoretical interest (e.g., task condition) and the goal is to estimate and interpret those differences directly. In contrast, an effect is random when the levels are sampled from a broader population (e.g., subjects) and the goal is to account for their variability while generalizing beyond the sample tested. Note that the same variable (e.g., stimuli) may be considered fixed or random depending on the research questions”.

      (18) Page 6: It is correct that the "classical" RSA depends on a categorical assignment of different trials to different stimuli/conditions, such that a stimulus x stimulus RDM can be computed. However, both Pattern Component Modelling (PCM) and Encoding models are ideally set up to deal with variables that vary continuously on a trial-by-trial or moment-by-moment basis. tRSA should be compared to these approaches, or - as it should be clarified - that the problem setting is actually quite a different one.

      We agree that PCM and encoding models offer a flexible approach and handle continuous trial-by-trial variables. We have clarified on page 6 that the problem setting of cRSA is distinct, and we have added the robustness of encoding models and their limitations to the Discussion.

      Page 6. “While other approaches such as Pattern Component Modeling (PCM) (Diedrichsen et al., 2018) and encoding models (Naselaris et al., 2011) are well-suited to analyzing variables that vary continuously on a trial-by-trial or moment-by-moment basis, these frameworks address different inferential goals. Specifically, PCM and encoding models focus on estimating variance components or predicting activation from features, while cRSA is designed to evaluate representational geometry. Thus, cRSA as well as our proposed approach address a problem setting distinct from PCM and encoding models”.

      (19) Page 8: "Then, we generated two noise patterns, which were controlled by parameters 𝜎𝐴 and 𝜎𝐵, respectively, one for each condition." This makes little sense to me. The noise patterns should be unique to each trial - you should generate n_a + n_b noise patterns, no?

      We clarify that the “noise patterns” here are n_voxel x n_trial in size; in other words, all trial-level noise patterns are generated together and each trial has its own unique noise pattern. We have revised our description to “two sets of noise patterns” for clarity, starting on page 9.

      (20) Page 9: First, I assume if this is supposed to be a hierarchical level model, the "noise parameters" here correspond to variances? Or do these \sigma values mean to signify standard deviations? The latter would make little sense. Or is it the noise pattern itself?

      As clarified in 4., the σ values are meant to denote hierarchical components of the composite standard deviation; we have updated our notation to use lower case letter s instead for clarity.

      (21) Page 10: your formula states "𝜎<sub>𝑠𝑢𝑏𝑗</sub>~ 𝙽(0, 0.5^2)". This conflicts with your previous mention that \sigmas are noise "levels"; are they the noise patterns themselves now? Variances cannot be normally distributed, as they cannot be negative.

      As clarified in 4., the σ values are meant to denote hierarchical components of the composite standard deviation; we have updated our notation to use lower case letter s instead for clarity.

      (22) Page 13: What was the task of the subject in the Memory retrieval task? Old/new judgements relative to encoding of object perception?

      We apologize for the lack of clarity about the Memory Retrieval task and have added that information and clarified that the old/new judgements were relative to a separate encoding phase, the brain data for which has been reported elsewhere.

      Page 14. “Memory Retrieval took place one day after Memory Encoding and involved testing participants’ memory of the objects seen in the Encoding phase. Neural data during the Encoding phase has been reported elsewhere. In the main Memory Retrieval task, participants were presented with 144 labels of real-world objects, of which 114 were labels for previously seen objects and 30 were unrelated novel distractors. Participants made old/new judgements and rated their confidence in those judgements on a four-point scale (1 = Definitely New, 2 = Probably New, 3 = Probably Old, 4 = Definitely Old)”.

      (23) Page 13: If "Memory Retrieval consisted of three scanning runs", then some of the stimulus x stimulus correlations for the RSM must have been calculated within a run and some between runs, correct? Given that all within-run estimates share a common baseline, they share some dependence. Was there a systematic difference between the within-run and the between-run correlations?

      We have clarified in this portion of the Methods that within-run comparisons were excluded from our analyses. We also double-checked that the within-run exclusion was included in the description of the Neural RSMs.

      Page 14. “Retrieval consisted of three scanning runs, each with 38 trials, lasting approximately 9 minutes and 12 seconds (within-run comparisons were later excluded from RSA analyses)”.

      Page 18. “This was done by vectorizing the voxel-level activation values within each region and calculating their correlations using Pearson’s r, excluding all within-run comparisons.”
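
      A minimal Python sketch of this step is given below, assuming one vectorized activation pattern and one run label per trial. The function name `neural_rsm`, the use of `None` for excluded cells, and the example values are hypothetical; note that the diagonal is excluded as well, since a trial shares a run with itself.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson's r between two equal-length activation vectors."""
    mx, my = fmean(x), fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def neural_rsm(patterns, runs):
    """Trial x trial similarity matrix; within-run cells are left as None."""
    n = len(patterns)
    rsm = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if runs[i] != runs[j]:  # keep between-run comparisons only
                rsm[i][j] = pearson(patterns[i], patterns[j])
    return rsm


# Hypothetical data: four trials, three voxels each, two scanning runs.
patterns = [[0.1, 0.5, 0.3], [0.2, 0.4, 0.9],
            [0.7, 0.1, 0.2], [0.3, 0.8, 0.5]]
runs = [1, 1, 2, 2]
rsm = neural_rsm(patterns, runs)
print(rsm[0])
```

      Any downstream row-wise comparison with a model RSM would then need to skip the `None` cells so that only between-run similarities contribute.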

      (24) Page 20: It is not clear why the mean estimate of "representational strength" (i.e., model-brain RSM correlations) is important at all. This comes back to Major point #2, namely that you are trying to solve a very different problem from model-comparative RSA.

      We have clarified that our approach is not an alternative to model-comparative RSA, and that depending on the task constraints researchers may choose to compare models with tRSA or other approaches requiring stimulus repetition (see 3).

      (25) Page 21: I believe the problems of simulating correlation matrices directly in the way that the authors in their first simulation did should be well known and should be moved to an appendix at best. Better yet, the authors could start with the correct simulation right away.

      We agree the paper is more concise with these simulations being moved to the appendix and more briefly discussed. We have implemented these changes (Appendix 1). However, we are not certain that this problem is unknown, and have several anecdotes of researchers inquiring about this “alternative” approach in talks with colleagues, thus we do still discuss the issues with this method.

      (26) Page 26: Is the "underlying continuous noise variable 𝜎𝑡𝑟𝑖𝑎𝑙 that was measured by 𝑣𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑑 " the variance of the noise pattern or the noise pattern itself? What does it mean it was "measured" - how?

      𝜎𝑡𝑟𝑖𝑎𝑙 is a vector of standard deviations for different trials, and 𝜎𝑡𝑟𝑖𝑎𝑙 i would be used to generate the noise patterns for trial i. v_measured is a hypothetical measurement of trial-level variability, such as “memorability” or “heartbeat variability”. We have revised our description to clarify our methods.
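
      As a rough illustration of this description, the Python sketch below draws one noise pattern per trial using that trial's standard deviation. It rests on assumptions not stated in the response: the noise is taken to be zero-mean Gaussian and independent across voxels, the output is laid out as trials x voxels, and the function name `make_noise` and the example values are hypothetical. How v_measured relates to 𝜎𝑡𝑟𝑖𝑎𝑙 is not modeled here.

```python
import random


def make_noise(sigma_trial, n_voxels, seed=0):
    """Return one noise pattern per trial (list of trials x voxels).

    sigma_trial[i] is the standard deviation used for trial i.
    Assumption: zero-mean Gaussian noise, independent across voxels.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sd) for _ in range(n_voxels)]
            for sd in sigma_trial]


# Hypothetical trial-level standard deviations for three trials.
sigma_trial = [0.5, 1.0, 2.0]
noise = make_noise(sigma_trial, n_voxels=4)
for sd, pattern in zip(sigma_trial, noise):
    print(sd, [round(v, 2) for v in pattern])
```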

      Reviewer #2 (Recommendations for the authors):

      (8) It would be helpful to provide more clarity earlier on in the manuscript on what is a 'trial': in my experience, a row or column of the RDM is usually referred to as 'stimulus condition', which is typically estimated on multiple trials (instances or repeats) of that stimulus condition (or exemplars from that stimulus class) being presented to the subject. Here, a 'trial' is both one measurement (i.e., single, individual presentation of a stimulus) and also an entry in the RDM, but is this the most typical scenario for cRSA? There is a section in the Discussion that discusses repetitions, but I would welcome more clarity on this from the get-go.

      We have added discussion of stimulus repetition methods and datasets to the Introduction and clarified our use of the terms.

      Page 8. “Critically, in single-presentation designs, a “trial” refers to one stimulus presentation, and corresponds to a row or column in the RSM. In studies with repeated stimuli, these rows are often called “conditions” and may reflect aggregated patterns across trials. tRSA is compatible with both cases: whether rows represent individual trials or averaged trials that create “conditions”, tRSA estimates are computed at the row level”.

      (9) The quality of the results figures can be improved. For example, axes labels are hard to read in Figure 3A/B, panels 3C/D are hard to read in general. In Figure 7E, it's not possible to identify the 'dark red' brain regions in addition to the light red ones.

      We thank the reviewer for raising these issues and have edited the figures to be more readable in the manner suggested.

      (10) I would be interested to see a comparison between tRSA and cRSA in other fMRI (or other modality) datasets that have been extensively reported in the literature. These could be the original Kriegeskorte 96 stimulus monkey/fMRI datasets, commonly used open datasets in visual perception (e.g., THINGS, NSD), or the above-mentioned King et al. dataset, which has been analyzed in various papers.

      We recognize the great utility of replication from other research groups and do invite researchers to utilize tRSA on their datasets.

      (11) On P39, the authors suggest 'researchers can confidently replace their existing cRSA analysis with tRSA': Please discuss/comment on how researchers should navigate the choice of modeling parameters in tRSA's linear mixed effects setting.

      We have added discussion of the mixed-effects parameters and encourage researchers to follow best practices for their model selection.

      Page 46. “However, researchers should always consider if their models match the goals of their analysis, including 1) constructing the random effects structure that will converge in their dataset and 2) testing their model fits against alternative structures (Meteyard & Davies, 2020; Park et al., 2020) and 3) considering which effects should be considered random or fixed depending on their research question”.

      (12) The final part of the Results section, demonstrating the tRSA results for the continuous memorability factor in the real fMRI data, could benefit from some substantiation/elaboration. It wasn't clear to me, for example, to what extent the observed significant association between representational strength and item memorability in this dataset is to be 'believed'; the Discussion section (p38). Was there any evidence in the original paper for this association? Or do we just assume this is likely true in the brain, based on prior literature by e.g. Bainbridge et al (who probably did not use tRSA but rather classic methods)?

      Indeed, memorability effects have been replicated in the literature, but not using the tRSA method. We have expanded our discussion to clarify the relationship of our findings and the relevant literature and methods it has employed.

      Page 38. “Critically, memorability is a robust stimulus property that is consistent across participants and paradigms (Bainbridge, 2022). Moreover, object memorability effects have been replicated using a variety of methods aside from tRSA, including univariate analyses and representational analyses of neural activity patterns where trial-level neural activity pattern estimates are correlated directly with object memorability (Slayton et al, 2025).”

      (13) The abstract could benefit from more nuance; I'm not sure if RSA can indeed be said to be 'the principal method', and whether it's about assessing 'quality' of representations (more commonly, the term 'geometry' or 'structure' is used).

      We have edited the abstract to reflect the true nuance in the current approaches.

      Abstract. Neural representation refers to the brain activity that stands in for one’s cognitive experience, and in cognitive neuroscience, a prominent method of studying neural representations is representational similarity analysis (RSA). While there are several recent advances in RSA, the classic RSA (cRSA) approach examines the structure of representations across numerous items by assessing the correspondence between two representational similarity matrices (RSMs): usually one based on a theoretical model of stimulus similarity and the other based on similarity in measured neural data.

      (14) RSA is also not necessarily about models vs. neural data; it can also be between two neural systems (e.g., monkey vs. human as in Kriegeskorte et al., 2008) or model systems (see Sucholutsky et al., 2023). This statement is also repeated in the Introduction paragraph 1 (later on, it is correctly stated that comparing brain vs. model is most likely the 'most common' approach).

      We have added these examples in our introduction to RSA.

      Page 3. “One of the central approaches for evaluating information represented in the brain is representational similarity analysis (RSA), an analytical approach that queries the representational geometry of the brain in terms of its alignment with the representational geometry of some cognitive model (Kriegeskorte et al., 2008; Kriegeskorte & Kievit, 2013), or, in some cases, compares the representational geometry of two neural systems (e.g., Kriegeskorte et al., 2008) or two model systems (Sucholutsky et al., 2023)”.

      (15) 'theoretically appropriate' is an ambiguous statement, appropriate for what theory?

      We apologize for the ambiguous wording, and have corrected the text:

      Page 11. “Critically, tRSA estimates were submitted to a mixed-effects model which is statistically appropriate for modeling the hierarchical structure of the data, where observations are nested within both subjects and stimuli (Baayen et al., 2008; Chen et al., 2021)”.

      (16) I found the statement that cRSA "cannot model representation at the level of individual trials" confusing, as it made me think, what prohibits one from creating an RDM based on single-trial responses? Later on, I understood that what the authors are trying to say here (I think) is that cRSA cannot weigh the contributions of individual rows/columns to the overall representational strength differently.

      We thank the reviewer for their clarifying language and have added it to this section of the manuscript.

      “Abstract. However, because cRSA cannot weigh the contributions of individual trials (RSM rows/columns), it is fundamentally limited in its ability to assess subject-, stimulus-, and trial-level variances that all influence representation”.

      (17) Why use "RSM" instead of "RDM"? If the pairwise comparison metric is distance-based (e.g., 1-correlation as described by the authors), RDM is more appropriate.

      We apologize for the error, and have clarified the Methods text:

      Page 3-4. First, brain activity responses to a series of N trials are compared against each other (typically using Pearson’s r) to form an N×N representational similarity matrix.
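
      As a minimal sketch of the construction described above, assuming hypothetical trial-by-voxel activity patterns (the `patterns` values and the function names are illustrative and do not come from the manuscript), each pair of trials can be correlated with Pearson's r to fill an N×N similarity matrix:

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def similarity_matrix(patterns):
    """N x N matrix of pairwise Pearson correlations between trial patterns."""
    return [[pearson_r(p, q) for q in patterns] for p in patterns]


# Hypothetical data: 3 trials x 4 voxels (not from the study).
patterns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
rsm = similarity_matrix(patterns)
```

      Each row (or column) of `rsm` then corresponds to one trial, which matches the row-level unit the response refers to.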

      (18) Figure 2: please write 'Correlation estimate' in the y-axis label rather than 'Estimate'.

      We have edited the label in Figure 2.

      (19) Page 6 'leaving uncertain the directionality of any findings' - I do not follow this argument. Obviously one can generate an RDM or RSM from vector v or vector -v. How does that invalidate drawing conclusions where one e.g., partials out the (dis)similarity in e.g., pleasantness ratings out of another RDM/RSM of interest?

      We agree such an approach does not invalidate the partial method; we have clarified what we mean by “directionality”.

      Page 8. “For instance, even though a univariate random variable, such as pleasantness ratings, can be conveniently converted to an RSM using pairwise distance metrics (Weaverdyck et al., 2020), the very same RSM would also be derived from the opposite random variable, leaving uncertain the directionality (or if representation is strongest for pleasant or unpleasant items) of any findings with the RSM (see also Bainbridge & Rissman, 2018)”.
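
      The directionality point can be illustrated with a small sketch. It assumes the similarity between two items is the negative absolute difference of their ratings, which is only one possible pairwise distance metric; the `ratings` values and the function name are hypothetical:

```python
def distance_rsm(values):
    """Similarity matrix from a univariate variable (negative absolute difference)."""
    return [[-abs(a - b) for b in values] for a in values]


# Hypothetical pleasantness ratings for four items (not from the study).
ratings = [1.0, 2.5, 4.0, 4.5]
flipped = [-r for r in ratings]  # the sign-flipped ("opposite") variable

rsm_original = distance_rsm(ratings)
rsm_flipped = distance_rsm(flipped)

# Both variables give the same matrix, so the RSM alone cannot tell
# which end of the rating scale drives an effect.
same = rsm_original == rsm_flipped
```

      Under this assumption `same` is true for any input, because |a - b| is unchanged when both values change sign.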

      (20) P7 'sampled 19900 pairs of values from a bi-variate normal distribution', but the rows/columns in an RDM are not independent samples - shouldn't this be included in the simulation? I.e., shouldn't you simulate first the n=200 vectors, and then draw samples from those, as in the next analysis?

      This section has been moved to Appendix 1 (see responses to Reviewer 1.13).

      (21) Under data acquisition, please state explicitly that the paper is re-using data from prior experiments, rather than collecting data anew for validating tRSA.

      We have clarified this in the data acquisition section.

      Page 13. “A pre-existing dataset was analyzed to evaluate tRSA. Main study findings have been reported elsewhere (S. Huang, Bogdan, et al., 2024)”.

      (22) Figure 4 could benefit from some more explanation in-text. It wasn't clear to me, for example, how to interpret the asterisks depicted in the right part of the figure.

      We clarified the meaning of the asterisks in the main text in addition to the existing text in the figure caption.

      Page 26. “see Figure 4, off-diagonal cells in blue; asterisks indicate where tRSA was statistically more sensitive than cRSA)”.

      (23) Page 38 "the outcome of tRSA's improved characterization can be seen in multiple empirical outcomes:" it seems there is one mention of 'outcomes' too many here.

      We have revised this sentence.

      Page 41. “tRSA's improved characterization can be seen in multiple empirical outcomes”.

      (24) Page 38 "model fits became the strongest" it's not clear what aspect of the reported results in the paragraph before this is referring to - the Appendix?

      Yes, the model fits are in the Appendix; we have added this in-text citation.

      Moreover, model-fits became the strongest when the models also incorporated trial-level variables such as fMRI run and reaction time (Appendix 3, Table 6).

      References

      Diedrichsen, J., Berlot, E., Mur, M., Schütt, H. H., Shahbazi, M., & Kriegeskorte, N. (2021). Comparing representational geometries using whitened unbiased-distance-matrix similarity. Neurons, Behavior, Data and Theory, 5(3). https://arxiv.org/abs/2007.02789

      Diedrichsen, J., & Kriegeskorte, N. (2017). Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis. PLoS Computational Biology, 13(4), e1005508.

      Diedrichsen, J., Yokoi, A., & Arbuckle, S. A. (2018). Pattern component modeling: A flexible approach for understanding the representational structure of brain activity patterns. NeuroImage, 180, 119-133.

      Naselaris, T., Kay, K. N., Nishimoto, S., & Gallant, J. L. (2011). Encoding and decoding in fMRI. NeuroImage, 56(2), 400-410.

      Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLoS Computational Biology, 10(4), e1003553.

      Schütt, H. H., Kipnis, A. D., Diedrichsen, J., & Kriegeskorte, N. (2023). Statistical inference on representational geometries. eLife, 12. https://doi.org/10.7554/eLife.82566

      Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., & Diedrichsen, J. (2016). Reliability of dissimilarity measures for multi-voxel pattern analysis. NeuroImage, 137, 188-200.

      King, M. L., Groen, I. I., Steel, A., Kravitz, D. J., & Baker, C. I. (2019). Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. NeuroImage, 197, 368-382.

      Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., ... & Bandettini, P. A. (2008). Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6), 1126-1141.

      Sucholutsky, I., Muttenthaler, L., Weller, A., Peng, A., Bobu, A., Kim, B., ... & Griffiths, T. L. (2023). Getting aligned on representational alignment. arXiv preprint arXiv:2310.13018.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #3:

      Comments on revised version:

      This revised version is largely improved, and the responses to reviewers' comments are generally relevant. However, the response regarding pre-nodes is not satisfactory. I understand that the authors prefer to avoid further experimentation, but I think this is an important point that needs to be clarified. Exploring stages between E12 and E15 is therefore important. When carefully examining some of the figures (Fig. 1E or 2D), I think that at E15 there may well be pre-node formation prior to myelin deposition, on structures the authors considered to be heminodes. To be convincing, they should use double or triple labeling with, in addition to the nodal proteins (ankG and/or Nav pan), a good myelin marker such as anti-PLP. The rat monoclonal developed by the late Prof. Ikenaka would give a sharper staining than the anti-MAG they used. (I assume the clone must still be available in Okazaki.)

      We appreciate your insightful comment regarding the possible presence of pre-nodal clusters along NM axons and your kind suggestion to use the PLP antibody (clone AA3; Yamamura et al., J Neurochem, 1991). We have obtained this monoclonal antibody from Dr. Kenji Tanaka previously in Okazaki and confirmed that it works well in chicken tissues. However, since this clone recognizes both PLP and DM-20 isoforms, it labels not only myelin-forming oligodendrocytes (MFOLs) but also newly formed oligodendrocytes (NFOLs) (Yokoyama et al., J Neurochem, 2025). Therefore, it is not ideal for determining whether nodal protein clusters are formed before myelin deposition.

      Instead, we performed double immunostaining for MAG and AnkG between E12 and E15 to clarify the temporal relationship between myelin maturation and node formation. The results showed that detectable AnkG clusters along NM axons began to appear very sparsely around E13, coinciding with the emergence of MAG signals, and became more prominent with development. This temporal pattern does not match the definition of pre-nodal clusters, which are formed prior to myelination.

      Although we cannot completely rule out the possibility of undetectable pre-nodal clusters or those composed of molecules other than AnkG, our results support the view that pre-nodal clusters are unlikely to play a major role in determining the regional difference in nodal spacing along NM axons. These new data have been added as Figure 2—figure supplement 1, and the relevant sections in the Results, Discussion, and Figure legend have been revised accordingly (page 5, line 4; page 10, line 7; page 29, line 1).

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public Review):

      Summary:

      The authors attempted to clarify the impact of N protein mutations on ribonucleoprotein (RNP) assembly and stability using analytical ultracentrifugation (AUC) and mass photometry (MP). These complementary approaches provide a more comprehensive understanding of the underlying processes. Both SV-AUC and MP results consistently showed enhanced RNP assembly and stability due to N protein mutations.

      The overall research design appears well planned, and the experiments were carefully executed.

      Strengths:

      SV-AUC, performed at higher concentrations (3 µM), captured the hydrodynamic properties of bulk assembled complexes, while MP provided crucial information on dissociation rates and complex lifetimes at nanomolar concentrations. Together, the methods offered detailed insights into association states and dissociation kinetics across a broad concentration range. This represents a thorough application of solution physicochemistry.

      We thank the Reviewer for this positive assessment. 

      Weaknesses:

      Unlike AUC, MP observes only a part of the solution. In MP, bound molecules are accumulated on the glass surface (not dissociated), thus the concentration in solution should change as time develops. How does such concentration change impact the result shown here?

      We agree with the Reviewer that the concentration in solution above the surface will change with time; however, the impact of surface adsorption turns out to be negligible. To show this we have added a calculation as Supplementary Methods that is based on the number of imaged adsorption events, the fraction of imaged area to total surface area, and the initial sample volume and concentration. Under our experimental conditions the reduction is less than 1%, which is well within the range of experimental concentration errors.
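
      The kind of estimate described here can be sketched as follows. This is not the calculation from the Supplementary Methods; it is an illustration that assumes adsorption is uniform over the wetted surface, and every input value and name below is a hypothetical placeholder:

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def depletion_fraction(n_imaged_events, imaged_area_fraction, volume_l, conc_molar):
    """Estimated fraction of molecules lost from solution to the surface.

    Assumes uniform adsorption, so the imaged event count is scaled up
    by the ratio of total surface area to imaged area.
    """
    total_adsorbed = n_imaged_events / imaged_area_fraction
    initial_molecules = conc_molar * volume_l * AVOGADRO
    return total_adsorbed / initial_molecules


# Hypothetical inputs chosen only to illustrate the arithmetic.
frac = depletion_fraction(
    n_imaged_events=5000,       # adsorption events counted in the field of view
    imaged_area_fraction=4e-6,  # imaged area / total wetted surface area
    volume_l=20e-6,             # 20 microlitre sample
    conc_molar=20e-9,           # 20 nM initial concentration
)
print(f"{100 * frac:.2f}% of molecules adsorbed")
```

      With inputs of this order of magnitude the estimated loss stays below 1%, in the same range as the figure quoted in the response, but the actual value depends on the real event counts and chamber geometry.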

      This is in line with the observation that surface adsorption of proteins to glass is critical and needs to be prevented when working at picomolar concentrations (Zhao H, Mayer ML, Schuck P. 2014. Analysis of protein interactions with picomolar binding affinity by fluorescence-detected sedimentation velocity. Anal Chem 86:3181–3187. doi:10.1021/ac500093m), but is ordinarily negligible when working in the mid-nanomolar concentration range. The difference in the MP experiments is that surface adsorption to glass and plastic, which is usually invisible, is imaged and quantified in MP. The negligible impact of surface adsorption on solution concentration in typical MP experiments is also in line with the results of several studies that have successfully measured dissociation constants of binding equilibria by MP (Young G et al., Science 360 (2018) 432; Wu & Piszczek, Anal Biochem 592 (2020) 113575; Soltermann et al., Angewandte Chemie 59 (2020) 10774) with samples in the 5-50 nM range and a similar experimental setup. It should be noted that in the MP experiments no surface functionalization is employed, in contrast to optical biosensors that utilize surface-immobilized ligands and polymeric matrices and thereby enhance the surface binding capacity.

      Even though this depletion effect is negligible under ordinary MP conditions, the Reviewer raises a good point and readers may have a similar question with this novel technique. For this reason, we have added in the MP section of the Methods the sentence “In either configuration, the impact of surface binding on the sample concentration is < 1% and negligible, as described in the Supplementary Methods S1.” and added the detailed calculations in the Supplement accordingly. The use of SV as a traditional, orthogonal technique and the observation of consistent results with those of MP should further dispel readers’ methodological concerns in this point.

      Reviewer #2 (Public Review):

      Summary:

      In this manuscript, the authors apply a variety of biophysical and computational techniques to characterize the effects of mutations in the SARS-CoV-2 N protein on the formation of ribonucleoprotein particles (RNPs). They find convergent evolution in multiple repeated independent mutations strengthening binding interfaces, compensating for other mutations that reduce RNP stability but which enhance viral replication.

      Strengths:

      The authors assay the effects of a variety of mutations found in SARS-CoV-2 variants of concern using a variety of approaches, including biophysical characterization of assembly properties of RNPs, combined with computational prediction of the effects of mutations on molecular structures and interactions. The findings of the paper contribute to our increasing understanding of the principles driving viral self-assembly, and increase the foundation for potential future design of therapeutics such as assembly inhibitors.

      Thank you for highlighting the strengths of our paper and the potential impact on future design of therapeutics.

      Weaknesses:

      For the most part, the paper is well-written, the data presented support the claims made, and the arguments are easy to follow. However, I believe that parts of the presentation could be substantially improved. I found portions of the text to be overly long and verbose and likely could be substantially edited; the use of acronyms and initialisms is pervasive, making parts of the exposition laborious to follow; and portions of the figures are too small and difficult to read/understand.

      We are glad the Reviewer concurs that the data support our conclusions and finds the arguments easy to follow. We appreciate the comment that the work was not optimally presented. To address this point, we have identified multiple opportunities to streamline the text without jeopardizing the clarity. We have also rewritten the end of the Introduction.

      As recommended, we have reduced and harmonized the use of acronyms and abbreviations throughout the text to improve readability. Specifically, we have now spelled out nucleic acid (NA), intrinsically disordered regions (IDR), full-length (FL), AlphaFold (AF3), and variants of concern (VOC).

      Finally, we have improved the presentation of most figures, adding labels and new panels, and increased the label font sizes to facilitate more detailed inspections of the data.

      Reviewer #3 (Public Review):

      This manuscript investigates how mutations in the SARS-CoV-2 nucleocapsid protein (N) alter ribonucleoprotein (RNP) assembly, stability, and viral fitness. The authors focus on mutations such as P13L, G214C, and G215C, combining biophysical assays (SV-AUC, mass photometry, CD spectroscopy, EM), VLP formation, and reverse genetics. They propose that SARS-CoV-2 exploits "fuzzy complex" principles, where distributed weak interfaces in disordered regions allow both stability and plasticity, with measurable consequences for viral replication.

      Strengths:

      (1) The paper demonstrates a comprehensive integration of structural biophysics, peptide/protein assays, VLP systems, and reverse genetics.

      (2) Identification of both de novo (P13L) and stabilizing (G214C/G215C) interfaces provides a mechanistic insight into RNP formation.

      (3) Strong application of the "fuzzy complex" framework to viral assembly, showing how weak/disordered interactions support evolvability, is a significant conceptual advance in viral capsid assembly.

      (4) Overall, the study provides a mechanistic context for mutations that have arisen in major SARS-CoV-2 variants (Omicron, Delta, Lambda) and a mechanistic basis for how mutations influence phenotype via altered biomolecular interactions.

      We are grateful for these comments highlighting this work as a significant conceptual advance.

      Weaknesses:

      (1) The arrangement of N dimers around LRS helices is presented in Figure 1C, but the text concedes that "the arrangement sketched in Figure 1C is not unique" (lines 144-146) and that AF3 modeling attempts yielded "only inconsistent results" (line 149).

      The authors should therefore present the models more cautiously as hypotheses instead. Additional alternative arrangements should be included in the Supplementary Information, so the readers do not over-interpret a single schematic model.

      We agree that in the absence of high-resolution structures the RNP models are hypothetical, and have now emphasized this in the Results, following the Reviewer’s recommendation. To present alternative arrangements that satisfy the biophysical constraints upfront, we have promoted the previous Supplementary Figure 11 showing different models to the first Supplementary Figure, and expanded it with examples of different oligomers. In this way it is referenced early on in the Results and in the legend to Figure 1C. We agree this strengthens the manuscript, as one of the take-home messages is the inherent polydispersity of the RNPs.

      The fact that AF3 can only provide inconsistent results will not come as a surprise, given the substantial disordered regions of the complex, and is a drawback of AF3 rather than our structural model. We slightly emphasized this point so as to clarify that the presentation of the AF3-based RNP structure serves solely as supporting evidence that our hypothetical model is sterically reasonable.

      The new Results paragraph reads:

      “As suggested in the cartoon of Figure 1C, this supports the hypothesis of a three-dimensional arrangement with a central LRS oligomer with symmetry properties and dimensions similar to low resolution EM images of model RNPs (Carlson et al., 2022, 2020) and cryo-ET of RNPs in virions (Klein et al., 2020; Yao et al., 2020).  It should be noted, however, that the arrangement sketched in Figure 1C is not unique and other subunit orientations could be envisioned that satisfy all constraints from experimentally observed binding interfaces, including different oligomers and anti-parallel subunits as illustrated in Supplementary Figure S1. Extending previous ColabFold structural predictions that show multiple N-protein dimers self-assembled via the LRS coiled-coils (Zhao et al., 2023), we attempted the AlphaFold modeling of RNPs combining multiple N dimers with SL7 RNA ligands, mimicking our biophysical assembly model. Current AlphaFold restrictions limit the prediction to pentamers of N-protein dimers with 10 copies of SL7 RNA. While only inconsistent results were obtained – which is not surprising given the large intrinsically disordered regions exceed the predictive power of AlphaFold – some models did produce an overall RNP organization similar to Figure 1C, suggesting such an arrangement is at least sterically reasonable with regard to possible N-protein subunit orientations in an RNP (Supplementary Figure S2)”

      (2) Negative-stained EM fibrils (Figure 2A) and CD spectra (Figure 2B) are presented to argue that P13L promotes β-sheet self-association. However, the claim could benefit from more orthogonal validation of β-sheet self-association. Additional confirmation via FTIR spectra or ThT fluorescence could be used to further distinguish structured β-sheets from amorphous aggregation.

      We completely agree that the application of multiple orthogonal biophysical methods can strengthen the conclusions. In addition to EM fibrils and CD spectra (a classical gold standard technique for protein secondary structure in solution), we already have support from ColabFold modeling, as well as NMR results from the Zweckstetter lab showing the potential for β-sheet-like conformations.

      Furthermore, we believe the evidence for the absence of ‘amorphous aggregates’ is very strong, as this would be inconsistent with the long-range order required to create the visibly fibrillar morphology in EM, and amorphous aggregates would be inconsistent with the increased solution viscosity. In this context, it is also highly relevant that the β-sheet-like secondary structure recorded by CD is concentration-dependent and reversible upon dilution. The long-range spatial order of fibrils is consistent with the formation of secondary structure in solution.

      In addition, it must be kept in mind that what we see is specific to N-arm peptides carrying the P13L mutation (in EM, CD, and structural prediction) and does not occur in the other two N-arm peptides (ancestral N-arm and N-arm with deletion of 31-33), linker peptides, or C-arm peptides.

      Most importantly, as elaborated in more detail below, we do not claim that fibril formation is physiologically relevant. At the heart of this – in the context of the evolution of fuzzy complexes – is that the P13L mutation creates additional weak protein-protein interactions. Indeed, the assembly of fibrils geometrically requires at least two interfaces for each subunit. These weak interactions are at play physiologically in the context of the disordered RNP particles, and in macromolecular condensates, but not in the formation of fibrils. Therefore, while we appreciate the suggestion of FTIR spectra and ThT staining, we are afraid further emphasis on the fibril structure might confuse the reader, so we would rather clarify upfront that these fibrillar assemblies are not thought to form in vivo from full-length protein, but merely demonstrate the presence of N-arm self-association interfaces in the model of truncated peptides.

      Accordingly, we have amended the Results paragraph reporting the fibrils:

      “Thus, the N-arm mutation P13L is responsible for the formation of fibrils in N-arm peptides after prolonged storage. Some of these N-arm fibrils exhibit a twisted morphology with width of ≈5 nm (Figure 2A), in some instances exhibiting patterns of strand breaks. Such fibrils are frequently encountered in proteins that can stack β-sheets, such as in amyloids (Paravastu et al., 2008). While we have not observed fibril formation in the context of full-length N, and have no evidence such fibrils are physiologically relevant, their occurrence in solutions of truncated N-arm peptide nonetheless demonstrates the introduction of ordered N-arm self-association interfaces in conformations of P13L mutants.”

      We have also summarized the experimental evidence more completely before describing the ColabFold prediction results (which previously did not mention the NMR data):

      “Finally, confirming the interpretation of the EM images and the CD data, as well as the β-structure propensity reported from NMR data (Zachrdla et al., 2022), the structural prediction of N[10-20]:P13L in ColabFold displayed oligomers with stacking β-sheets …”

      (3) In the main text, the authors alternate between emphasizing non-covalent effects ("a major effect of the cysteines already arises in reduced conditions without any covalent bonds," line 576) and highlighting "oxidized tetrameric N-proteins of N:G214C and N:G215C can be incorporated into RNPs". Therefore, the biological relevance of disulfide redox chemistry in viral assembly in vivo remains unclear. Discussing cellular redox plausibility and whether the authors' oxidizing conditions are meant as a mechanistic stress test rather than physiological mimicry could improve the interpretation of these results.

      The paper could benefit if the authors provide a summary figure or table contrasting reduced vs. oxidized conditions for G214C/G215C mutants (self-association, oligomerization state, RNP stability). Explicitly discuss whether disulfides are likely to form in infected cells.

      We thank the Reviewer for raising this most interesting point. The biological relevance of N disulfides remains unclear simply because it is, unfortunately, still unknown. Recently, Kubinski et al. have strongly argued for the formation of disulfides in infected cells, but in our view the evidence remains weak since the majority of disulfide bonds in that work presented as post-lysis artifacts, and it appears the non-covalent effects alone could explain the physiological observations. We aimed for a balanced presentation and wrote in the relevant Results section:

      “Covalent disulfide bonds in the LRS in non-reducing conditions were found to further promote LRS oligomerization. However, there is no conclusive data yet whether covalent bonds in the LRS occur in vivo, or any G215C effect is entirely non-covalent due to the significant strengthening of LRS helix oligomerization (see Discussion).”

      Despite the uncertainty regarding physiological disulfide bond formation, we believe it is useful to ask whether covalently crosslinked N dimers would aid or constrain RNP assembly in our biophysical model. We have now better explained this motivation in the Results section describing the RNP experiments:

      “Even though it is still unclear whether disulfide bonds of N cysteine mutants form in vivo, we were curious about the impact of disulfide-linked oligomers of the cysteine mutants on their RNP structure and stability in our biophysical assembly model.”

      The referenced paragraph from the Discussion reads:

      “Regarding the cysteine mutations that have been repeatedly introduced in the LRS prior to the rise of the Omicron VOCs, it is an open question whether they lead to covalent bonds in vivo or in the VLP assay. While examples of disulfide-linked viral nucleocapsid proteins have been reported (Kubinski et al., 2024; Prokudina et al., 2004; Wootton and Yoo, 2003), a methodological difficulty in their detection is artifactual disulfide bond formation post-lysis of infected cells (Kubinski et al., 2024; Wootton and Yoo, 2003).  However, our results clearly show that a major effect of the cysteines already arises in reduced conditions without any covalent bonds, through extension of the LRS helices, and concomitant redirection of the disordered N-terminal sequence. While oxidized tetrameric N-proteins of N:G214C and N:G215C can be incorporated into RNPs, the covalent bonds provided only marginally improved RNP stability.  Interestingly, the introduction of cysteines imposes preferences of RNP oligomeric states dependent on oxidation state, consistent with our MD simulations highlighting the impact of cysteine orientation of 214C versus 215C relative to the hydrophobic surface of the LRS helices. Overall, considering potentially detrimental structural constraints from covalent bonds on LRS clusters seeding RNPs, energetic penalties on RNP disassembly, as well as the required monomeric state of the LRS helix for interaction with the NSP3 Ubl domain (Bessa et al., 2022), at present it is unclear to what extent the formation of disulfide linkages between LRS helices would be beneficial or detrimental in the viral life cycle.”

      We feel that this text addresses the Reviewer’s comment, and that expanding the existing discussion further would conflict with other recommendations to shorten and focus the text.

      Finally, we have addressed the valuable suggestion of a new table summarizing the oligomeric state and self-association of the different cysteine mutants by inserting a new column in the existing Table 1 reporting all species’ oligomeric state at low micromolar concentrations. In this way they can be compared at a glance with the other mutants as well. A more detailed comparison of the concentration-dependent size-distribution is provided in Figure 4.

      (4) VLP assays (Figure 7) show little enhancement for P13L or G215C alone, whereas Figure 8 shows that P13L provides clear fitness advantages. This discrepancy is acknowledged but not reconciled with any mechanistic or systematic rationale. The authors should consider emphasizing the limitations of VLP assays and the sources of the discrepancy with respect to Figure 8.

      We thank the Reviewer for this comment, which highlights a very important point. 

      For clarification and to improve the cohesion of the manuscript we have inserted a reference to the Discussion after the presentation of the VLP results, which provides a natural transition to the following description of the reverse genetics experiments:

      “As expanded on in the Discussion, the failure to observe enhancement by P13L alone may be related to limitations of the VLP assay in sensitivity, including the restriction to a single round of infection, and protein expression levels.”

      This references a paragraph in the Discussion about the limitations of the VLP assay in general and the reasons we believe the enhancement by P13L alone was not picked up:

      “…While this assay has been widely used for rapid assessment of spike protein and N variants (Syed et al., 2021), it has limitations due to the addition of non-genomic RNA and the lack of double membrane vesicles from which gRNA emerges through the NSP3/NSP4 pore complex potentially poised for packaging (Bessa et al., 2022; Ke et al., 2024; Ni et al., 2023). It should also be recognized that the results do not directly reflect the relative efficiency of RNP assembly only, since protein expression levels, their localization, and their posttranslational modifications are not controlled for. Susceptibility for such factors might be exacerbated with mutations that modulate weak protein interactions. For example, as shown previously (Syed et al., 2024; Zhao et al., 2024), a GSK3 inhibitor inhibiting N-protein phosphorylation significantly enhances VLP formation and eliminates the advantage provided for by the N:G215C mutation relative to the ancestral N – presumably due to an increase in assembly-competent, non-phosphorylated N-protein erasing an affinity advantage. A similar process may be underlying the absent or marginal improvement in VLP readout from the cysteine LRS mutants and P13L at the achieved transfection level in the present work, and the enhanced signal from R203K/G204R and R203M (the latter being consistent with previous reports (Li et al., 2025; Syed et al., 2021)) modulating protein phosphorylation. Nonetheless, mirroring the results of the biophysical in vitro experiments, the addition of RNP-stabilizing P13L and G214C mutations on top of R203K/G204R led to a significantly larger VLP signal.

      The VLP assay may be limited in sensitivity to mutation effects due to its restriction to a single round of infection. To avoid this and other potential limitations of the VLP assay for the study of viral packaging, for the key mutation N:P13L we carried out reverse genetics experiments. These showed the sole N:P13L mutation significantly increases viral fitness (Figure 8).”

      (5) Figures 5 and 6 are dense, and the several overlays make it hard to read. The authors should consider picking the most extreme results to make a point in the main Figure 5 and move the other overlays to the Supplementary. Additionally, annotating MP peaks directly with "2×, 4×, 6× subunits" can help non-experts.

      We completely agree with the Reviewer – these figures were very dense. To mitigate this problem without requiring the reader to switch back and forth to the supplement, we subdivided the panels of Figure 5 and showed only a subset of curves in each. In this way the data are easier to read while still being readily compared. It is a large figure, but it contains the key data for the present work and is therefore worthwhile to have in one place. For the MP histogram data we have also inserted the suggested peak labels. Similarly, we have split Figure 6A into two panels for clarity.

      (6) The paper has several names and shorthand notations for the mutants, making it hard to keep up. The authors could include a table that contains mutation keys, with each shorthand (Ancestral, Nο/No, Nλ, etc.) mapped onto exact N mutations (P13L, Δ31-33, R203K/G204R, G214C/G215C, etc.). They could then use the same glyphs (Latin vs Greek) consistently in text and figure labels.

      Yes, we agree this is a problem and we apologize for the confusion. However, it is not possible to refer exclusively to either Latin or Greek terminology, which we feel would be even more detrimental to readability (the former being exhaustively lengthy and the latter being imprecise). But we have used a rational system: if the complete set of mutations of a variant is present, then its Greek letter is used as an abbreviation; otherwise we use Latin amino acid/position indicators for individual mutations or combinations thereof. Unfortunately, we previously failed to mention this explicitly, and we are most grateful to the Reviewer for pointing this out.

      We have now rectified this by including upfront the sentence:

      “We will adopt a nomenclature where the complete set of defining mutations of a variant will be referred to by its Greek letter, i.e., N:P13L/R203K/G204R/G214C is N<sub>λ</sub>, and analogously the set of Omicron mutations N:P13L/Δ31-33/R203K/G204R are referred to as N<sub>ο</sub>; see Table 1”

      This will define the two shorthands N<sub>λ</sub> and N<sub>ο</sub> used. Furthermore, as suggested and pointed to in the text, Table 1 does provide the keys to mutation and variants, including the information in which variant any of the other mutations studied here occur.

      (7) The EM fibrils (Figure 2A) and CD spectra (Figure 2B) were collected at mM peptide concentrations. These are far above physiological levels and may encourage non-specific aggregation. Similarly, the authors mention "ultra-weak binding energies that require mM concentrations to significantly populate oligomers". On the other hand, the experiments with full-length protein were performed at concentrations closer to biologically relevant concentrations in the micromolar range. While I appreciate the need to work at high concentrations to detect weak interactions, this raises questions about physiological relevance.

      This is indeed an important point to clarify. We agree that much lower nucleocapsid protein concentrations are present in the cytosol on average, and these were used in our RNP assembly experiments. However, there are at least two important physiologically relevant cases where high local N concentrations do occur:

      (1) Once assembled in RNPs, the disordered N-terminal extensions are locally at a very high concentration within the volume they can explore while tethered to the NTD. A back-of-the-envelope calculation assuming 12 N-protein subunits confining 12 N-terminal extensions to the volume of a single RNP (≈14×14×14 nm<sup>3</sup> by cryoEM; Klein et al., 2020) leads to an effective concentration of 7.4 mM. Obviously the N-arm peptides are not completely free and there will be constraints that would hinder or promote encounter complex probability, but interfaces with mM K<sub>D</sub> are clearly strong enough to populate N-arm–N-arm contacts extending from N-protein in the RNP.

      Additionally, any interaction where N-proteins are brought in close proximity could allow weak N-arm interactions to provide additional stability. Besides the RNP, we demonstrate this in our Results for nucleic-acid liganded N tetramers (Figure 4B), but this might similarly occur in complexes with NSP3 or host proteins. Generally, it is quite common that small additional binding energies play important roles in the modulation of multivalent protein complexes.

      (2) Within the macromolecular condensate the local concentration will be substantially higher than on average within the infected cell.  While we do not know its precise concentration, it is well-established that the sum of many ultra-weak interactions is driving the formation of this dense liquid phase. In our previous eLife paper (Nguyen et al., 2024) we have shown LLPS is suppressed with the R203K/G204R mutation, but it is ‘rescued’ with the additional P13L/del31-33 mutation of the Omicron variant showing strong LLPS. Similarly, LLPS is suppressed by the LRS mutant L222P, but rescued in conjunction with P13L. This is another biologically relevant scenario where weak interactions are critical.
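      The back-of-the-envelope estimate in point (1) can be reproduced with a short sketch. It assumes only what is stated there: 12 N-arm chains confined to a cube with a 14 nm edge. The function name and its parameters are illustrative. With an edge of exactly 14 nm the result is slightly above 7 mM, in line with the ≈7.4 mM quoted for the approximate RNP dimensions.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def effective_concentration_mM(n_chains: int, edge_nm: float) -> float:
    """Concentration (mM) of n_chains confined to a cube of the given edge.

    Illustrative helper; assumes the chains explore the full cube volume.
    """
    volume_litres = (edge_nm ** 3) * 1e-24  # 1 nm^3 = 1e-24 L
    moles = n_chains / AVOGADRO
    return moles / volume_litres * 1e3  # mol/L -> mmol/L

# 12 N-terminal extensions in a ~14 x 14 x 14 nm^3 RNP volume
c = effective_concentration_mM(n_chains=12, edge_nm=14.0)
print(f"effective N-arm concentration: {c:.2f} mM")
```

      Because the volume scales with the cube of the edge length, a small change in the assumed RNP dimension shifts the estimate noticeably, so the number is best read as an order-of-magnitude argument that the local concentration is in the mM range.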

      We have emphasized these points in the revised manuscript as described below.

      Specifically:

      (a) Could some of the fibril/β-sheet features attributed to P13L (Figure 2A-C) reflect non-specific aggregation at high concentrations rather than bona fide self-association motifs that could play out in biologically relevant scenarios?

      We understand this concern from the experience with proteins that often have limited solubility and tendencies to aggregate, sometimes accompanied by unfolding and driven by hydrophobic interactions, or clustering on the path to LLPS. However, we are struggling to reconcile the picture of non-specific aggregation with the context of our P13L N-arm peptides. The term ‘non-specific aggregation’ implies the idea of amorphous aggregates, which we would contend is inconsistent with the observed geometry of fibrils, which exhibit long-range order. In addition, non-specific aggregation does not lead to increased solution viscosity, which we describe, but fibril formation does. Another connotation of ‘aggregates’ is irreversibility.  However, we find the beta-sheet-like conformation seen at 1 mM becomes significantly more disordered when the same sample is diluted to 0.4 mM peptide. This is consistent with a reversible self-association driven by a conformational change toward ordered secondary structure.

      To highlight the reversibility, we have clarified the description: “Interestingly, diluting the 1 mM sample (solid) to a concentration of 0.4 mM (dashed) reveals a large shift in the far-UV spectra … both indicative of a significant increase of disorder upon dilution. This is consistent with the stabilization of β-sheets in a reversible, strongly cooperative self-association process with an effective K<sub>D</sub> in the high μM to low mM range.”

      We have also inserted a concentration conversion to mg/ml units, which shows even 1 mM of peptides is only ~5 mg/ml, i.e. not excessively high. “While the ancestral N-arm at ≈1 mM (≈4.6 mg/ml) concentrations exhibits CD spectra with a minimum at ≈200 nm typical of disordered conformations (black)”
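      The unit conversion is simple arithmetic: mass concentration equals molar concentration times molar mass. In the sketch below, the peptide molar mass of 4600 g/mol is an assumption inferred from the quoted pair of values (1 mM and 4.6 mg/ml); it is not stated explicitly in the text.

```python
def mg_per_ml(conc_mM: float, molar_mass_g_per_mol: float) -> float:
    """Convert a molar concentration in mM to a mass concentration in mg/ml."""
    # mmol/L * g/mol = mg/L; dividing by 1000 gives g/L, which equals mg/ml
    return conc_mM * molar_mass_g_per_mol / 1000.0

ASSUMED_MW = 4600.0  # g/mol, inferred from 1 mM ~ 4.6 mg/ml (assumption)

stock = mg_per_ml(1.0, ASSUMED_MW)    # undiluted CD sample
diluted = mg_per_ml(0.4, ASSUMED_MW)  # after dilution to 0.4 mM
print(f"1.0 mM -> {stock:.2f} mg/ml, 0.4 mM -> {diluted:.2f} mg/ml")
```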

      With regard to the question of specificity, we have studied similar N-arm peptides without P13L mutations and with the 31-33 deletion under equivalent conditions. But we observe the reversible self-association, conformational change, and fibril formation only for those containing the P13L mutation, consistent with ColabFold predictions. Neither did we observe fibrils with disordered C-arm peptides.

      How these weak self-association motifs in the N-arm can be physiologically relevant in the context of full-length protein modulating the stability of multi-molecular complexes and enhancing LLPS was outlined above, and further clarified in the manuscript as detailed below.

      (b) How do the authors justify extrapolating from the mM-range peptide behaviors to the crowded but far lower effective concentrations in cells?

      As pointed out above, the key to this question is the local preconcentration as the N-arm peptides are tethered to the rest of protein in the context of flexible multi-molecular assemblies. Another mechanism to consider is the formation of condensates. The response to the next comment will expand on this.

      The authors should consider adding a dedicated section (either in Methods or Discussion) justifying the use of high concentrations, with estimation of local concentrations in RNPs and how they compare to the in vitro ranges used here. For concentration-dependent phenomena discussed here, it is vital to ensure that the findings are not artefacts of non-physiological peptide aggregation.

      The use of high concentration in biophysical experiments is quite common, for example, in NMR or crystallography, insofar as they elucidate molecular properties. We believe this is obvious; the Reviewer will certainly agree with us, and this does not require further elaboration. The property observed in this case is the existence of specific, weak protein self-association interfaces in the N-arm.

      Our response to the Reviewer’s point 7(a) addresses the distinction between artefactual aggregation and self-association of N-arm peptides. The relevance of these weak protein self-association interfaces in the context of the full-length protein is the second underlying question.

      As we have previously stated in a dedicated Results paragraph:

      “In contrast to the modulation of the coiled-coil LRS interfaces, the de novo creation of the N-arm self-association interface through beta-sheet interactions enabled by P13L cannot be readily observed in full-length N-protein at low μM concentrations. Similar to the ancestral LRS interface, it provides only ultra-weak binding energies that require mM concentrations to significantly populate oligomers. This is fully consistent with the previous observation by SV-AUC that neither N:P13L,Δ31-33 nor N<sub>ο</sub> with the full set of Omicron mutations show any significant higher-order self-association at low μM concentrations, whereas at high local concentrations – as observed in phase-separated droplets – they can modulate and cooperatively enhance self-association processes (Nguyen et al., 2024). (In fact, P13L can substitute for the LRS promoting LLPS, as observed in the rescue of LLPS by N:P13L,Δ31-33/L222P mutants whereas N:L222P LRS-abrogating mutants are deficient in LLPS.) Another process that increases the local concentration of N-arm chains is the tetramerization of full-length N-protein. As described earlier, occupancy of the NA-binding site in the NTD allosterically promotes self-assembly of the LRS into higher oligomers (Zhao et al., 2021). We hypothesized that these oligomers may be cooperatively stabilized by additional N-arm interactions in P13L mutants.”

      To state completely unambiguously why weak interfaces are important, we have followed the Reviewer’s suggestion and added an additional clarification already earlier, at the end of the P13L Results section:

      “While this self-association interface in the P13L N-arm is weak and its direct observation in biophysical experiments requires mM concentrations, which far exceed average intracellular concentration of N, such weak interactions can become highly relevant physiologically when high local concentrations are prevailing, for example, when the disordered extension is preconcentrated while tethered within macromolecular assemblies as in the RNP, or in macromolecular condensates.”

      Furthermore, we have added early in the Discussion:

      “Even though the solution affinity of the N-arm P13L interface is ultra-weak, the average local concentration of N-arm chains across the RNP volume (in a back-of-the-envelope calculation assuming a ≈14 nm cube (Klein et al., 2020) with a dodecameric N cluster) is ≈7.4 mM, such that disordered N-arm peptides could well create populations of N-arm clusters stabilizing RNPs through this interface.  However, besides the RNP-stabilizing mutants we have also observed unexpected RNP destabilization by the ubiquitous R203K/G204R double mutation, which may be caused by the introduction of additional charges close to the self-association interface in the LRS. In our experiments, this destabilization is more than compensated for by the P13L mutation. (Another scenario where ultra-weak interactions can have a critical impact is in molecular condensates. We previously reported the suppression of LLPS by the R203K/G204R mutation, which is rescued by the additional P13L/Δ31-33 mutation (Nguyen et al., 2024). This is consistent with compensatory weak stabilizing and destabilizing impacts of weak interactions on the RNP observed here.)”

      Reviewer #1 (Recommendations for the Authors):

      In Figure 1B, it is unclear what the orange lines connecting polypeptides represent, as well as the zig-zag orange lines in the N-arm.

      We thank the Reviewer for this comment. We intended this to represent regions of self-association but recognize the patterned background is confusing. We have changed this now to solid-colored backgrounds, and indicated this in the figure legend:

      “Regions of self-association are indicated by shaded backgrounds.”

      Regarding presentation, in Figure 5 (MP), the relationship between mass and oligomer size should be shown more clearly.

      We agree. To this end we have labeled the peaks in the MP histograms in Figure 5 with the oligomeric state of the 2N/2SL7 subunits.

      Reviewer #2 (Recommendations for the Authors):

      I find the science of the paper to be convincing and compellingly supported.

      Thank you for this positive statement.

      My primary complaints are with presentation or minor technical questions that, honestly, primarily arise due to my own ignorance and unfamiliarity with some of the techniques employed.

      My primary issue is with the figures. I find, generally, the text in axes labels, ticks, and legends to be too small to comfortably read. This is particularly true in the CD spectra and other data presented in Figures 1D, 2B, 4, 5, 6, and 8.

      We agree and have increased the font size of all text and labels of the plots in Figures 1, 2, 4, 5, 6, and 8.

      I also found the use of initialisms to be a bit overbearing and inconsistent. For example, the authors repeatedly switch between spelling out "nucleic acid" and the initialism "NA" (which is also never explicitly spelled out in the text). With the already substantial length of the text, my own personal opinion would be to suggest spelling out all initialisms in the interest of making the reading easier.

      This is a valid criticism. To improve the readability, we have followed this advice and systematically spelled out “nucleic acid” instead of using “NA”.  Similarly, we have now written out full-length instead of the abbreviation FL, and omitted the abbreviation IDR for intrinsically disordered regions, as well as VOC for variant of concern, and AF3 for AlphaFold.

      Regarding the reference to mutants, we have now explained upfront the system of Latin and Greek nomenclature we consistently applied.

      “We will adopt a nomenclature where the complete set of defining mutations of a variant will be referred to by its Greek letter, i.e., N:P13L/R203K/G204R/G214C is N<sub>λ</sub>, and analogously the set of Omicron mutations N:P13L/Δ31-33/R203K/G204R are referred to as N<sub>ο</sub>; see Table 1”

      I found the text to be verbose, bordering on overly so; the Introduction is more than two pages long. The section "Enhanced oligomerization of the leucine-rich sequence through cysteine mutations" has two long paragraphs of introduction before the present results are discussed, et cetera. An (admittedly, very rough) estimation of the length of the paper places it at ~9,000-10,000 words long, and I think that the presentation might benefit from significant editing and shortening.

      We agree the manuscript is longer than would be desirable, and we generally prefer not to insert mini-introductions into Results sections. On the other hand, in order to make a solid contribution to understanding the big picture of fuzzy complexes in molecular evolution of RNA virus proteins it is indispensable to go into the details of RNP assembly and several of the interfaces. Therefore, we feel the length is in the range that it needs to be without losing clarity. In addition, other Reviewer suggestions to extend the discussion, for example, of limitations of VLP assays and the in vivo state of cysteines, conflict with significant shortening.

      In the particular case of the cysteine mutations, cited by the Reviewer, we believe it is important to add detailed background on G215C, because the Results proceed in a comparison of the self-association mode between G215C and G214C. This is of significant interest in the present context not only for the independent introduction of interface-enhancing mutations highlighting the evolution of fuzzy complexes, but also because it illustrates the pleomorphic ability of RNPs.

      Nonetheless, we have slightly shortened this text and merged the background into a single paragraph. More generally, we have critically reread the text to remove tangential sentences where possible and to make it more concise.

      I have a few more specific comments.

      In Figure 1A, I suggest explicitly labeling the location of the LRS, as it comes up repeatedly.

      Yes, we thank the Reviewer for this suggestion and have introduced this label in Figure 1A.

      In Figure 1B, the legend indicates that the red lines indicate "new inter-dimer interactions." However, these red lines are overlayed on a vertical stripe of red squiggles; it is unclear to me and not explicitly described in the legend what these squiggles are meant to illustrate.

      We agree this background was confusing. As mentioned in our Response to Reviewer #1 we have replaced the structured background with a solid background and explained in the figure legend that these areas depict regions of self-association.

      On lines 44-45, the authors state, "The IDRs amount to 45%, ..." 45% of what?

      Thank you, this was unclear.  We have now clarified “The IDRs amount to ≈45% of total residues”

      In lines 244 - 246, the authors compare the sizes of complexes in reducing versus non-reducing conditions as measured by dynamic light scattering, stating, "However, dynamic light scattering (DLS) revealed the presence of N210-246:G214C complexes with hydrodynamic radii ranging from 6 to 40 nm (in comparison to 1-2 nm for N210-246:G215C (Zhao et al., 2022)) in reducing conditions, and slightly larger in non-reducing conditions (Supplementary Figure S4)." Using this single statistic seems to me to be a less-than-ideal way of characterizing what seems to me to be happening here. In Supplementary Figure 4, it appears to me that what is happening is that in non-reduced conditions, the sample is monodisperse, whereas in reducing conditions, the distribution becomes polydisperse/bimodal, with two clearly separate populations. I feel that this could use a more thorough description rather than just stating the overall range of particle sizes.

      Yes, the Reviewer is correct – it is indeed a good idea to be more precise here. To this end we have carried out cumulant analyses on the autocorrelation functions, as a time-honored method to quantify the polydispersity.  Both samples are polydisperse, but more so in reducing conditions. We have now added “For N210-246:G214C a cumulant analysis results in radii of 8.8 nm and 10.6 nm and polydispersity indices of 0.40 and 0.35 for reducing and non-reducing conditions, respectively”
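      For readers unfamiliar with cumulant analysis, the sketch below shows the general idea under stated assumptions; it is not the authors' analysis code. A second-order cumulant fit models the logarithm of the field autocorrelation function as ln g1(τ) = a0 − Γτ + (μ2/2)τ², and the polydispersity index is μ2/Γ². The data here are synthetic, generated from a hypothetical decay rate and polydispersity, and the helper names are illustrative.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def cumulant_fit(taus, ln_g1):
    """Quadratic least-squares fit of ln g1(tau); returns (Gamma, PDI)."""
    t_max = max(taus)
    xs = [t / t_max for t in taus]  # rescale lag times to [0, 1] for conditioning
    # Normal equations for y = a0 + a1*x + a2*x^2
    s = [sum(x ** k for x in xs) for k in range(5)]
    rhs = [sum(y * x ** k for x, y in zip(xs, ln_g1)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] for i in range(3)]
    a0, a1, a2 = solve3(a, rhs)
    gamma = -a1 / t_max            # mean decay rate (1/s)
    mu2 = 2.0 * a2 / t_max ** 2    # second cumulant (1/s^2)
    return gamma, mu2 / gamma ** 2

# Synthetic example with hypothetical parameters (not measured data)
GAMMA_TRUE = 2.0e4  # 1/s
PDI_TRUE = 0.40
taus = [i * 1e-6 for i in range(1, 21)]  # lag times in seconds
ln_g1 = [-GAMMA_TRUE * t + 0.5 * PDI_TRUE * GAMMA_TRUE ** 2 * t ** 2 for t in taus]

gamma, pdi = cumulant_fit(taus, ln_g1)
print(f"Gamma = {gamma:.3e} 1/s, PDI = {pdi:.2f}")
```

      Converting the mean decay rate to a hydrodynamic radius additionally requires the scattering vector, temperature, and solvent viscosity via the Stokes-Einstein relation; those instrument parameters are not given here, so that step is left out.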

      Finally, I have one remaining comment that is a result of my own inexperience with circular dichroism and interpreting the spectra. For me personally, I would appreciate a more thorough description/illustration of the statistics involved in the CD spectra, but perhaps this is not necessary for people who are more familiar with interpreting these kinds of data. For example, in Figure 1D, it is not clear to me what the error bars/confidence intervals for the CD data look like. I see many squiggles, some of which the authors claim are significant (e.g., the differences between ~215 - 230 nm), and others are not worthy of comment. Let's say, for example, that I fit a smoothed spline through these data and then measure the magnitude of the fluctuations from that spline to define/quantify confidence intervals. What does that distribution look like? Or maybe the confidence intervals are so small that all squiggles are significant?

      Thank you, this is a good question. As mentioned in the methods section, the CD spectra shown are averages of triplicate scans. Therefore, it is straightforward to extract the standard deviation at each wavelength from the three measurements (although a spline would probably work just as well). The values are what one would expect for the squiggles to be random noise. In the region 215 – 220 nm characteristic for helical secondary structure the standard deviations are small relative to the separation between curves, which indicates that the differences are highly significant. Naturally, the curves do overlap in other spectral regions, which would make a plot including the wavelength-dependent error bars or confidence bands too crowded. Therefore, we have kept the plot of the averaged triplicate scans, but have now provided the average standard deviations for all species in the figure legend and mentioned their significant separation:

      “Triplicate scans yield average standard deviations of 0.13 (N), 0.17 (N+SL7), 0.16 (N<sub>λ</sub>), and 0.21 (N<sub>λ</sub> +SL7) 10<sup>3</sup> deg cm<sup>2</sup>/dmol, respectively, with non-overlapping confidence bands for the different species, for example, between 215-220 nm.”
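      The statistic described above, a per-wavelength standard deviation across triplicate scans that is then averaged over the spectrum, can be sketched as follows. The three scans are made-up numbers for illustration only, and the function name is hypothetical; the actual spectra and wavelength grid are not reproduced here.

```python
from statistics import mean, stdev

def per_wavelength_sd(scans):
    """Sample standard deviation across replicate scans at each wavelength.

    `scans` is a list of equally long lists, one per replicate scan.
    """
    return [stdev(values) for values in zip(*scans)]

# Hypothetical triplicate scans at four wavelengths (10^3 deg cm^2/dmol)
scans = [
    [-10.1, -12.0, -11.5, -8.2],
    [-10.3, -11.8, -11.6, -8.0],
    [-10.2, -12.2, -11.4, -8.4],
]

sd = per_wavelength_sd(scans)
avg_sd = mean(sd)  # single summary value, as reported in the figure legend
print("per-wavelength SD:", [round(v, 3) for v in sd])
print("average SD:", round(avg_sd, 3))
```

      Comparing the average standard deviation with the separation between two averaged spectra in a given wavelength window gives a quick indication of whether the curves are distinguishable there.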

      Reviewer #3 (Recommendations for the Authors):

      (1) The Discussion reiterates much of the background (mutational tolerance, fuzziness, SLiMs) already covered in the Introduction, diluting focus on the key new findings. The authors should consider shortening and refocusing the discussion on the main contributions in light of existing knowledge of viral assembly.

      In the Introduction we have provided background on intrinsically disordered proteins in general and their mutational tolerance, as well as the concept of fuzzy complexes. The first several paragraphs of the Discussion have a different focus, which is protein binding interfaces between viral proteins (obviously key in fuzzy complexes), specifically their modulation and the remarkable de novo introduction of binding interfaces. We believe this deserves emphasis, since it highlights a novel aspect of fuzziness: the ability of the mutant spectrum of RNA viruses to encode a range of assembly stabilities and architectures.

      To reduce redundancy between the end of the Introduction and the beginning of the Discussion, we have shortened the last paragraph of the Introduction and removed its preview of the conclusions, as described in the response to the next comment of the Reviewer (see below).

      Unfortunately, the length of the Discussion is dictated in part also by the need to discuss methodological aspects, among them the limitations of VLP assays, and the redox state of the cysteine in the LRS mutants, which were important points recommended by other suggestions of the Reviewers. Similarly, we believe the discussion of other potential functions of Omicron N-arm mutations is warranted, as well as the background of the R203K/G204R double mutation that has attracted significant attention in the field due to its effects on phosphorylation and expression of truncated N species that also form RNPs. Our goal was to integrate the results by us and other laboratories regarding specific mutation effects into a comprehensive picture of molecular evolution of N, which we believe the framework of fuzzy complexes can provide.

      (2) The Abstract and early Introduction set a broad stage (IDPs, fuzziness), but don't explicitly state the concrete hypotheses that the experiments test. Please add 2-3 sentences in the Introduction that enumerate testable hypotheses, e.g.:

      (a) P13L creates a new N-arm interface that increases RNP stability.

      (b) G214C/G215C strengthens LRS oligomerization to stabilize higher-order N assemblies.

      We agree the introduction can be improved.  However, it seems to us that it cannot be neatly framed in the hypothesis – answer dichotomy, without losing a lot of nuances and without requiring an even longer and more detailed introduction.

      One of the main questions is to test whether the framework of fuzzy complexes can be applied to understand molecular evolution of N, and we feel the introduction is already flowing well towards this:

      “… In fuzzy complexes the total binding energy is distributed into multiple distinct ultra-weak interaction sites (Olsen et al., 2017). Similar to individual RNA virus proteins with loose or absent structure, maintaining disorder and a spatial distribution of low-energy interactions in the protein complexes may increase the tolerance for mutations and improve evolvability of protein complexes.

      The unprecedented worldwide sequencing effort of SARS-CoV-2 genomes during its rapid evolution in humans provides a unique opportunity to examine these concepts. ...”

      To bring this to a more concrete set of questions in the end, we have shortened and rewritten the last paragraph in the Introduction:

      “To examine how architecture and energetics of RNP assemblies can be impacted by N-protein mutations we study a panel of N-proteins derived from ancestral Wuhan-Hu-1 and different VOCs, including Alpha, Delta, Lambda, and Omicron (see Table 1), in biophysical experiments, VLP assays, and mutant virus. Specifically, we ask how the RNP size distribution and life-time is modulated by: (1) the novel binding interface created by the P13L mutation of Omicron; (2) enhancements of other weak self-association interfaces through G215C of Delta and G214C of Lambda; (3) the ubiquitous R203K/G204R double mutation of Alpha, Lambda, and Omicron.  We also test whether the P13L mutation improves viral fitness, similar to G215C and R203K/G204R. The results are discussed in the framework of fuzzy complexes and molecular evolution of N in the course of viral adaptation to the human host. Understanding the salient features of the binding interfaces in viral assembly and their evolution expands our foundation for the design of therapeutics such as assembly inhibitors.”

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):  

      From my reading, this study aimed to achieve two things:  

      (1) A neurally-informed account of how Pieron's and Fechner's laws can apply in concert at distinct processing levels.  

      (2) A comprehensive map in time and space of all neural events intervening between stimulus and response in an immediately-reported perceptual decision.  

      I believe that the authors achieved the first point, mainly owing to a clever contrast comparison paradigm, but with good help also from a new topographic parsing algorithm they created. With this, they found that the time intervening between an early initial sensory evoked potential and an "N2" type process associated with launching the decision process varies inversely with contrast according to Pieron's law. Meanwhile, the interval from that second event up to a neural event peaking just before response increases with contrast, fitting Fechner's law, and a very nice finding is that a diffusion model whose drift rates are scaled by Fechner's law, fit to RT, predicts the observed proportion of correct responses very well. These are all strengths of the study.   

      We thank the reviewer for their comments that added context to the events we detected in relation to previous findings. We also believe that the change in the HMP algorithm suggested by the reviewer improved the precision of our analyses and the manuscript. We respond to the reviewer’s specific comments below.

      (1) The second, generally stated aim above is, in the opinion of this reviewer, unconvincing and ill-defined. Presumably, the full sequence of neural events is massively task-dependent, and surely it is more in number than just three. Even the sensory evoked potential typically observed for average ERPs, even for passive viewing, would include a series of 3 or more components - C1, P1, N1, etc. So are some events being missed? Perhaps the authors are identifying key events that impressively demarcate Pieron- and Fechner-adherent sections of the RT, but they might want to temper the claim that they are finding ALL events. In addition, the propensity for topographic parsing algorithms to potentially lump together distinct processes that partially co-evolve should be acknowledged.  

      We agree with the reviewer that the topographical solutions found by HMP will be dependent on the task and the quality and type of data. We address this point in the last section of the discussion (see also response to R3.5). We would also like to add that the events detected by HMP are, by construction, those that contribute to the RT and not necessarily all ERPs elicited by a stimulus.

      In addition to the new last section of the discussion we also make these points clear in the revised manuscript at the discussion start: 

      “By modeling the recorded single-trial EEG signal between stimulus onset and response as a sequence of multivariate events with varying by-trial peak times, we  aimed to detect recurrent events that contribute to the duration of the reaction time in the present perceptual decision-making task”.

      Regarding the typical visual ERPs, in response to this comment but also comments R1.2, R1.3 and R2.1, we aimed for a more precise description of the topographies and thus reduced the width of the HMP expected events to 25ms. This ensures that we do not miss events shorter than the initial expectation of 50ms (see Appendix B of Weindel et al., 2024 and also response to R1.3). This new estimation provides evidence for at least two of the visual ERPs that, based on their timings and topographies (in relation with the spatial frequency of the stimulus), we interpret as the N40 and the P100 (see response to R1.5 for the justification of this categorization). We provide a description and justification of the interpretations in the Results section “Five trial-recurrent sequential events occur in the EEG during decisions” and the discussion section “Visual encoding time”.

      (2) To take a salient example, the last neural event seems to blend the centroparietal positivity with a more frontal midline negativity, some of which would capture the CNV and some motor-execution related components that are more tightly time-locked to, of course, the response. If the authors plotted the traditional single-electrode ERP at the frontal focus and centroparietal focus separately, they are likely to see very different dynamics and contrast- and SAT-dependency. What does this mean for the validity of the multivariate method? If two or more components are being lumped into one neural event, wouldn't it mean that properties of one (e.g., frontal burstiness at response) are being misattributed to the other (centroparietal signal that also peaks but less sharply at response)?

      Using the new HMP parameterization described above, we show that the reviewer's intuition was correct. With an expected pattern duration of 25ms, the last event in the original manuscript splits into two events. The before-last event, now referred to as the lateralized readiness potential (LRP), presents a strong lateralization (Figure 3) with an increased negativity over the motor cortex contralateral to the right hand. The effect of contrast is mostly on the last event, which we interpret as the CPP (Figure 5). Despite the improved precision of the topographies of the identified events, it should be noted that some components will overlap. If the LRP is generated when a certain amount of evidence is accumulated (e.g. when the CPP crosses a certain value), then a time-based topography will necessarily include that CPP activity in addition to the lateralized potential. We discuss this in the section “Motor execution” of the discussion:

      “Adding the abrupt onset of this potential, we believe that this event is the start of motor execution, engaged after a certain amount of evidence. The evidence for this interpretation is manifest in the fact that the event's topography shares some activity with the CPP event that follows, an expected result if the LRP is triggered at a certain amount of evidence, indexed by the CPP”.

      (3) Also related to the method, why must the neural events all be 50 ms wide, and what happens if that is changed? Is it realistic that these neural events would be the same duration on every trial, even if their duration was a free parameter? This might be reasonable for sensory and motor components, but unlikely for cognitive.  

      The HMP method is sensitive to the event's duration, as shown in the manuscript about the method (Appendix B of Weindel et al., 2024). Nevertheless, as long as the topography in the real data is longer than the expected one, it should not be missed (the same goes for by-trial variations in the event width). For this reason, in the revision we halved the expected event width of 50ms (introduced by the original HsMM-MVPA paper by Anderson and colleagues). This new estimation with 25ms is thus much less likely to miss events, as evidenced by the new visual and motor events. In the revised manuscript this is addressed at the start of the Results section:

      “Contrary to previous applications (Anderson et al.,2016; Berberyan et al., 2021; Zhang et al., 2018; Krause et al., 2024) we assumed that the multivariate pattern was represented by a 25ms half-sine as our previous research showed that a shorter expected pattern width increases the likelihood of detecting cognitive events (see Appendix B of Weindel et al., 2024)”.

      Regarding the event width as a free parameter this is both technically and statistically difficult to implement as the amount of computing capacity, flexibility and trade-offs among the HMP parameters would, given the current implementation, render the model unfit for most computers and statistically unidentifiable.

      (4) In general, I wonder about the analytic advantage of the parsing method - the paradigm itself is so well-designed that the story may be clear from standard average event-related potential analysis, and this might sidestep the doubts around whether the algorithm is correctly parsing all neural events.  

      Average ERP analysis cannot differentiate between an effect of an experimental factor on the amplitude vs. on the timing of the underlying components (Luck, 2005). Furthermore, the overlap of components across trials blurs the distinction between them. For both reasons we would not be able to reach the same level of certainty and precision using ERP analyses. In addition, the relatively low number of trials per experimental cell (contrast level X SAT X participant = 6 trials) makes the analyses hard to perform on ERPs, which typically require more trials per condition. From the reviewer’s comment we understand that this point was not clear. We therefore discuss this in the revision, Section “Functional interpretation of the events” of the results:

      “Nevertheless identifying neural dynamics on these ERPs centered on stimulus is complicated by the time variation of the underlying single-trial events (see probabilities displayed in Figure 3 for an illustration and Burle et al., 2008, for a discussion). The likely impact of contrast on both amplitude and time on the underlying single-trial event does not allow one to interpret the average ERP traces as showing an effect in one or the other dimension without strong assumptions (Luck, 2005)”.
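      The amplitude/timing ambiguity described in this response can be illustrated with a small simulation. The sketch below is not taken from the manuscript: the function names, the 25-sample half-sine bump and the jitter ranges are hypothetical choices made only for illustration. Every simulated trial contains a bump of identical amplitude, and only the spread of the single-trial latencies differs between the two conditions.

```python
import math

def single_trial(latency, n_samples=300, width=25, amplitude=1.0):
    """One simulated trial: a half-sine bump of fixed amplitude starting at `latency` (in samples)."""
    trace = [0.0] * n_samples
    for i in range(width):
        if 0 <= latency + i < n_samples:
            trace[latency + i] = amplitude * math.sin(math.pi * (i + 0.5) / width)
    return trace

def average_erp(latencies):
    """Average the simulated trials sample by sample, as in a stimulus-locked ERP."""
    trials = [single_trial(lat) for lat in latencies]
    return [sum(col) / len(trials) for col in zip(*trials)]

# Same single-trial amplitude in both conditions; only the latency spread differs.
narrow = average_erp([100 + j for j in range(-5, 6)])      # latencies within +/- 5 samples
wide = average_erp([100 + j for j in range(-40, 41, 8)])   # latencies within +/- 40 samples

print(f"peak of the average, low jitter:  {max(narrow):.2f}")
print(f"peak of the average, high jitter: {max(wide):.2f}")
```

      In this toy setting the averaged trace peaks lower when latencies are more variable, even though no single-trial amplitude changed, which is why an amplitude difference in an average ERP cannot by itself be attributed to amplitude rather than timing.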

      (5) In particular, would the authors consider plotting CPP waveforms in the traditional way, across contrast levels? The elegant design is such that the C1 component (which has similar topography) will show up negative and early, giving way to the CPP, and these two components will show opposite amplitude variations (not just temporal intervals as is this paper's main focus), because the brighter the two gratings, the stronger the aggregate early sensory response but the weaker the decision evidence due to Fechner. I believe this would provide a simple, helpful corroborating analysis to back up the main functional interpretation in the paper.  

      We agree with the suggestion and have introduced the representation on top of Figure 5 for sets of three electrodes in the occipital, posterior and frontal regions. The new panels clearly show an inversion of the contrast effect dependent on the time and locus of the electrodes. We discuss this in Section “Functional interpretation of the events” of the results:

      “This representation shows that there is an inversion of the contrast effect with higher contrasts having a higher amplitude on the electrodes associated with visual potentials in the first couple of deciseconds (left panel of Figure 5A) while parietal and frontal electrodes show a higher amplitude for lower contrasts in later portions of the ERPs (middle and right panel of Figure 5A)”.

      To us, this crucially shows that we cannot achieve the same decomposition using traditional ERP analyses. In these plots it appears that while, as described by the reviewer, there is an inversion, the timing and amplitude of the changes due to contrast can hardly be interpreted.

      (6) The first component is picking up on the C1 component (which is negative for these stimulus locations), not a "P100". Please consult any visual evoked potential study (e.g., Luck, Hillyard, etc). It is unexpected that this does not vary in latency with contrast - see, for example, Gebodh et al (2017, Brain Topography) - and there is little discussion of this. Could it be that nonlinear trends were not correctly tested for?

      We disagree with the reviewer on the interpretation of the ERP. The timing of the detected component is later than the one usually associated with a C1. Furthermore, the central display does not create optimal conditions to detect a C1.

      We do agree that the topography can cause confusion, but we believe that this is due to the spatial frequency of the stimulus, which generates a high posterior positivity (see references in the following extract). The new HMP solution now also shows an effect of contrast on the P100 latencies; we believe this is due to the increased precision in the time location of the component. We discuss this in the “Visual encoding time” section of the discussion:

      “The following event, the P100, is expressed around 70ms after the N40, its topography is congruent with reports for stimuli with low spatial frequencies as used in the current study (Kenemans et al., 2002, 2000; Proverbio et al., 1996). The timing of this P100 component is changed by the contrast of the stimulus in the direction expected by the Piéron law (Figure 4A)”. 

      (7) There is very little analysis or discussion of the second stage linked to attention orientation - what would the role of attention orientation be in this task? Is it spatial attention directed to the higher contrast grating (and if so, should it lateralise accordingly?), or is it more of an alerting function the authors have in mind here?  

      We agree that we were not specific enough on the interpretation of this attention stage. We now discuss our hypothesis in the section “Attention orientation” of the discussion:  

      “We do however observe an asymmetry in the topographical map Figure 3. This asymmetry might point to an attentional bias with participants (or at least some participants) allocating attention to one side over the other in the same way as the N2pc component (Luck and Hillyard, 1994, Luck et al., 1997). Based on this collection of observations, we conclude that this third event represents an attention orientation process. In line with the finding of Philiastides et al. (2006), this attention orientation event might also relate to the allocation of resources. Other designs varying the expected cognitive load or spatial attention could help in further interpreting the functional role of this third event”.

      We would like to add that it is unlikely that the asymmetry we mention in the discussion stems from the redirection towards higher contrast, as the experimental design balanced the side of presentation. We therefore believe that this is a behavioral bias rather than a bias toward the highest contrast stimulus as suggested by the reviewer. We hope that, while more could be tested and discussed, this discussion is sufficient given the current manuscript's goal.

      Reviewer #2 (Public review):  

      Summary:  

      The authors decomposed response times into component processes and manipulated the duration of these processes in opposing directions by varying contrast, and overall by manipulating speed-accuracy tradeoffs. They identify different processes and their durations by identifying neural states in time and validate their functional significance by showing that their properties vary selectively as expected with the predicted effects of the contrast manipulation. They identify 3 processes: stimulus encoding, attention orienting, and decision. These map onto classical event-related potentials. The decision-making component matched the CPP, and its properties varied with contrast and predicted decision-accuracy, while also exhibiting a burst not characteristic of evidence accumulation.  

      Strengths:  

      The design of the experiment is remarkable and offers crucial insights. The analysis techniques are beyond state-of-the-art, and the analyses are well motivated and offer clear insights.  

      Weaknesses:  

      It is not clear to me that the results confirm that there are only 3 processes, since e.g., motor preparation and execution were not captured. While the authors discuss this, this is a clear weakness of the approach, as other components may also have been missed. It is also unclear to what extent topographies map onto processes, since, e.g., different combinations of sources can lead to the same scalp topography.  

      We thank the reviewer for their kind words and for the attention they brought to the question of the missing motor preparation event. In light of this comment (and also R1.1, R3.3) the revised manuscript uses a finer-grained approach for the multivariate event detection. This more precise estimation comes from the use of a shorter expected pattern, in which the initial expectation of a 50ms half-sine was halved, therefore ensuring that we do not miss events shorter than the initial expectation (see Appendix B of Weindel et al., 2024 and also response to R1.3). In the new solution the motor component that the reviewer expected is found, as evidenced by the topography of the event, its lateralization and a time-to-response congruent with a response execution event. This is now described in the section “Motor execution” of the revised manuscript:

      “The before last event, identified as the LRP, shows a strong hemispheric asymmetry congruent with a right hand response. The peak of this event is approximately 100 ms before the response which is congruent with reports that the LRP peaks at the onset of electromyographical activity in the effector muscle (Burle et al., 2004), typically happening 100ms before the response in such decision-making tasks (Weindel et al., 2021). Furthermore, while its peak time is dependent on contrast, its expression in the EEG is less clearly related to the contrast manipulation than the following CPP event”.

      Reviewer #3 (Public review):  

      Summary:  

      In this manuscript, the authors examine the processing stages involved in perceptual decision-making using a new approach to analysing EEG data, combined with a critical stimulus manipulation. This new EEG analysis method enables single-trial estimates of the timing and amplitude of transient changes in EEG time-series, recurrent across trials in a behavioural task. The authors find evidence for three events between stimulus onset and the response in a two-spatial-interval visual discrimination task. By analysing the timing and amplitude of these events in relation to behaviour and the stimulus manipulation, the authors interpret these events as related to separable processing stages for stimulus encoding, attention orientation, and decision (deliberation). This is largely consistent with previous findings from both event-related potentials (across trials) and single-trial estimates using decoding techniques and neural network approaches.  

      Strengths:  

      This work is not only important for the conceptual advance, but also in promoting this new analysis technique, which will likely prove useful in future research. For the broader picture, this work is an excellent example of the utility of neural measures for mental chronometry.  

      We appreciate the very positive review and thank the reviewer for pointing out important weaknesses in our original manuscript and also providing resources to address them in the recommendations to authors. Below we comment on each identified weakness and how we addressed them.   

      Weaknesses:  

      (1) The manuscript would benefit from some conceptual clarifications, which are important for readers to understand this manuscript as a stand-alone work. This includes clearer definitions of Piéron's and Fechner's laws, and a fuller description of the EEG analysis technique.

      We agree that the descriptions of both laws were insufficient; we therefore added the following text in the last paragraph of the introduction:

      “Piéron’s law predicts that the time to perceive the two stimuli (and thus the choice situation) should follow a negative power law with the stimulus intensity (Figure 1, green curve). In contradistinction, Fechner’s law states that the perceived difference between the two patches follows the logarithm of the absolute contrast of the two patches (Figure 1, yellow curve). As the task of our participants is to judge the contrast difference, Piéron’s law should predict the time at which the comparison starts (i.e. the stimuli become perceptible), while Fechner’s law should implement the comparison, and thus decision, difficulty”.
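      To make the opposing predictions of the two laws concrete, here is a minimal sketch. It assumes the textbook forms of the laws (Piéron: a decreasing power function of intensity plus a constant; Fechner: a difference of log contrasts for the two patches); the function names and all parameter values are hypothetical and are not the ones fitted in the manuscript.

```python
import math

def pieron_time(intensity, alpha=0.1, beta=0.5, gamma=0.05):
    """Pieron's law (assumed form): alpha * I**(-beta) + gamma, decreasing with intensity."""
    return alpha * intensity ** (-beta) + gamma

def fechner_difference(contrast_low, contrast_high, k=1.0):
    """Fechner's law (assumed form): perceived difference as a difference of log contrasts."""
    return k * (math.log(contrast_high) - math.log(contrast_low))

delta = 0.05  # hypothetical constant physical contrast difference between the two patches
for base in (0.1, 0.3, 0.6, 0.9):
    t = pieron_time(base)
    d = fechner_difference(base, base + delta)
    print(f"contrast={base:.1f}  pieron_time={t:.3f}  perceived_difference={d:.3f}")
```

      Under these assumptions, raising the overall contrast shortens the Piéron time while shrinking the perceived difference between the patches, so the same manipulation should speed up encoding and make the comparison harder.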

      Regarding the EEG analysis technique, we added a few elements at the start of the Results section:

      “The hidden multivariate pattern model (HMP) implemented assumed that a task-related multivariate pattern event is represented by a half-sine whose timing varies from trial to trial based on a gamma distribution with a shape parameter of 2 and a scale, controlling the average latency of the event, free-to-vary per event (Weindel et al., 2024)”.
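      The two assumptions named in this passage, a half-sine pattern and gamma-distributed timing with a shape of 2, can be sketched as follows. This is an illustration only and not the HMP implementation: the sampling rate of 1000 Hz, the scale of 40 ms, and the function names are hypothetical. The one property it demonstrates is that, with the shape fixed at 2, the mean of the gamma distribution equals twice the scale, which is why the scale controls the average latency.

```python
import math
import random

def half_sine_template(width_ms=25.0, sfreq_hz=1000.0):
    """Half-sine of the given width sampled at sfreq_hz (sampling rate is an assumption)."""
    n = max(int(round(width_ms * sfreq_hz / 1000.0)), 1)
    return [math.sin(math.pi * (i + 0.5) / n) for i in range(n)]

def sample_latencies(n_trials, scale_ms, shape=2.0, seed=0):
    """Draw by-trial latencies from a gamma distribution with fixed shape and free scale."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, scale_ms) for _ in range(n_trials)]

template = half_sine_template()                              # 25 samples at 1000 Hz
latencies = sample_latencies(n_trials=5000, scale_ms=40.0)   # hypothetical scale
mean_latency = sum(latencies) / len(latencies)

print(f"template length: {len(template)} samples")
print(f"mean simulated latency: {mean_latency:.1f} ms (theoretical mean = shape * scale)")
```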

      We also made the technique clearer at the start of the discussion:

      “By modeling the recorded single-trial EEG signal between stimulus onset and response as a sequence of multivariate events with varying by-trial peak times, we aimed to detect recurrent events that contribute to the duration of the reaction time in the present perceptual decision-making task. In addition to the number of events, using this hidden multivariate pattern approach (Weindel et al., 2024) we estimated the trial-by-trial probability of each event’s peak, therefore accessing at which time sample each event was the most likely to occur”.

      Additionally, we added a proper description in the method section (see the new first paragraph of the “Hidden multivariate pattern” subsection). 

      (2) The manuscript, broadly, but the introduction especially, may be improved by clearly delineating the multiple aims of this project: examining the processes for decision-making, obtaining single-trial estimates of meaningful EEG-events, and whether central parietal positivity reflects ramping activity or steps averaged across trials.

      For the sake of clarity we removed the question of the ramping activity vs steps in the introduction and focused on the processes in decision-making and their single-trial measurement as this is the main topic of the paper. Furthermore the references provided by the reviewer allowed us to write a more comprehensive review of previous studies and how the current study is in line with those. These changes are mainly manifested in these new sentences:

      “As an example Philiastides et al. (2006) used a classifier on the EEG activity of several conditions to show that the strength of an early EEG component was proportional to the strength of the stimulus while a later component was related to decision difficulty and behavioral performance (see also Salvador et al., 2022; Philiastides and Sajda, 2006). Furthermore the authors interpreted that a third EEG component was indicative of the resource allocated to the upcoming decision given the perceived decision difficulty. In their study, they showed that it is possible to use single-trial information to separate cognitive processes within decision-making. Nevertheless, their method requires a decoding approach, which requires separate classifiers for each component of interest and restrains the detection of the components to those with decodable discriminating features (e.g. stimuli with strong neural generators such as face stimuli, see Philiastides et al., 2006)”.

      (3) A fuller discussion of the limitations of the work, in particular, the absence of motor contributions to reaction time, would also be appreciated. 

      As laid out in the responses to comments R1.1 and R2, the new estimates now include evidence for a motor preparation component. We discuss this in the new “motor execution” paragraph in the discussion section. Additionally, we discuss the limitations of the study and the method in the last two paragraphs of the discussion (in the new Section “Generalization and limitation”).

      (4) At times, the novelty of the work is perhaps overstated. Rather, readers may appreciate a more comprehensive discussion of the distinctions between the current work and previous techniques to gauge single-trial estimates of decision-related activity, as well as previous findings concerning distinct processing stages in decision-making. Moreover, a discussion of how the events described in this study might generalise to different decision-making tasks in different contexts (for example, in auditory perception, or even value-based decision-making) would also be appreciated.  

      We agree that the original text could be read as overstating. In addition to the changes linked to R3.2 we also now discuss the link with the previous studies in the before-last paragraph of the discussion before the conclusion in the new “Generalization and limitations” section:

      “The present study showed what cognitive processes are contributing to the reaction time and estimated single-trial times of these processes for this specific perceptual decision-making task. The identified processes and topographies ought to be dependent on the task and even the stimuli (e.g. sensory events will change with the sensory modality). More complex designs might generate a higher number of cognitive processes (e.g. memory retrieval from a cue, Anderson et al., 2016) and so could more natural stimuli which might trigger other processes in the EEG (e.g. appraisal vs. choice as shown by Frömer et al., 2024). Nevertheless, the observation of early sensory vs. late decision EEG components is likely to generalize across many stimuli and tasks as it has been observed in other designs and methods (Philiastides et al., 2006; Salvador et al., 2022). To these studies we add that we can evaluate the trial-level contribution, as already done for specific processes (e.g. Si et al., 2020; Sturm et al., 2016), for the collection of events detected in the current study”.

      Reviewing Editor Comments:  

      As you will see, all three reviewers agree that the paper makes a valuable contribution and has many strengths. You will also see that they have provided a range of constructive comments highlighting potential issues with the interpretation of the outcomes of your signal decomposition method. In particular, all three reviewers point out that your results do not identify separate motor preparation signals, which we know must be operating on this type of task. The reviewers suggest further discussion of this issue and the potential limitations of your analysis approach, as well as suggesting some additional analyses that could be run to explore this further. While making these changes would undoubtedly enhance the paper and the final public reviews, I should note that my sense is that they are unlikely to change the reviewers' ratings of the significance of the findings and the strength of evidence in the final eLife assessment.

      Reviewer #1 (Recommendations for the authors):  

      (1) Abstract: "choice onset" is ill-defined and not the label most would give the start of the RT interval. Do you mean stimulus onset?  

      We replaced "choice onset" with "stimulus onset" in the abstract.

      (2) Similarly "choice elements" in the introduction seem to refer to sensory attributes/objects being decided about?  

      We replaced "choice-elements" with "choice-relevant features of the stimuli"

      (3) "how the RT emerges from these putative components" - it would be helpful to specify more what level of answer you're looking for, as one could simply answer "when they're done."  

      We replaced this with "how the variability in RTs emerges from these putative components".

      (4) Line 61-62: I'm not sure this is a fully correct characterisation of Frömer et al. It was not similar in invoking a step function - it did not invoke any particular mechanism or function, and in that respect does not compare well to Latimer et al. Also, I believe it was the overlap of stimulus-locked components, not response-locked, that they argued could falsely generate accumulator-like buildup in the response-locked ERP.  

      We indeed wrongly described Frömer et al. The sentence is now "In human EEG data, the classical observation of a slowly evolving centro-parietal positivity, scaling with evidence accumulation, was suggested to result from the overlap of time-varying stimulus-related activity in the response-locked event related potential"

      (5) Line 78: Should this be single-trial *latency*?  

      This referred to location in time but we agree that the term is confusing and thus replaced it with latencies.

      (6) The caption of Figure 1 should state what is meant by the y-axis "time"  

      We added the sentence "The y-axis refers to the time predicted by each law given a contrast value (x-axis) and the chosen set of parameters." in the caption of Figure 1.

      (7) Line 107: Is this the correct description of Fechner's law? If the perceived difference follows the log of the physical difference, then a constant physical difference should mean a constant perceived difference. Perhaps a typo here.  

      This was indeed a typo; we replaced the corresponding part of the sentence with "the perceived difference between the two patches follows the logarithm of the absolute contrast of the two patches".

      (8) Line 128: By scale, do you mean magnitude/amplitude?  

      No, this refers to the parameter of a gamma distribution. To clarify we edited the sentence:  "based on a gamma distribution with a shape parameter of 2 and a scale parameter, controlling the average latency of the event, free-to-vary per event"

      (9) The caption of Figure 3 is insufficient to make sense of the top panel. What does the inter-event interval mean, and why is it important to show? What is the "response" event?  

      We agree that the top panel was insufficiently described. To keep the paper short, and because these panels provided relatively little information, we replaced them with a figure showing only the average topographies as well as the asymmetry tests for each event.

      (10) Figure 4: caption should say what the top vs bottom row represents (presumably, accuracy vs speed emphasis?), and what the individual dots represent, given the caption says these are "trial and participant averaged". A legend should be provided for the rightmost panels.  

      We agree and therefore edited Figure 4. The beginning of the caption mentioned by the reviewer now reads: “A) The panels represent the average duration between events for each contrast level, averaged across participants and trials (stimulus and response respectively as first and last events) for accuracy (top) and speed instructions (bottom).”. Additionally we added legends for the SAT instructions and the model fits.

      (11) Line 189: argued for a decision-making role of what?  

      Stafford and Gurney (2004) proposed that Pieron’s law could reflect a non-linear transformation from sensory input to action outcomes, which they argued reflected a response mechanism. We (Van Maanen et al., 2012) specified this result by showing that a Bayesian Observer Model in which evidence for two alternative options was accumulated following Bayes Rule indeed predicted a power relation between the difference in sensory input of the two alternatives, and mean RT. However, the current data suggest that such an explanation cannot be the full story, as also noted by R3. To clarify this point we replaced the comment by the following sentence:

      “Note that this observation is not necessarily incongruent with theoretical work that argued that Piéron’s law could also be a result of a response selection mechanism (Stafford and Gurney, 2004; Van Maanen et al., 2012; Palmer et al., 2005). It could be that differences in stimulus intensity between the two options also contribute to a Piéron-like relationship in the later intervals, that is convoluted with Fechner’s law (see Donkin and Van Maanen, 2014 for a similar argument). Unfortunately, our data do not allow us to discriminate between a pure logarithmic growth function and one that is mediated by a decreasing power function”.

      (12) Table 2: There is an SAT effect even on the first interval, which is quite remarkable and could be discussed more - does this mean that the C1 component occurs earlier under speed pressure? This would be the first such finding.  

      The original event we qualified as a P100 was sensitive to SAT but the earliest event is now the N40 and isn’t statistically sensitive to speed pressure in this data. We believe that the fact that the P100 is still sensitive to SAT is not a surprise and therefore do not outline it.

      (13) Line 221: "decrease of activation when contrast (and thus difficulty) increases" - is this shown somewhere in the paper?  

      The whole section for this analysis was rewritten (see comment below)

      (14) I find the analysis of Figure 5 interesting, but the interpretation odd. What is found is that the peak of the decision signal aligns with the response, consistent with previous work, but the authors choose to interpret this as the decision signal "occurring as a short-lived burst." Where is the quantitative analysis of its duration across trials? It can at least be visually appraised in the surface plot, and this shows that the signal has a stimulus-locked onset and, apart from the slowest RTs, remains present and for the most part building, until response. What about this is burst-like? A peak is not a burst.  

      This was the residue of a previous version of the paper where an analysis reported that no evidence accumulation trace was found. But after proper simulations this analysis turned out to be false because of a poor statistical test. Thus we removed this paragraph in the revised manuscript and Figure 5 has now been extended to include surface plots for all the events.

      Reviewer #2 (Recommendations for the authors):  

      Overall, I really enjoyed reading this paper. However, in some places the approach is a bit opaque or the results are difficult to follow. As I read the paper, I noted:  

      Did you do a simple DDM, or did you do a collapsing bound for speed?  

      The fitted DDM was an adaptation of the proportional rate diffusion model. We make this clearer at the end of the introduction: "Given that Fechner’s law is expected to capture decision difficulty we connected this law to the classical diffusion decision models by replacing the rate of accumulation with Fechner’s law in the proportional rate diffusion model of Palmer et al.(2005).”

      It is confusing that the order of intervals in the text doesn't match the order in the table. It might be better to say what events the interval is between rather than assuming that the reader reconstructs.  

      We agree and adapted the order in both the text and the table. The table is now also more explicit (e.g. RT instead of S-R)

      Otherwise, I do wonder to what extent the method is able to differentiate processes that yield similar scalp topographies and find it a bit concerning that no motor component was identified.  

      We believe that the new version with the LRP/CPP is a demonstration that the method can handle similar topographies. The method can handle events with close topographies as long as they are separate in time; however, if they are not sequential to one another, the method cannot capture both events. We now discuss this, in relation to the C1/P100 overlap, in the discussion section “Visual encoding time”:

      “Nevertheless this event, seemingly overlapping with the P100 even at the trial level (Figure 5C), cannot be recovered by the method we applied. The fact that the P100 was recovered instead of the C1 could indicate that only the timing of the P100 contributes to the RT (see Section 3 of Weindel et al., 2024)”.

      And we more generally address the question of overlap in the new section “Generalization and limitation”.

      Reviewer #3 (Recommendations for the authors):  

      Major Comments:  

      (1) If we agree on one thing, it is that motor processes contribute to response time. Line 364: "In the case of decision-making, these discrete neural events are visual encoding, attention-orientation, and decision commitment, and their latency make up the reaction time." Does the third event, "decision commitment", capture both central parietal positivity (decision deliberation) and motor components? If so, how can the authors attribute the effects to decision deliberation as opposed to motor preparation?  

      Thanks to the suggestions, including those in the public review, this main problem is now addressed: we capture both a motor component and a decision commitment.

      Line 351 suggests that the third event may contain two components.  

      This was indeed our initial, badly written, hypothesis. Nevertheless the new solution again addresses this problem.

      The time series in Figure 6 shows an additional peak that is not evident in the simulated ramp of Appendix 1.  

      This was probably due to the overlap of both the CPP and the LRP. It is now much clearer that the CPP looks mostly like a ramp while the LRP looks much more like a burst-like/peaked activity. We make this clear in the “Decision event” paragraph of the discussion section:

      “Regarding the build-up of this component, the CPP is seen as originating from single-trial ramping EEG activities but other work (Latimer et al., 2015; Zoltowski et al., 2019) have found support for a discrete event at the trial-level. The ERPs on the trial-by-trial centered event in Figure 5 show support for both accounts. As outlined above, the LRP is indeed a short burst-like activity but the build-up of the CPP between high vs low contrast diverges much earlier than its peak”.

      Previous analyses (Weindel et al., 2024) found motor-related activity from central parietal topographies close to the response by comparing the difference in single-trial events on left- vs right-hand response trials. The authors suggest at line 315 that the use of only the right hand for responding prevented them from identifying a motor event.  

      The use of only the right hand should have made the event more identifiable because the topography would be consistent across trials (rather than inverting on left vs right hand response trials).  

      The reviewer is correct: in the original manuscript we did not test for lateralization, but the reviewer’s comment gave us the idea to explicitly test for the asymmetry (Figure 3). This test now clearly shows what would be expected for a motor event, with a strong negativity over the left motor cortex.

      The authors state on line 422 that the EEG data were truncated at the time of the response.  

      Could this have prevented the authors from identifying a motor event that might overlap with the timing of the response?  

      We thank the reviewer for this suggestion. This would have been a possibility but the problem is that adding samples after the response also adds the post-response processes (error monitoring, button release, stimulus disappearance, etc.). While increasing the samples after the response is definitely something that we need to inspect, we think that the separation we achieved in this revision doesn’t call for this supplementary analysis.

      The largest effects of contrast on the third event amplitude appear around the peak as opposed to the ramp. If the peak is caused by the motor component, how does this affect the conclusions that this third event shows a decision-deliberation parietal processes as opposed to a motor process (a number of studies suggest a causal role for motor processes in decision-making e.g. Purcell et al., 2010 Psych Rev; Jun et al., 2021 Nat Neuro; Donner et al., 2009 Curr Bio).  

      This result has now changed: the peak no longer appears to capture most of the effect. We do, however, think that there might be some link to theories of motor-related accumulation. We therefore added this to the discussion in the Motor execution section:

      “Based on all these observations, it is therefore very likely that this LRP event signs the first passage of a two-step decision process as suggested by recent decision-making models (Servant et al., 2021; Verdonck et al., 2021; Balsdon et al., 2023)”.

      I would suggest further investigation into the motor component (perhaps by extending the time window of analysed EEG to a few hundred ms after the response) and at least some discussion of the potential contribution of motor processes, in relation to the previous literature.  

      We believe that the absence of a motor component is sufficiently addressed in the revised manuscript and in the responses to the other comments.    

      (2) What do we learn from this work? Readers would appreciate more attention to previous findings and a clearer outline of how this work differs. Two points stand out, outlined below. I believe the authors can address these potential complaints in the introduction and discussion, and perhaps provide some clarification in the presentation of the results.  

      In the introduction, the authors state that "... to date, no study has been able to provide single-trial evidence of multiple EEG components involved in decision-making..." (line 64). Many readers would disagree with this. For example, Philiastides, Ratcliff, & Sadja (2006) use a single-trial analysis to unravel early and late EEG components relating to decision difficulty and accuracy (across different perceptual decisions), which could be related to the components in the current work. Other, network-based single-trial EEG analyses (e.g., Si et al., 2020, NeuroImage, Sturn et al., 2016 J Neurosci Methods) could also be related to the current component approach. Yet other approaches have used inverse encoding models to examine EEG components related to separable decision processes within trials (e.g., Salvador et al., 2022, Nat Comms). The results of the current work are consistent with this previous work - the two components from Philiastides et al., 2006 can be mapped onto the components in the current work, and Salvador et al., 2022 also uncover stimulus- and decision-deliberation related components.  

      We completely agree with the reviewer that the link to previous work was insufficient. We now include all references that the reviewer points out both in the introduction (see response R3.2) and in the discussion (see response R3.4). We wish to thank the reviewer for bringing these papers to our attention as they are important for the manuscript.

      The authors relate their components to ERPs. This prompts the question of whether we would get the same results with ERP analyses (and, on the whole, the results of the current work are consistent with conclusions based on ERP analyses, with the exception of the missing motor component). It's nice that this analysis is single-trial, but many of the follow-up analyses are based on grouping by condition anyway. Even the single-trial analysis presented in Figure 4 could be obtained by median splits (given the hypotheses propose opposite directions of effects, except for the linear model). 

      We do not agree with the reviewer, in the sense that classical ERP analyses would require many more data points. The strength of the method here is that it uses the information shared across all contrast levels to model the processing time of a single contrast level (6 trials per participant). Furthermore, as stated in the response to R1.4 and R1.5, the aim of the paper is to estimate the timing of information-processing components, which cannot be achieved with classical ERPs without strong, and likely false, assumptions.

      Medium Comments:  

      (1) The presentation of Piéron's law for the behavioural analysis is confusing. First, both laws should be clearly defined for readers who may be unfamiliar with this work. I found the proposal that Piéron's law predicts decreasing RT for increasing pedestal contrast in a contrast discrimination paradigm task surprising, especially given the last author's previous work. For example, Donkin and van Maanen (2014) write "However, the commonality of Piéron's Law across so many paradigms has lead researchers (e.g., Stafford & Gurney, 2004; Van Maanen et al., 2012) to propose that Piéron's Law is unrelated to stimulus scaling, but is a result of the architecture of the response selection (or decision making) process." The pedestal contrast is unrelated to the difficulty of the contrast discrimination task (except for the consideration of Fechner's law). Instead, Piéron's law would apply to the subjective difference in contrast in this task, as opposed to the pedestal contrast. The EEG results are consistent with these intuitions about Piéron's law (or more generally, that contrast is accumulated over time, so a later EEG component for lower pedestal contrast makes sense): pedestal contrast should lead to faster detection, but not necessarily faster discrimination. Perhaps, given the complexity of the manuscript as a whole, the predictions for the behavioural results could be simplified?

      We agree that the initial version was confusing. We now clarified the presentation of Piéron's law at the end of the introduction (see also response to R2).

      Once Fechner's law is applied, decision difficulty increases with increasing contrast, so Piéron's law on the decision-relevant intensity (perceived difference in contrast) would also predict increasing RT with increasing pedestal contrast. It is unlikely that the data are of sufficient resolution to distinguish a log function from a power of a log function, but perhaps the claim on line 189 could be weakened (the EEG results demonstrate Piéron's law for detection, but do not provide evidence against Piéron's law in discrimination decisions).  

      This is an excellent observation; thank you for bringing it to our attention. Indeed, the data support the notion that Piéron’s law is related to detection, but do not rule out that it is also related to decision or discrimination. In earlier work, we (Donkin & Van Maanen, 2014) addressed this question as well, and reached a similar conclusion. After fitting evidence accumulation models to data, we found no linear relationship between drift rates and stimulus difficulty, as would have been the case if Piéron's law could be fully explained by the decision process (as indirectly argued by Stafford & Gurney, 2004; Van Maanen et al., 2012). The fact that we observed evidence for a non-linear relationship between drift rates and stimulus difficulty led us to the same conclusion, that Piéron’s law could be reflected in both discrimination and decision processes. We added the following comment to the discussion about the functional locus of Piéron's law to clarify this point:

      “Note that this observation is not necessarily incongruent with theoretical work that argued that Piéron’s law could also be a result of a response selection mechanism (Stafford and Gurney, 2004; Van Maanen et al., 2012; Palmer et al., 2005). It could be that differences in stimulus intensity between the two options also contribute to a Piéron-like relationship in the later intervals, that is convoluted with Fechner’s law (see Donkin and Van Maanen, 2014, for a similar argument). Unfortunately, our data do not allow us to discriminate between a pure logarithmic growth function and one that is mediated by a decreasing power function”.

      (2) Appendix 1 shows that the event detection of the HMP method will also pick up on ramping activity. The description of the problem in the introduction is that event-like activity could look like ramping when averaged across trials. To address this problem, the authors should simulate events (with some reasonable dispersion in timing such that they look like ramping when averaged) and show that the HMP method would not pull out something that looked like ramping. In other words, the evidence for ramping in this work is not affected by the previously identified confounds.  

      We agree that this demonstration was necessary and thus added the suggested simulation to Appendix 1. As can be seen in the Figure 1 of the appendix, when we simulate a half-sine the average ERP based on the timing of the event looks like a half-sine.

      (3) Some readers may be interested in a fuller discussion of the failure of the Fechner diffusion model in the speed condition.  

      We are unsure which failure the reviewer refers to but assumed it was in relation to the behavioral results and thus added: 

      It is unlikely that neither Piéron’s nor Fechner’s law impacts the RT in the speed condition. Instead, this result is likely due to the composite nature of the RT, where both laws co-exist but cancel each other out due to their opposite predictions.

      Minor Comments:  

      (1) "By-trial" is used throughout. Normally, it is "trial-by-trial" or "single-trial" or "trial-wise".

      We replaced all occurrences of “by-trial” with one of the three suggested terms, where appropriate.

      (2) Line 22: "The sum of the times required for the completion of each of these precessing steps is the reaction time (RT)." The total time required. Processing.  

      Corrected for both.

      (3) Line 26/27: "Despite being an almost two century old problem (von Helmholtz, 2021)." Perhaps the citation with the original year would make this point clearer.  

      We agree and replaced the citation.

      (4) Line 73: "accounted by estimating". Accounted for by estimating.  

      Corrected.

      (5) Line 77 "provides an estimation on the." Of the.  

      Corrected.

      (6) Line 86: "The task of the participants was to answer which of two sinusoidal gratings." The picture looks like Gabor's? Is there a 2d Gaussian filter on top of the grating? Clarify in the methods, too.  

      We incorrectly described the stimuli; they were indeed Gabor patches. This is now corrected both in the main text and the method section.

      (7) Figure 1 legend: "The Fechner diffusion law" Fechner's law or your Fechner diffusion model?  

      Law was incorrect so we changed to model as suggested.

      (8) Line 115: "further allows to connects the..." Allows connecting the.  

      Corrected.

      (9) Line 123: "lower than 100 ms or higher than..." Faster/slower.  

      Corrected.

      (10) Line 131: "To test what law." Which law.?  

      Corrected to model.

      (11) Figure 2 legend: "Left: Mean RT (dot) and average fit (line) over trials and participants for each contrast level used." The fit is over trials and participants? Each dot is? Average trials for each contrast level in each participant?  

      This sentence was corrected to “Mean RT (dot) for each contrast level and averaged predictions of the individual fits (line) with Accuracy (Top) and Speed (Bottom) instructions.”.

      (12) Line 231: "A comprehensive analysis of contrast effect on". The effect of contrast on.  

      This title was changed to “functional interpretation of the events”.

      (13) Line 23: "the three HMP event with". Three HMP events.

      The sentence no longer exists in the revised manuscript.

      (14) Line 270: "Secondly, we computed the Pearson correlation coefficient between the contrast averaged proportion of correct." Pearson is for continuous variables. Proportion correct is not continuous. Use Spearman, Kendall, or compute d'.  

      The reviewer rightly pointed out our error, we corrected this by computing Spearman correlation.

      (15)  Line 377: "trial 𝑛 + 1 was randomly sampled from a uniform distribution between 0.5 and 1.25 seconds." It's just confusing why post-response activity in Figure 5 does look so consistent. Throughout methods: "model was fitted" should be "was fit", and line 448, "were split".  

      We do not have a specific hypothesis of why the post-response activity in the previous Figure 5 was so consistent. Maybe the Gaussian window (same as in other manuscripts with a similar figure, e.g. O’Connell et al. 2012) generated this consistency. We also corrected the errors mentioned in the methods.

      (16) The linear mixed models paragraph is a bit confusing. Can it clearly state which data/ table is being referred to and then explain the model? "The general linear mixed model on proportion of correct responses was performed using a logit link. The linear mixed models were performed on the raw milliseconds scale for the interval durations and on the standardized values for the electrode match." We go directly from proportion correct to raw milliseconds...  

      The confusion was indeed due to the initial inclusion of a general linear mixed model on proportion correct which was removed as it was not very informative. The new revision should be clearer on the linear mixed models (see first sentence of subsection ‘linear mixed models' in the method section).

      (17) A fuller description of the HMP model would be appreciated.  

      We agree that this was necessary and added the description of the HMP model in the corresponding method section “Hidden multivariate pattern” in addition to a more comprehensive presentation of HMP in the first paragraph of the Result and Discussion sections.

      (18) Line 458: "Fechner's law (Fechner, 1860) states that the perceived difference (𝑝) between the two patches follows the logarithm of the difference in physical intensity between..." ratio of physical intensity.  

      Corrected.

      (19) P is defined in equations 2 and 4. I would include the beta in equation 4, like in equation 2, then remove the beta from equations 3 and 5 (makes it more readable). I would also just include the delta in equation 2, state that in this case, c1 = c+delta/2 or whatever.  

      This indeed makes the equation more readable so we applied the suggestions for equations 2, 3, 4 and 5. The delta was not added in equation 2 but instead in the text that follows:

      “Where 𝐶1 = 𝐶0 + 𝛿, again with a modality and individual specific adjustment slope (𝛽).” 
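
      To make the corrected definition concrete, the following is a minimal sketch, assuming the perceived difference takes the form p = β · log(C1 / C0) with C1 = C0 + δ, as the exchange above suggests. The function name, the default β = 1, and the contrast values are illustrative assumptions and are not taken from the manuscript.

```python
import math


def fechner_difference(c0: float, delta: float, beta: float = 1.0) -> float:
    """Perceived difference p = beta * log(C1 / C0), with C1 = C0 + delta.

    c0    : pedestal contrast (hypothetical units, must be > 0)
    delta : physical contrast increment between the two patches
    beta  : individual adjustment slope (assumed value, for illustration)
    """
    c1 = c0 + delta
    return beta * math.log(c1 / c0)


# Hypothetical pedestal contrasts with a fixed increment between patches.
pedestals = [0.1, 0.2, 0.4, 0.8]
perceived = [fechner_difference(c0, delta=0.05) for c0 in pedestals]

for c0, p in zip(pedestals, perceived):
    print(f"pedestal={c0:.1f}  perceived difference={p:.4f}")
```

      Under this assumed form, a fixed increment yields a smaller perceived difference as the pedestal contrast grows, which is the sense in which decision difficulty increases with contrast once Fechner's law is applied.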

      (20) The appendix suggests comparing the amplitudes with those in Figure 3, but the colour bar legend is missing, so the reader can only assume the same scale is used?  

      We added the color bar as it was indeed missing. Note, though, that the previous version displayed the estimation for the simulated data, while this plot in the revised manuscript shows the solution on real data obtained after downsampling the data (and therefore looks for a larger pattern than in the main text). We believe that this representation is more useful given that the solution for the downsampled data is no longer the same as the one in the main text (due to the difference in pattern width).

    1. The pervasiveness of these formats means that our culture uses the style and content of these shows as ways to interpret reality. For example, think about a TV news program that frequently shows heated debates between opposing sides on public policy issues. This style of debate has become a template for handling disagreement to those who consistently watch this type of program.

      This passage explains that when we watch certain media styles over and over, we start using them to understand real life. If a news show always shows loud, heated arguments, viewers may think that’s the “normal” way to handle disagreements. Media formats can quietly shape how people act and communicate.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      We are grateful to the reviewers for their thoughtful and constructive evaluations of our manuscript. Their comments helped us clarify key aspects of the study and strengthen both the presentation and interpretation of our findings. The central goal of this work is to dissect how the opposing activities of GATA4 and CTCF coordinate chromatin topology and transcriptional timing during human cardiomyogenesis. The reviewers’ feedback has allowed us to refine this message and better contextualize our results within the broader framework of chromatin regulation and cardiac development.

      In response to the reviews, in our preliminary revision we have already implemented substantial improvements to the manuscript, including additional analyses, clearer data visualization, and revisions to the text to avoid overinterpretation. These refinements enhance the robustness of our conclusions without altering the overall scope of the study. A small number of additional analyses and experiments are ongoing and will be added to the full revision, as detailed below.

      We believe that the revised manuscript, together with the planned updates, fully addresses the reviewers’ concerns and substantially strengthens the contribution of this work to the field.

      Reviewer 1 – Point 1:

      In the datasets you are examining, what are the relative percentages in each of the four groups relating compartmentalization change to expression change (A→B, expression up; A→B, down; B→A, up; B→A, down)?

      We quantified compartment–expression relationships using Hi-C and bulk RNA-seq from H9 ESCs and CMs. The percentages for each category are shown below and incorporated into updated Figure S2H.

      | Group  | Downregulated in CM | Upregulated in CM |
      |--------|---------------------|-------------------|
      | A-to-A | 11.92%              | 8.44%             |
      | A-to-B | 18.20%              | 2.79%             |
      | B-to-A | 7.96%               | 18.07%            |
      | B-to-B | 14.36%              | 6.44%             |

      A chi-squared test comparing observed vs. expected distributions (based on gene density across bins) confirmed a strong association between compartment dynamics and transcriptional behavior. B-to-A genes are significantly enriched among genes upregulated in CMs, while A-to-B genes are enriched among those downregulated (updated Figure S2H).
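
      The direction of these enrichments can be read off the reported percentages directly. The sketch below is only an illustration of that arithmetic: it computes the down-to-up ratio within each compartment group from the percentages quoted in the response. It is not the chi-squared test described above, which needs gene counts and gene-density-based expectations that are not given here; the variable and function names are hypothetical.

```python
# Percentages copied from the response (share of genes in each category).
percent = {
    "A-to-A": {"down": 11.92, "up": 8.44},
    "A-to-B": {"down": 18.20, "up": 2.79},
    "B-to-A": {"down": 7.96, "up": 18.07},
    "B-to-B": {"down": 14.36, "up": 6.44},
}


def down_to_up_ratio(group: str) -> float:
    """Ratio of CM-downregulated to CM-upregulated share within one group."""
    row = percent[group]
    return row["down"] / row["up"]


ratios = {group: down_to_up_ratio(group) for group in percent}

for group, ratio in ratios.items():
    print(f"{group}: down/up = {ratio:.2f}")
```

      A ratio above 1 means that downregulated genes outnumber upregulated ones in that group; only B-to-A falls below 1, and A-to-B has the largest ratio.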

      We next assessed with GSEA how these gene classes respond to GATA4 and CTCF knockdown. In 2D CMs, GATA4 knockdown reduces expression of CM-upregulated B-to-A genes and increases expression of CM-downregulated A-to-B genes, whereas CTCF knockdown produces the opposite pattern (updated Figure 2F).

      Applying the same analysis to cardioid bulk RNA-seq (updated Figure 4E) revealed the strongest effects in SHF-RV organoids, consistent with monolayer data. In SHF-A organoids, only GATA4 knockdown had a measurable impact on CM-upregulated B-to-A and CM-downregulated A-to-B genes. Because the subsets of CM-downregulated B-to-A and CM-upregulated A-to-B genes were very small and showed no consistent trends, Figure 4 focuses on the two informative categories only. The full classification is provided in Reviewer Figure 1 below.

      (The figure cannot be rendered in this text-only format)

      Reviewer Figure 1. GSEA for CM-upregulated B-to-A and CM-downregulated A-to-B genes. p-values by Adaptive Monte-Carlo Permutation test.

      Reviewer 1 – Point 2

      This phrase in the abstract is imprecise: ‘whereas premature CTCF depletion accelerates yet confounds cardiomyocyte maturation.’


      The abstract has been revised to: “whereas premature CTCF depletion accelerates yet alters cardiomyocyte maturation.” (lines 29-30).

      Reviewer 1 – Point 3

      Regarding this statement: "Disruption of [3D chromatin architecture] has been linked to genetic dilated cardiomyopathy (DCM) caused by lamin A/C mutations8,9, and mutations in chromatin regulators are strongly enriched in de novo congenital heart defects (CHD)10, underscoring their pathogenic relevance11." The first studies to implicate chromatin structural changes in heart disease, including the role of CTCF in that process, were PMID: 28802249, a model of acquired, rather than genetic, disease.

      We added the following sentence to the paragraph introducing CTCF: “Moreover, depletion of CTCF in the adult cardiomyocytes leads to heart failure28,29.” (line 72)

      Reviewer 1 – Point 4

      Can you quantify this statement: ‘the compartment switch coincided with progressive reduction of promoter–gene body interactions’?

      We quantified promoter–gene body contacts by calculating the area under the curve (AUC) of the virtual 4C signal derived from H9 Hi-C data across differentiation. As a result of this analysis we added the following sentence: “Quantitatively, interactions between the TTN promoter and its gene body decreased by ~55% from the pluripotent stage to day 80 cardiomyocytes.” (lines 89-91).
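
      An AUC-based comparison of this kind can be sketched as follows. This is a minimal illustration using trapezoidal integration over a contact profile; the positions and signal values are invented for the example and are not the Hi-C data, and the function names are hypothetical, so the resulting percentage has no bearing on the ~55% figure reported above.

```python
def trapezoid_auc(positions, signal):
    """Area under a contact profile using the trapezoidal rule."""
    if len(positions) != len(signal):
        raise ValueError("positions and signal must have the same length")
    pairs = list(zip(positions, signal))
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pairs, pairs[1:])
    )


def percent_decrease(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical virtual 4C profiles along a gene body (positions in kb).
positions = [0, 10, 20, 30, 40]
signal_early = [4.0, 3.0, 2.0, 2.0, 1.0]  # invented early-stage profile
signal_late = [3.0, 2.0, 1.5, 1.0, 1.0]   # invented late-stage profile

auc_early = trapezoid_auc(positions, signal_early)
auc_late = trapezoid_auc(positions, signal_late)
print(f"AUC early={auc_early}, AUC late={auc_late}")
print(f"decrease = {percent_decrease(auc_early, auc_late):.1f}%")
```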


      Reviewer 1 – Point 5

      Regarding this statement: "six regions became less accessible in CMs, correlating with ChIP-seq signal for the ubiquitous architectural protein CTCF." I don't see 6 ATAC peaks in either TTN trace in Figure 1A.

      We corrected the text as follows: “TTN experienced clear changes in chromatin accessibility during CM differentiation: ATAC-seq identified two CM-specific peaks that correlated with ChIP-seq signal for the cardiac pioneer TF GATA4 at the two promoters, one driving full length titin and the other the shorter cronos isoform. In contrast, two regions became less accessible in CMs, correlating with two of the six ChIP-seq peaks for the ubiquitous architectural protein CTCF” (lines 93-97). We attribute the differences between ChIP-seq and ATAC-seq profiles to methodological sensitivity and/or biological variability between datasets generated in different laboratories and cell batches.

      Reviewer 1 – Point 6

      Western blots need molecular weight markers.

      We edited the relevant panels accordingly (updated Figures 1E and 2B).

      Reviewer 1 – Point 7

      Regarding this statement: "The decrease in CTCF protein levels may explain its selective detachment from TTN during cardiomyogenesis." At face value, these findings suggest the opposite: i.e. that a massive downregulation of CTCF at protein level should affect its binding across the genome, which is not tested and is hard to evaluate between ChIP-seq studies from different groups and from different developmental timeframes.

      We revised the text to avoid implying selective detachment and performed a genome-wide analysis of CTCF occupancy using ENCODE ChIP-seq datasets generated by the same laboratory with matched protocols in hESCs and hESC-derived CMs. This analysis shows that 43.2% of CTCF sites present in ESCs are lost in CMs, whereas only 5.7% are gained, confirming a broad reduction in CTCF binding during differentiation. These results are now included in updated Figure 1B.

      Reviewer 1 – Point 8a

      A couple thoughts on the FISH experiments in Figure 2. A claim of 'impaired B-A transition' would be more convincing if you show, by FISH, that the relative distance of TTN from lamin B increases with differentiation.

      Although prior work from us and others has established that TTN transitions from the nuclear periphery in hESCs to a more internal position during cardiomyogenesis (Poleshko et al. 2017; Bertero et al. 2019a), we are reproducing this trajectory in WTC11 hiPSCs as part of the FISH experiments for the full revision.

      Reviewer 1 – Point 8b

      In the [FISH] images: are you showing a total projection of all z planes? One assumes the quantitation is relative to a 3D reconstruction in which the lamin B signal is restricted to the periphery. Have you shown this?

      Quantification was performed on full 3D reconstructions from Z-stacks, as detailed in the Methods (lines 721-727). While the original submission displayed maximum-intensity projections, updated Figure 2D and Figure S2E now show representative single optical sections, which more clearly highlight the spatial relationship between the TTN locus and the nuclear lamina.

      Reviewer 1 – Point 8c

      Lastly, these data are very interesting and important, provoking reexamination of your interpretation of the results in Figure 1. Figure 1 was interpreted to show that less CTCF binding led to decreased lamina (and thus B compartment) association during development. Figure 2 shows that depleting CTCF does not change association of TTN with lamina.

      Our interpretation is that by day 25 of hiPSC-CM differentiation the TTN locus may have reached its maximal radial repositioning even in control cells, limiting the ability to detect earlier effects of CTCF depletion. To test whether CTCF knockdown accelerates lamina detachment at earlier stages, we are repeating the FISH analysis for the inducible CTCF knockdown line at multiple time points during differentiation.

      Reviewer 1 – Point 9

      A thought about this statement: "Altogether, these results suggest that GATA4 and CTCF function as positive and negative regulators of B-to-A compartment switching, likely acting through global and local chromatin remodeling, respectively." GATA4 induces TTN expression and its knockdown prevents TTN expression-the evidence that GATA4 affects compartmentalization is unclear. By activating the gene, GATA4 may shift TTN to B classification.

      Our current data do not allow us to disentangle whether GATA4-driven transcriptional activation precedes or follows the B-to-A compartment shift. We have therefore removed the mechanistic speculation from this sentence to avoid overinterpretation. Nevertheless, the analyses in updated Figure 2F, discussed in the response to Reviewer 1 - Point 1, show that GATA4 knockdown preferentially reduces expression of CM-upregulated B-to-A genes, while CTCF knockdown has the opposite effect, supporting the conclusion that both factors influence the transcriptional programs associated with B-to-A transitions.

      Reviewer 1 – Point 10

      I'm not sure what I am looking at in Figure 3C. Are those traces integration of interactions over a defined window? "Each [mutant is] clearly different from WT" is not obvious from the presentation. The histograms are plotting AUC of what? Interactions of those peaks with the mutated region? I genuinely appreciate how laborious this experiment must have been and encourage you to explain better what you are showing.

      We revised the main text to avoid overstating the differences (“clearly” “in a similar manner”, line 192) and expanded the legends of updated Figures 3C–D to clarify what is being shown: “(C) 4C-seq in hiPSCs using the promoter-proximal region of TTN as viewpoint. The top panel shows raw interaction profiles. The lower panels plot pairwise differences between conditions to reveal subtle changes. A schematic indicating the 4C viewpoint is included for clarity. Right inset: zoom of the CBS4–5 region. Mean of n = 3 cultures. (D) AUC of the differential 4C-seq signal for defined intervals (panel C). p-values by one-sample t-test against μ = 0.”. We also added a visual cue in updated Figure 3C indicating the 4C viewpoint to facilitate interpretation.
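
      The quantity described in the revised legend (an AUC of the differential 4C-seq signal, tested against μ = 0) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the trapezoidal integration, and the toy numbers are all assumptions. The closed-form p-value applies only to n = 3 replicates (2 degrees of freedom).

```python
import math
from statistics import mean, stdev


def trapezoid_auc(positions, signal):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum(
        (positions[i + 1] - positions[i]) * (signal[i] + signal[i + 1]) / 2.0
        for i in range(len(positions) - 1)
    )


def differential_auc(positions, mutant, wild_type, start, end):
    """AUC of (mutant - wild type) restricted to bins inside [start, end]."""
    kept = [
        (x, m - w)
        for x, m, w in zip(positions, mutant, wild_type)
        if start <= x <= end
    ]
    xs = [x for x, _ in kept]
    diffs = [d for _, d in kept]
    return trapezoid_auc(xs, diffs)


def one_sample_t(values, mu=0.0):
    """t statistic and degrees of freedom for H0: mean(values) == mu."""
    n = len(values)
    t = (mean(values) - mu) / (stdev(values) / math.sqrt(n))
    return t, n - 1


def two_sided_p_df2(t):
    """Exact two-sided p-value for a t statistic with 2 degrees of freedom."""
    return 1.0 - abs(t) / math.sqrt(2.0 + t * t)


# Hypothetical differential AUCs (mutant minus WT), one per culture.
replicate_aucs = [1.0, 2.0, 3.0]
t_stat, dof = one_sample_t(replicate_aucs, mu=0.0)
print(f"t = {t_stat:.3f}, df = {dof}, p = {two_sided_p_df2(t_stat):.3f}")
```

      With only three replicates the test has little power, so a consistent direction across replicates can still yield a non-significant p-value.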

      Reviewer 1 – Point 11

      Again acknowledging how challenging these experiments are: when you mutate a locus, you change CTCF binding but you also change the DNA. Thus, attributing the changes in interactions to presence/absence of CTCF binding is difficult, because the DNA substrate itself has changed. Perhaps you are presenting all of this as a negative result, given the modest effect on transcription, which is as important as a positive result, given the assumptions usually made about such things. But the results are not clearly described and your interpretation seems to go between implying the structural change is causative and being agnostic.

      We recognize that deleting a genomic region can affect both CTCF binding and the DNA substrate itself. For this reason, we implemented two parallel genome-editing strategies:

      (1) a straightforward Cas9-mediated deletion of ~100 bp centered on each CBS, and

      (2) a more precise HDR approach replacing only the 20 bp core CTCF motif.

      Because the HDR strategy succeeded, all downstream analyses were carried out on these minimal edits, which substantially limit disruption of other transcription factor motifs and reduce the likelihood of sequence-dependent polymer effects unrelated to CTCF.

      Nevertheless, to avoid implying unwarranted causality in the absence of more conclusive evidence, we added a paragraph to the Discussion outlining these limitations, including the sentence: “Our study also reflects general challenges in separating chromatin-architectural and transcriptional mechanisms. Although the CBS edits were restricted to the core CTCF motifs, additional sequence-dependent effects cannot be fully excluded, and we therefore interpret the resulting changes as consistent with—but not exclusively due to—loss of CTCF binding.” (lines 365-368)

      Reviewer 1 - Point 12.

      Figure 4C: since you have RNA-seq data, a much more objective way to present these data would be to show all data (again, A-B, up; A-B, down; B-A, up; B-A, down) and the effects of CTCF or GATA4. Regardless, you can still focus on the cardiac specific genes. But my guess is if you examine all genes, the pattern you show in panel C will not be present in the majority of cases. Furthermore, if this hypothesis is wrong, such an analysis will allow you to identify other genes affected by the mechanisms you describe and your analysis will test whether these mechanisms are in fact conserved at different loci.

      As outlined in our response to Point 1, we extended the analysis to all genes undergoing compartment changes and incorporated this into the cardioid RNA-seq dataset. This revealed a clear and consistent relationship between GATA4 or CTCF knockdown and the expression of B-to-A and A-to-B gene classes (updated Figure 4E).
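
      The class-wise comparison requested by the reviewer (A-to-B or B-to-A, up or down) amounts to grouping genes by compartment switch and expression direction, then summarizing the knockdown effect per group. The sketch below assumes a hypothetical record layout (`switch`, `diff_log2fc`, `kd_log2fc`) and uses made-up values; it is not the analysis behind updated Figure 4E.

```python
from collections import defaultdict
from statistics import median


def summarize_by_class(genes):
    """Median knockdown log2 fold change per (compartment switch, direction)."""
    groups = defaultdict(list)
    for gene in genes:
        # Direction of the expression change during differentiation.
        direction = "up" if gene["diff_log2fc"] > 0 else "down"
        groups[(gene["switch"], direction)].append(gene["kd_log2fc"])
    return {key: median(values) for key, values in groups.items()}


# Toy records with placeholder values (not data from the study).
toy_genes = [
    {"switch": "B-to-A", "diff_log2fc": 2.0, "kd_log2fc": -1.0},
    {"switch": "B-to-A", "diff_log2fc": 1.0, "kd_log2fc": -3.0},
    {"switch": "A-to-B", "diff_log2fc": -1.5, "kd_log2fc": 0.5},
]
for key, value in sorted(summarize_by_class(toy_genes).items()):
    print(key, value)
```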

      Reviewer 2 - Point 1.1

      1. CTCF regulation at TTN locus:

      (1) Figure 1A: The claim of the authors about convergent CTCF sites and transcriptional activation of TTN is quite simplistic. This claim is only valid when we know where cohesin is loaded. If cohesin is loaded at then intragenic GATA4 binding site, then the only important CTCF sites is at the promoter of TTN. I suggest that the authors read few more publications which may help the authors to better understand how cohesin and CTCF team up to regulate transcription, such as Hsieh et al., Nature Genetics, 2022; Liu et al., Nature Genetics, 2021; Rinzema et al., Nature Structural and Molecular Biology, 2022.

      Suggestion: The authors should add cohesin (RAD21/SMC1A) and NIPBL ChIP-seq for better interpretation.

      In line with the reviewer’s insightful suggestion, we integrated cohesin ChIP-seq data into updated Figure 1A. Specifically, we added a RAD21 ChIP-seq track from hESCs, which provides direct evidence of cohesin occupancy across the TTN locus. RAD21 binding closely parallels CTCF binding at five sites within the gene body, supporting a model in which promoter-proximal CTCF anchors cohesin to stabilize repressive loops at this locus. This analysis substantially strengthens the mechanistic framework and is consistent with the studies recommended by the reviewer, which we have now cited (lines 68 and 104).

      Reviewer 2 - Point 1.2

      (2) Figure 3B: If delta2CBS only has heterozygenous deletion of CBS6, why we would expect the binding will be weaken to 50%. However, the CTCF binding is reduced to around 1/10 in the ChIP-qPCR. How do the authors explain this?

      Sequencing of the Δ2CBS line shows that one CBS6 allele carries the intended EcoRI replacement, while the second allele contains a 2-bp deletion within the core CTCF motif (Figure S3C). Remarkably, this small deletion is sufficient to abolish CTCF binding, resulting in complete loss of occupancy at CBS6 despite heterozygosity. We clarified this in the text as follows: “CTCF ChIP-qPCR in hiPSCs confirmed complete loss of CTCF binding at the targeted sites, including CBS6 in the Δ2CBS line, indicating that the 2-bp deletion sufficed to disrupt CTCF binding while occupancy at other CBSs remained unaffected.” (lines 187–189).

      Reviewer 2 - Point 1.3a

      (3) Figure 3C: There are two problems with the 4C experiments: (a) The changes are really mild. In fact, none of the p-values in Figure 3D are significant.

      The effect of deleting CBS1 is indeed modest, consistent with reports that individual CTCF binding sites often show functional redundancy (e.g., Rodríguez-Carballo et al. 2017; Barutcu et al. 2018; Kang et al. 2021). Nevertheless, our 4C-seq experiments have reproducibly shown the same directional trend across biological replicates. To increase statistical power and more rigorously assess the robustness of this effect, we are generating additional 4C replicates as part of the full revision.

      Reviewer 2 - Point 1.3b

      [In the 4C experiments] (b) The authors should also consider a model that CTCF directly serves as a repressor. In this way, 3D genome may not be involved. B-A switch is simply caused by the activation of the locus.

      We now explicitly acknowledge this possibility in the Discussion. The revised text states: “Moreover, our data cannot unambiguously separate CTCF’s architectural role from potential direct repressive activity. Both mechanisms could contribute to the observed effects, and our findings likely reflect the combined influence of CTCF on chromatin topology and gene regulation.” (lines 368–371).

      Reviewer 2 - Point 2.1a

      2. (CTCF) detachment: The authors mentioned few times "detachment". In the context of this manuscript, the authors indicate detachment from nuclear lamina. However, the authors haven't provide convincing evidence about this.

      In the two instances where we used the term “detachment,” we intended it to refer exclusively to reduced CTCF binding to DNA, not to lamina repositioning. To avoid ambiguity, we have replaced “detachment” with “reduced binding” in both locations (lines 123 and 329). We do not use this term to describe TTN–lamina positioning.

      Reviewer 2 - Point 2.1b

      (1) Figure 1D: I doubt whether such changes of CTCF protein abundance will lead to LAD detachment. Suggest the authors read van Schaik et al., Genome Biology, 2022. With the full depletion of CTCF, the effects on LADs are still very restricted.

      We agree that the observed correlation between reduced CTCF levels and the relocation of TTN away from a LAD does not establish causality. As outlined in our response to Reviewer 1 – Point 8c, we are performing additional FISH experiments at earlier differentiation stages in the CTCF inducible knockdown line to directly assess whether partial CTCF depletion is sufficient to alter the timing of TTN–lamina separation.

      Reviewer 2 - Point 2.2

      (2) Figure 2D: Lamin B1 should be mostly at nuclear periphery. I have few questions: (1) is the antibody specific? (2) do these cells carry mutation in LMNB1 gene? (3) is the staining actually LMNA?

      As also clarified in response to Reviewer 1 – Point 8b, the original images displayed maximum-intensity projections of Z-stacks, which obscured the peripheral distribution of LMNB1. We have updated Figure 2D and Figure S2E to show representative individual optical sections, which more clearly display the expected peripheral LMNB1 signal. We also confirm that the antibody used is specific for LMNB1 and previously validated (Bertero et al. 2019b), and that the WTC11-derived lines used in this study carry no mutation in LMNB1.

      Reviewer 2 - Point 3

      3. Opposite functions of GATA4 and CTCF: These data in Figure 5E-H argues the opposite role of GATA4 and CTCF in transcriptional regulation. Would it be that CTCF KD just affected cell proliferation, which is actually known for many cell types, rather than affect CM differentiation process? If this is the reason, inversed correlation between CTCF KD and GATA4 KD in Figure 4D could also be explained by opposite effects on cell cycle.

      We directly evaluated this possibility. In FHF–LV cardioids, cell cycle profiling in Figure 6C and Figure S6C (now S7C) showed that CTCF knockdown does not alter the distribution of CMs across G1/S/G2–M phases, in contrast to the marked increase in proliferation observed with GATA4 knockdown.

      Because this comment referred specifically to the SHF data, we also analyzed mitotic gene expression in the SHF–RV bulk RNA-seq dataset using GSEA. CTCF knockdown did not significantly enrich any cell cycle–related gene sets, whereas GATA4 knockdown produced a strong enrichment for mitotic cell cycle terms, in line with FHF-LV data (Reviewer Figure 2).

      These results are summarized in updated Figure S5C, reporting also the results of the broader GSEA analysis, and together indicate that the transcriptional divergence between CTCF and GATA4 knockdown is not simply explained by opposing effects on proliferation.

      Reviewer Figure 2. GSEA for mitotic cell cycle in SHF-RV after inducible knockdown of CTCF (left) or GATA4 (right). p-values by Adaptive Monte-Carlo Permutation test.
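
      For readers unfamiliar with GSEA, the statistic behind such plots is a running-sum enrichment score over a ranked gene list. The sketch below shows only the unweighted (Kolmogorov-Smirnov-like) form of that score on a toy ranking. It is an assumption-laden illustration: it omits the correlation weighting and the adaptive Monte-Carlo permutation test mentioned in the legend, and it is not the software the authors used.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    ranked_genes: unique gene identifiers, ordered from most up- to most
    down-regulated. gene_set: identifiers belonging to the tested set.
    Returns the running-sum value with the largest absolute deviation from 0.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        # Step up on a set member, step down otherwise; both sum to zero overall.
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking with placeholder identifiers.
ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranking, {"g1", "g2"}))
print(enrichment_score(ranking, {"g5", "g6"}))
```

      A positive score indicates that set members cluster at the top of the ranking, and a negative score indicates clustering at the bottom. Significance is then assessed by comparing the observed score with scores from permuted data.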

      Reviewer 2 - Point 4

      4. In discussion, the authors suggested that CTCF is a local chromatin remodeller. In my view, association with local chromatin compaction doesn't qualify CTCF as a chromatin remodeler. To my knowledge, CTCF does not have an enzymatic domain, then how does it remodel chromatin?

      Our intended meaning was that CTCF shapes 3D chromatin architecture through its role in organizing intergenic looping, not that it remodels chromatin enzymatically. To avoid confusion, we have removed the original sentence from the Discussion.

      Reviewer 2 - Point 5

      5. Some conclusions are drawn based on insignificant p-values, e.g. Figure 2F, Figure 3D, etc. The authors should be careful about their conclusion, and tone down their statement for the observations have borderline significance.

      The conclusions based on bulk RNA-seq have been revised in response to Reviewer 1 – Point 1 (updated Figure 2F). By subsetting B-to-A and A-to-B genes according to their expression dynamics, this analysis now yields clearer and statistically significant differences between conditions.

      Regarding the 4C-seq data, as acknowledged in our response to Reviewer 2 – Point 1.3a, the observed effects are modest. We are generating additional biological replicates to increase statistical power. In the meantime, we have adjusted the text to avoid overstating these findings. The revised manuscript now states: “While the difference did not reach significance, these trends suggest …” (lines 199–200).

      Reviewer 2 - Minor comment 1

      Minor comments: 1. Figure 1A: (1) I suggest to label two promoters in the gene model. It's unclear in the figure in the current version; (2) I was a bit confused with the way how the authors labeled CTCF directionality. I thought there are a lot of promoters. Why didn't they use triangles?

      We updated Figure 1A to label both TTN promoters and indicate their orientation. For CTCF sites, we now clearly display the motif direction and core binding region as determined by FIMO analysis of the CTCF ChIP-seq peaks, improving consistency and interpretability.

      Reviewer 2 - Minor comment 2

      2. Figure 2C: I think the drastical reduction of titin-mEGFP levels is only due to the way how the authors analyze their FACS data. Can the author quantify on median fluorescence intensity?

      The gating strategy for titin-mEGFP⁺ cells was defined using a reporter-negative control, and cells lacking TNNT2 expression showed no detectable titin-mEGFP signal, confirming the specificity of the gate. To complement this analysis, we also quantified the median fluorescence intensity (MFI) of titin-mEGFP⁺ cells. The MFI analysis corroborates the original findings, showing a significant decrease in GATA4 knockdown and an increase in CTCF knockdown (updated Figure S2D).
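
      The distinction the reviewer raises, between the fraction of reporter-positive cells and their median fluorescence intensity, can be made concrete with a short sketch. Everything here is assumed for illustration: the percentile-based gate, the function names, and the intensity values. The authors state only that their gate was defined using a reporter-negative control.

```python
from statistics import median


def gate_threshold(negative_control, quantile=0.99):
    """Intensity below which roughly `quantile` of negative-control cells fall."""
    ordered = sorted(negative_control)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def summarize_reporter(intensities, threshold):
    """Fraction of cells above the gate and the MFI of those cells."""
    positive = [value for value in intensities if value > threshold]
    fraction = len(positive) / len(intensities)
    mfi = median(positive) if positive else None
    return fraction, mfi


# Placeholder intensities in arbitrary units (not data from the study).
negative = [5, 8, 10, 12, 15, 18, 20, 22, 25, 30]
sample = [10, 20, 200, 300, 400]
threshold = gate_threshold(negative)
fraction, mfi = summarize_reporter(sample, threshold)
print(f"gate = {threshold}, positive fraction = {fraction:.2f}, MFI = {mfi}")
```

      Reporting both numbers separates a change in how many cells express the reporter from a change in how much reporter each positive cell carries.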

      Reviewer 2 - Minor comment 3

      3. Figure S2G: P value should be -log10, I assume. Please label it accurately.

      We appreciate the reviewer pointing out this labeling error. In the revised manuscript, this panel has been removed to accommodate the updated compartment–expression analysis now presented in updated Figure 2H (see response to Reviewer 1 – Point 1), and the issue is no longer applicable.

      References

      Barutcu AR, Maass PG, Lewandowski JP, Weiner CL, Rinn JL. 2018. A TAD boundary is preserved upon deletion of the CTCF-rich Firre locus. Nat Commun 9: 1444.

      Bertero A, Fields PA, Ramani V, Bonora G, Yardımcı GG, Reinecke H, Pabon L, Noble WS, Shendure J, Murry CE. 2019a. Dynamics of genome reorganization during human cardiogenesis reveal an RBM20-dependent splicing factory. Nat Commun 10: 1538.

      Bertero A, Fields PA, Smith AS, Leonard A, Beussman K, Sniadecki NJ, Kim D-H, Tse H-F, Pabon L, Shendure J, et al. 2019b. Chromatin compartment dynamics in a haploinsufficient model of cardiac laminopathy. J Cell Biol 218: 2919–2944.

      Kang J, Kim YW, Park S, Kang Y, Kim A. 2021. Multiple CTCF sites cooperate with each other to maintain a TAD for enhancer–promoter interaction in the β-globin locus. FASEB J 35: e21768.

      Poleshko A, Shah PP, Gupta M, Babu A, Morley MP, Manderfield LJ, Ifkovits JL, Calderon D, Aghajanian H, Sierra-Pagán JE, et al. 2017. Genome-nuclear lamina interactions regulate cardiac stem cell lineage restriction. Cell 171: 573–587.

      Rodríguez-Carballo E, Lopez-Delisle L, Zhan Y, Fabre PJ, Beccari L, El-Idrissi I, Huynh THN, Ozadam H, Dekker J, Duboule D. 2017. The HoxD cluster is a dynamic and resilient TAD boundary controlling the segregation of antagonistic regulatory landscapes. Genes Dev 31: 2264–2281.

    2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.

      Referee #2

      Evidence, reproducibility and clarity

      Becca et al. characterized the functions of GATA4 and CTCF in the context of cardiomyogenesis. The authors aim to establish a link between 3D genome changes (A/B compartment and long-range chromatin interactions) and activation of cardiac specific genes such as TTN. They showed opposite effects of GATA4 and CTCF in regulating these genes as well as phenotypical traits. I have the following suggestions and questions:

      Major comments:

      1. CTCF regulation at TTN locus:

      (1) Figure 1A: The claim of the authors about convergent CTCF sites and transcriptional activation of TTN is quite simplistic. This claim is only valid when we know where cohesin is loaded. If cohesin is loaded at then intragenic GATA4 binding site, then the only important CTCF sites is at the promoter of TTN. I suggest that the authors read few more publications which may help the authors to better understand how cohesin and CTCF team up to regulate transcription, such as Hsieh et al., Nature Genetics, 2022; Liu et al., Nature Genetics, 2021; Rinzema et al., Nature Structural and Molecular Biology, 2022.

      Suggestion: The authors should add cohesin (RAD21/SMC1A) and NIPBL ChIP-seq for better interpretation.

      (2) Figure 3B: If delta2CBS only has heterozygenous deletion of CBS6, why we would expect the binding will be weaken to 50%. However, the CTCF binding is reduced to around 1/10 in the ChIP-qPCR. How do the authors explain this?

      (3) Figure 3C: There are two problems with the 4C experiments: (a) The changes are really mild. In fact, none of the p-values in Figure 3D are significant; (b) The authors should also consider a model that CTCF directly serves as a repressor. In this way, 3D genome may not be involved. B-A switch is simply caused by the activation of the locus.

      2. (CTCF) detachment: The authors mentioned few times "detachment". In the context of this manuscript, the authors indicate detachment from nuclear lamina. However, the authors haven't provide convincing evidence about this.

      (1) Figure 1D: I doubt whether such changes of CTCF protein abundance will lead to LAD detachment. Suggest the authors read van Schaik et al., Genome Biology, 2022. With the full depletion of CTCF, the effects on LADs are still very restricted.

      (2) Figure 2D: Lamin B1 should be mostly at nuclear periphery. I have few questions: (1) is the antibody specific? (2) do these cells carry mutation in LMNB1 gene? (3) is the staining actually LMNA?

      3. Opposite functions of GATA4 and CTCF: These data in Figure 5E-H argues the opposite role of GATA4 and CTCF in transcriptional regulation. Would it be that CTCF KD just affected cell proliferation, which is actually known for many cell types, rather than affect CM differentiation process? If this is the reason, inversed correlation between CTCF KD and GATA4 KD in Figure 4D could also be explained by opposite effects on cell cycle.

      4. In discussion, the authors suggested that CTCF is a local chromatin remodeller. In my view, association with local chromatin compaction doesn't qualify CTCF as a chromatin remodeler. To my knowledge, CTCF does not have an enzymatic domain, then how does it remodel chromatin?

      5. Some conclusions are drawn based on insignificant p-values, e.g. Figure 2F, Figure 3D, etc. The authors should be careful about their conclusion, and tone down their statement for the observations have borderline significance.

      Minor comments:

      1. Figure 1A: (1) I suggest to label two promoters in the gene model. It's unclear in the figure in the current version; (2) I was a bit confused with the way how the authors labeled CTCF directionality. I thought there are a lot of promoters. Why didn't they use triangles?
      2. Figure 2C: I think the drastical reduction of titin-mEGFP levels is only due to the way how the authors analyze their FACS data. Can the author quantify on median fluorescence intensity?
      3. Figure S2G: P value should be -log10, I assume. Please label it accurately.

      Significance

      Strengths and limitations:

      I feel that single-cell analysis and functional analysis of GATA4 and CTCF using cardiac organoid model are elegant. However, the weak part of the manuscript is the link between 3D genome and activation of TTN. I also think the authors should include more possible explanations for the interpretation of some genome organization data (CTCF site deletion, 4C, etc).

      Advance: The study does provide useful information to understand transcriptional regulation during cardiac lineage specification. The link between 3D genome and cardiac lineage specification is conceptually nice but needs more data to support.

      Audience: developmental biologists who are interested in heart development and molecular biologists with specific interests in gene regulation.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      This report demonstrates that the gene expression output of the Wnt pathway, when controlled precisely by a synthetic light-based input, depends substantially on the frequency of stimulation. The particular frequency-dependent trend that is observed - anti-resonance, a suppression of target gene expression at intermediate frequencies given a constant duty cycle - is a novel aspect that has not been clearly shown before for this or other signaling pathways. The paper provides both clear experimental evidence of the phenomenon with engineered cellular systems and a model-based analysis of how the pairing of rate constants in pathway activation/deactivation could result in such a trend.

      Strengths:

      This report couples in vitro experimental data with an abstracted mathematical model. Both of these approaches appear to be technically sound and to provide consistent and strong support for the main conclusion. The experimental data are particularly clear, and the demonstration that Brachyury expression is subject to anti-resonance in ESCs is particularly compelling. The modeling approach is reasonably scaled for the system at the level of detail that is needed in this case, and the hidden variable analysis provides some insight into how the anti-resonance works.

      Weaknesses:

      (1) The anti-resonance phenomenon has not been demonstrated using physiological Wnt ligands; however, I view this as only a minor weakness for an initial report of the phenomenon. The potential significance of the phenomenon for Wnt outweighs the amount of effort it would take to carry the demonstration further - testing different frequencies/duty cycles at the level of ligand stimulus using microfluidics could get quite involved, and would likely take quite some time. Adding some more discussion about how the time scales of ligand-receptor binding could play into the reduced model would further ameliorate this issue.

      We thank the reviewer for this comment and the interesting suggestion to test the anti-resonance phenomenon with microfluidics. We agree that combining physiological Wnt ligands with microfluidic stimulation would go beyond the scope of this current study, though it is an interesting extension. One advantage of the optogenetic setup, as mentioned in the discussion, is that the Wnt stimulus can be turned off sharply. This allows us to test the output from perfectly square wave input profiles; in microfluidics, washing the sticky ligand off the cells might “smear” the effective input profile cells respond to.

      We show in Supplementary Fig. 6 that our reduced model matches the experimental data and that we would expect the antiresonance phenomenon as long as (see Fig. 4). Practically, a smeared input profile implies an effective reduction of k_off, which means that the phenomenon would be visible with microfluidics (provided the minimum is deep enough, see Fig. 4). However, this should still be considered with caution, as the antiresonance would then appear because the cells essentially receive a smeared-out or continuous pulse in the high-frequency limit, rather than because the cells respond to a square wave in a specific way.
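
      The smearing argument can be illustrated with a toy calculation. Assume, purely for illustration, that ligand washout acts as a first-order filter on the square-wave stimulus u(t), so the effective input s obeys ds/dt = (u - s) / tau with a hypothetical washout time constant tau. This is not the authors' reduced model. Under that assumption the periodic steady state has a closed form, and the sketch below evaluates its peak, trough and ripple.

```python
import math


def smeared_square_wave(period, duty, tau):
    """Steady-state peak, trough and ripple of a first-order-filtered square wave.

    The input is 1 for a fraction `duty` of each period and 0 otherwise;
    `tau` is the (assumed) washout time constant, in the same units as `period`.
    """
    a = math.exp(-(1.0 - duty) * period / tau)  # decay factor over the OFF phase
    b = math.exp(-duty * period / tau)          # decay factor over the ON phase
    # Periodicity requires peak = 1 - b * (1 - trough) and trough = a * peak.
    peak = (1.0 - b) / (1.0 - a * b)
    trough = a * peak
    return peak, trough, peak - trough


# Scan periods from much longer to much shorter than the washout time.
for period in (100.0, 10.0, 1.0, 0.1, 0.01):
    peak, trough, ripple = smeared_square_wave(period, duty=0.5, tau=1.0)
    print(f"period = {period:6.2f}  peak = {peak:.4f}  "
          f"trough = {trough:.4f}  ripple = {ripple:.4f}")
```

      When the period is long compared with tau, the filtered input still swings almost fully between 0 and 1. When the period is short, the ripple vanishes and the input settles near the duty cycle, which is the near-continuous stimulus the response warns about.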

      (2) While the model is fully consistent with the data, it has not been validated using experimental manipulations to establish that the mechanisms of the cell system and the model are the same. There may be some ways to make such modifications, for example, using a proteasome inhibitor. An alternative would be to more explicitly mention the need to validate the model's mechanism with experiments.

      We thank the reviewer for this valuable and constructive comment. We agree that future experimental perturbations that directly modulate pathway activation and reset kinetics—such as proteasome inhibition, targeted degradation of pathway components, or engineered changes in receptor turnover—would provide an important validation of the model’s mechanistic interpretation. In the present study, our primary goal was to establish the existence and quantitative features of anti-resonance in the Wnt pathway and to identify the minimal set of timescale relationships that can explain it. We view the proposed experimental validations as exciting next steps that extend beyond the scope of the current work, and we are grateful to the reviewer for emphasizing their importance. We now mention this explicitly in the discussion of our manuscript.

      (3) I think the manuscript misses an opportunity to discuss the potential of the phenomenon in other pathways. The hedgehog pathway, for example, involves GSK3-mediated partial proteolysis of a transcription factor, which could conceivably be subject to similar behaviors, and there are certainly other examples as well.

      We thank the reviewer for pointing out an opportunity to emphasize the possibility of this phenomenon in other pathways. The minimal model indicates that anti-resonance emerges whenever a rapid activating process is paired with a slower deactivating/reset process. Beyond Hedgehog/Gli processing, candidate circuits include: NF-κB (rapid IκBα phosphorylation/degradation vs slower IκBα resynthesis), ERK (fast phosphorylation bursts vs slower transcriptional negative feedback such as DUSPs), Notch (fast γ-secretase NICD release vs slower NICD turnover and feedback), BMP/TGF-β–SMAD (fast R-SMAD phosphorylation vs slower receptor trafficking/SMAD7 feedback), and Hippo/YAP (rapid cytoplasmic sequestration vs slower transcriptional feedback). Each contains the same timescale separation that should create a frequency ‘stop-band,’ predicting suppressed gene expression or fate transitions at intermediate stimulation frequencies. We have updated the manuscript’s discussion to mention the Hedgehog connection with the following added text: “Analogous band-stop filtering should arise in other developmental circuits that couple a fast ‘ON’ step to slower deactivation or negative feedback. In Hedgehog, for example, PKA/CK1/GSK3-mediated partial proteolysis of Gli with slower recovery of full-length Gli creates the same fast-activation/slow-reset motif our hidden-variable model predicts will yield anti-resonance, and Wnt–Hedgehog crosstalk through the shared kinase GSK3 suggests such frequency selectivity could occur in other developmental signaling pathways.”

      We also added an additional sentence regarding different activation and deactivation timescales in other pathways.

      (4) Some aspects of the modeling and hidden variable analysis are not optimally presented in the main text, although when considered together with the Supplemental Data, there are no significant deficiencies.

      We have addressed the model choices and analysis now more clearly in the main manuscript and also referred to the Supplemental Data more directly.

      Reviewer #2 (Public review):

      Summary:

      By combining optogenetics with theoretical modelling, the authors identify an anti-resonance behavior in the Wnt signaling pathway. This behavior is manifested as a minimal response at a certain stimulation frequency. Using an abstracted hidden variable model, the authors explain their findings by a competition of timescales. Furthermore, they experimentally show that this anti-resonance influences the cell fate decision involved in human gastrulation.

      Strengths:

      (1) This interdisciplinary study combines precise optogenetic manipulation with advanced modelling.

      (2) The results are directly tested in two different systems: HEK293T cells and H9 human embryonic stem cells.

      (3) The model is implemented based on previous literature and has two levels of detail: i) a detailed biochemical model and ii) an abstract model with a hidden parameter.

      Weaknesses:

      (1) While the experiments provide both single-cell data and population data, the model only considers population data.

      We thank the reviewer for correctly pointing out that the single-cell measurements would in principle allow us to incorporate the cell-to-cell heterogeneity into the model. In this study, we sought to identify a minimal quantitative model of the Wnt pathway that could explain anti-resonance through competing time scales. We believe that, for our purposes, focusing on population data allowed us to keep the complexity of the model to a minimum to increase its explanatory value. We agree with the reviewer that considering single-cell trajectories is an interesting direction for further work.

      (2) Although the model captures the experimental data for TopFlash very well, the beta-Cat curves (Figure 2B) are only described qualitatively. This discrepancy is not discussed.

      Indeed, our model fits to the mean β-catenin expression are more qualitative than those for TopFlash. Fitting β-catenin was difficult because its expression is typically low and closer to the detection limit than that of TopFlash. As a result, the variation between individual signal trajectories, relative to the light-off condition, is higher for β-catenin than for TopFlash. We therefore aimed for a qualitative rather than a quantitative fit to the mean β-catenin expression profile. The current model fit lies well within one standard deviation of the data. Given the observed heterogeneity and the fact that we take the parameters from the literature (which ensures that their order of magnitude is in a sensible range), we believe that the model fits are reasonable. We now mention this explicitly in the text.

      Overall Assessment:

      The authors convincingly identified an anti-resonance behavior in a signaling pathway that is involved in cell fate decisions. The focus on a dynamic signal and the identification of such a behavior is important. I believe that the model approach of abstracting a complicated pathway with a hidden variable is an important tool to obtain an intuitive understanding of complicated dependencies in biology. Such a combination of precise optogenetic manipulation with effective models will provide a new perspective on causal dependencies in signaling pathways and should not be limited only to the system that the authors study.

      We thank both reviewers for the positive assessment of our manuscript.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      There are several points that deserve more discussion, as noted above in the review.

      (1) It would be worthwhile to consider whether a relatively simple experiment with a proteasome inhibitor or similar pharmacological manipulation could provide useful validation data for the model.

      We address this point above in the weaknesses section from reviewer 1.

      (2) The figure legend for S5C should clarify whether the values plotted are at a particular fixed time point, or (more likely) at a certain time following the second pulse, which would be variable.

      We have modified the figure caption to clarify that the values plotted are at a fixed time point in the simulation (t = 48 hrs). We chose this timepoint sufficiently long after the second pulse to ensure that there are no residual dynamical effects. We thank the reviewer for noting this.

      (3) As noted in the SciScore document, various aspects of the resource reporting should be improved, such as including RRIDs, etc.

      We are submitting our plasmids to Addgene; the Python and MATLAB versions used are listed in our Methods section.

      Reviewer #2 (Recommendations for the authors):

      I mostly have suggestions to improve the clarity of the presentation.

      (1) Not all symbols in the equations given in the main text are explained. This is rather annoying, because either you present them and explain what they are or you don't show them and refer to the supplements. For example, d_0 or c_o or \bar{b} or n or K are not explained.

      We have now more clearly presented the parameters in the main text and added signposts to the Methods section.

      (2) Overall, it is often not clear what data in the figures are redundant, although the authors referred to them in the text. For example, in Figure 2c, a curve for 24 hours is shown and referred back to Figure 1D. However, in Figure 1D there is no curve for 24 hours. Is the data from Supplementary Figure 1 H and K also in the main text?

      We thank the referee for pointing out these redundancies. We have now included the 24hr line in Figure 1D and are now only showing the unsmoothed data, also in the main text of the manuscript. To clarify supplemental figures, we have now removed S1H and S1K since all they showed was the unsmoothed version of the data. The remaining plots in Supplementary Figure 1 are normalized differently from what we show in Figure 1 to demonstrate our choice of normalization is not the reason for the observed optogenetic response.

    1. Author response:

      Reviewer #1 (Public Review):

      Lai and Doe address the integration of spatial information with temporal patterning and genes that specify cell fate. They identify the Forkhead transcription factor Fd4 as a lineage-restricted cell fate regulator that bridges transient spatial transcription factors to terminal selector genes in the developing Drosophila ventral nerve cord. The experimental evidence convincingly demonstrates that Fd4 is both necessary for late-born NB7-1 neurons and sufficient to transform other neural stem cell lineages toward the NB7-1 identity. This work addresses an important question that will be of interest to developmental neurobiologists, namely how cell identities defined by initial transient developmental cues can be maintained in the progeny cells, even though the molecular mechanism remains to be investigated. In addition, the study proposes a broader concept of lineage identity genes that could be utilized in other lineages and regions in the Drosophila nervous system and in other species.

      Thanks for the accurate summary and positive comments!

      While the spatial factors patterning the neuroepithelium to define the neuroblast lineages in the Drosophila ventral nerve cord are known, these factors are sometimes absent or not required during neurogenesis. In the current work, Lai and Doe identified Fd4 in the NB7-1 lineage that bridges this gap and explains how NB7-1 neurons are specified after Engrailed (En) and Vnd cease their expression. They show that Fd4 is transiently co-expressed with En and Vnd and is present in all nascent NB7-1 progenies. They further demonstrate that Fd4 is required for later-born NB7-1 progenies and sufficient for the induction of NB7-1 markers (Eve and Dbx) while repressing markers of other lineages when force-expressed in neural progenitors, e.g., in the NB5-6 lineage and in the NB7-3 lineage. They also demonstrate that, when Fd4 is ectopically expressed in NB7-3 and NB5-6 lineages, this leads to the ectopic generation of dorsal muscle-innervating neurons. The inclusion of functional validation using axon projections demonstrates that the transformed neurons acquire appropriate NB7-1 characteristics beyond just molecular markers. Quantitative analyses are thorough and well-presented for all experiments.

      Thanks for the positive comments!

      (1) While Fd4 is required and sufficient for several later-born NB7-1 progeny features, a comparison between early-born (Hb/Eve) and later-born (Run/Eve) appears missing for pan-progenitor gain of Fd4 (with sca-Gal4; Figure 4) and for the NB7-3 lineage (Figure 6). Having a quantification for both could make it clearer whether Fd4 preferentially induces later-born neurons or is sufficient for NB7-1 features without temporal restriction.

      We quantified the percentage of Hb+ and Runt+ cells among Eve+ cells with sca-gal4, and the results are shown in Figure 4-figure supplement 1. We found that the proportion of early-born cells is slightly reduced but the proportion of later-born cells remains similar. Interestingly, we also found a subset of Eve+ cells with a mixed fate (Hb+Runt+), but the reason remains unclear.

      (2) Fd4 and Fd5 are shown to be partially redundant, as Fd4 loss of function alone does not alter the number of Eve+ and Dbx+ neurons. This information is critical and should be included in Figure 3.

      Because every hemisegment in an fd4 single mutant is normal, we simply added the following text: “In fd4 mutants, we observe no change in the number of Eve+ neurons or Dbx+ neurons (n=40 hemisegments).”

      (3) Several observations suggest that lineage identity maintenance involves both Fd4-dependent and Fd4-independent mechanisms. In particular, the fact that fd4-Gal4 reporter remains active in fd4/fd5 mutants even after Vnd and En disappear indicates that Fd4's own expression, a key feature of NB7-1 identity, is maintained independently of Fd4 protein. This raises questions about what proportion of lineage identity features require Fd4 versus other maintenance mechanisms, which deserves discussion.

      We agree, thanks for raising this point. We add the following text to the Discussion. “Interestingly, the fd4 fd5 mutant maintains expression of fd4:gal4, suggesting that the fd4/fd5 locus may have established a chromatin state that allows “permanent” expression in the absence of Vnd, En, and Fd4/Fd5 proteins.”

      (4) Similarly, while gain of Fd4 induces NB7-1 lineage markers and dorsal muscle innervation in NB5-6 and NB7-3 lineages, drivers for the two lineages remain active despite the loss of molecular markers, indicating some regulatory elements retain activity consistent with their original lineage identity. It is therefore important to understand the degree of functional conversion in the gain-of-function experiments. Sparse labeling of Fd4 overexpressing NB5-6 and NB7-3 progenies, as was done in Seroka and Doe (2019), would be an option.

      We agree it is interesting that the NB7-3 and NB5-6 drivers remain on following Fd4 misexpression. To explore this, we used sca-gal4 to overexpress Fd4 and observed that Lbe expression persisted while Eg was largely repressed (see Author response image 1 below). The results show that Lbe and Eg respond differently to Fd4. A non-mutually exclusive possibility is that the continued expression of lbe-Gal4 UAS-GFP or eg-Gal4 UAS-GFP may be due to the lengthy perdurance of both Gal4 and GFP.

      Author response image 1.

      (5) The less-penetrant induction of Dbx+ neurons in NB5-6 with Fd4-overexpression is interesting. It might be worth the authors discussing whether it is an Fd4 feature or an NB5-6 feature by examining Dbx+ neuron number in NB7-3 with Fd4-overexpression.

      In the NB7-3 lineages misexpressing Fd4, only 5 lineages generated Dbx+ cells (0.1±0.4, n=64 hemisegments), suggesting that the low penetrance of Dbx+ induction is an intrinsic feature of Fd4 rather than lineage context. We have added this information in the results section. 

      (6) It is logical to hypothesize that spatial factors specify early-born neurons directly, so only late-born neurons require Fd4, but it was not tested. The model would be strengthened by examining whether Fd4-Gal4-driven Vnd rescues the generation of later-born neurons in fd4/fd5 mutants.

      When we used the en-gal4 driver to express UAS-vnd in the fd4/fd5 mutant background, we found an average of 7.4±2.2 Eve+ cells per hemisegment (n=36), significantly higher than in the fd4/fd5 mutant alone (3.9±0.8 cells, n=52, p = 2.6 × 10⁻¹¹) (Figure 3J). In addition, 0.2±0.5 Eve+ cells were ectopic Hb+ (excluding U1/U2), indicating that Vnd-En integration is sufficient to generate both early-born and late-born Eve+ cells in the fd4/fd5 mutants. We have added the results to the text.

      (7) It is mentioned that Fd5 is not sufficient for the NB7-1 lineage identity. The observation is intriguing in how similar regulators serve distinct roles, but the data are not shown. The analysis in Figure 4 should be performed for Fd5 as supplemental information.

      Thanks for the suggestion. Because the results are exactly the same as the wild type, we don’t think it is necessary to provide additional images or analysis as supplemental information.

      Reviewer #2 (Public review):

      Via a detailed expression analysis, they find that Fd4 is selectively expressed in embryonic NB7-1 and newly born neurons within this lineage. They also undertake a comprehensive genetic analysis to provide evidence that fd4 is necessary and sufficient for the identity of NB7-1 progeny. 

      Thanks for the accurate summary!

      The analysis is both careful and rigorous, and the findings are of interest to developmental neurobiologists interested in molecular mechanisms underlying the generation of neuronal diversity. Great care was taken to make the figures clear and accessible. This work takes great advantage of years of painstaking descriptive work that has mapped embryonic neuroblast lineages in Drosophila. 

      Thanks for the positive comments!

      The argument that Fd4 is necessary for NB7-1 lineage identity is based on a Fd4/Fd5 double mutant. Loss of fd4 alone did not alter the number of NB7-1-derived Eve+ or Dbx+ neurons. The authors clearly demonstrate redundancy between fd4 and fd5, and the fact that the LOF analysis is based on a double mutant should be better woven through the text.

      The authors generated an Fd5 mutant. I assume that Fd5 single mutants do not display NB7-1 lineage defects, but this is not stated. The focus on Fd4 over Fd5 is based on its highly specific expression profile and the dramatic misexpression phenotypes. But the LOF analysis demonstrates redundancy, and the conclusions in the abstract and through the results should reflect the existence of Fd5 in the conclusions of this manuscript.

      We agree, and have added new text to clarify the single mutant phenotypes (there are none) and the double mutant phenotype (loss of NB7-1 molecular and morphological features). The following text is added to the manuscript: “Not surprisingly, we found that fd4 single mutants or fd5 single mutants had no phenotype (Eve+ neurons were all normal). Thus, to assess their roles, we generated a fd4 and fd5 double mutant. Because many Eve+ and Dbx+ cells are generated outside of NB7-1 lineage, it was also essential to identify the Eve+ or Dbx+ cells within NB7-1 lineage in wild type and fd4 mutant embryos. To achieve this, we replaced the open reading frame of fd4 with gal4 (called fd4-gal4) (see Methods); this stock simultaneously knocked out both fd4 and fd5 (called fd4/fd5 mutant hereafter) while specifically labeling the NB7-1 lineage. For the remainder of this paper we use the fd4/fd5 double mutant to assay for loss of function phenotypes.”

      It is notable that Fd4 overexpression can rewire motor circuits. This analysis adds another dimension to the changes in transcription factor expression and, importantly, demonstrates functional consequences. Could the authors test whether U4 and U5 motor axon targeting changes in the fd4/fd5 double mutant? To strengthen claims regarding the importance of fd4/fd5 for lineage identity, it would help to address terminal features of U motorneuron identity in the LOF condition.

      Thanks for raising this important point. We examined the axon targeting on body wall muscles in both wild type and in fd4/fd5 mutant background and added the results in Figure 3-figure supplement 2. We found that the axon targeting in the late-born neuron region (LL1) is significantly reduced, suggesting that the loss of late-born neurons in fd4/fd5 mutant leads to the absence of innervation of corresponding muscle targets.

      Reviewer #3 (Public review):

      The goal of the work is to establish the linkage between the spatial transcription factors (STFs) that function transiently to establish the identities of the individual NBs and the terminal selector genes (typically homeodomain genes) that appear in the newborn postmitotic neurons. How is the identity of the NB maintained and carried forward after the spatial genes have faded away? Focusing on a single neuroblast (NB 7-1), the authors present evidence that the fork-head transcription factor, fd4, provides a bridge linking the transient spatial cues that initially specified neuroblast identity with the terminal selector genes that establish and maintain the identity of the stem cell's progeny. 

      Thanks for the positive comments!

      The study is systematic, concise, and takes full advantage of 40+ years of work on the molecular players that establish neuronal identities in the Drosophila CNS. In the embryonic VNC, fd4 is expressed only in the NB 7-1 and its lineage. They show that Fd4 appears in the NB while the latter is still expressing the Spatial Transcription Factors and continues after the expression of the latter fades out. Fd4 is maintained through the early life of the neuronal progeny but then declines as the neurons turn on their terminal selector genes. Hence, fd4 expression is compatible with it being a bridging factor between the two sets of genes. 

      Thanks for the accurate summary!

      Experimental support for the "bridging" role of Fd4 comes from a set of loss-of-function and gain-of-function manipulations. The loss of function of Fd4, and the partially redundant gene Fd5, from lineage 7-1 does not affect the size of the lineage, but terminal markers of late-born neuronal phenotypes, like Eve and Dbx, are reduced or missing. By contrast, ectopic expression of fd4, but not fd5, results in ectopic expression of the terminal markers eve and Dbx throughout diverse VNC lineages.

      Thanks for the accurate summary!

      A detailed test of fd4's expression was then carried out using lineages 7-3 and 5-6, two well-characterized lineages in Drosophila. Lineage 7-3 is much smaller than 7-1 and continues to be so when subjected to fd4 misexpression. However, under the influence of ectopic Fd4 expression, the lineage 7-3 neurons lost their expected serotonin and corazonin expression and showed Eve expression as well as motoneuron phenotypes that partially mimic the U motoneurons of lineage 7-1.

      Thanks for the positive comments!

      Ectopic expression of Fd4 also produced changes in the 5-6 lineage. Expression of apterous, a feature of lineage 5-6, was suppressed, and expression of the 7-1 marker, Eve, was evident. Dbx expression was also evident in the transformed 5-6 lineages, but extremely restricted as compared to a normal 7-1 lineage. Considering the partial redundancy of fd4 and fd5, it would have been interesting to express both genes in the 5-6 lineage. The anatomical changes that are exhibited by motoneurons in response to Fd4 expression confirm that these cells do, indeed, show a shift in their cellular identity.

      We appreciate the positive comments. We agree double misexpression of Fd4 and Fd5 might give a stronger phenotype (as the reviewer says) but the lack of this experiment does not change the conclusions that Fd4 can promote NB7-1 molecular and morphological aspects at the expense of NB5-6 molecular markers.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      The study introduces an open-source, cost-effective method for automating the quantification of male social behaviors in Drosophila melanogaster. It combines machine-learning-based behavioral classifiers developed using JAABA (Janelia Automatic Animal Behavior Annotator) with inexpensive hardware constructed from off-the-shelf components. This approach addresses the limitations of existing methods, which often require expensive hardware and specialized setups. The authors demonstrate that their new "DANCE" classifiers accurately identify aggression (lunges) and courtship behaviors (wing extension, following, circling, attempted copulation, and copulation), closely matching manually annotated ground-truth data. Furthermore, DANCE classifiers outperform existing rule-based methods in accuracy. Finally, the study shows that DANCE classifiers perform as well when used with low-cost experimental hardware as with standard experimental setups across multiple paradigms, including RNAi knockdown of the neuropeptide Dsk and optogenetic silencing of dopaminergic neurons.

      The authors make creative use of existing resources and technology to develop an inexpensive, flexible, and robust experimental tool for the quantitative analysis of Drosophila behavior. A key strength of this work is the thorough benchmarking of both the behavioral classifiers and the experimental hardware against existing methods. In particular, the direct comparison of their low-cost experimental system with established systems across different experimental paradigms is compelling.

      While JAABA-based classifiers have been previously used to analyze aggression and courtship (Tao et al., J. Neurosci., 2024; Sten et al., Cell, 2023; Chiu et al., Cell, 2021; Ishii et al., eLife, 2020; Duistermars et al., Neuron, 2018), the demonstration that they work as well without expensive experimental hardware opens the door to more low-cost systems for quantitative behavior analysis.

      We thank the reviewer for their positive assessment and constructive suggestions. We have cited these additional JAABA studies in the Introduction. We clarified that several prior JAABA-based classifiers were developed using specialized machine-vision cameras or custom setups, and that in some cases the original code and classifiers were not made publicly available, which limits reproducibility and wider adoption. To address this, we explicitly note in the revised manuscript that DANCE was developed with accessibility in mind.

      Although the study provides a detailed evaluation of DANCE classifier performance, its conclusions would be strengthened by a more comprehensive analysis. The authors assess classifier accuracy using a bout-level comparison rather than a frame-level analysis, as employed in previous studies (Kabra et al., Nat Methods, 2013). They define a true positive as any instance where a DANCE-detected bout overlaps with a manually annotated ground-truth bout by at least one frame. This criterion may inflate true positive rates and underestimate false positives, particularly for longer-duration courtship behaviors. For example, a 15-second DANCE-classified wing extension bout that overlaps with ground truth for only one frame would still be considered a true positive. A frame-level performance analysis would help address this possibility.

      We thank the reviewer for raising this important point. Our original use of bout-level analysis followed existing literature (Duistermars et al., 2018; Ishii et al., 2020; Chiu et al., 2021; Tao et al., 2024; Hindmarsh Sten et al., 2025). While our lunge classifier already operates at the frame level, we have now performed additional frame-level evaluations for the duration-based courtship classifiers. These analyses revealed only minor differences in precision, recall, and F1 scores compared with the original bout-level approach (see new Figure 5—Figure Supplement 3). Details of this analysis are now included in the Materials and Methods.
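
      The difference between the two scoring schemes discussed here can be illustrated with a minimal sketch. This is not the DANCE evaluation code: the function names, the (start, end) frame-interval representation of a bout, and the 30 frames-per-second rate are all assumptions made for illustration. The example scores the reviewer's hypothetical case (a 15-second predicted bout that overlaps ground truth by a single frame) under the bout-level criterion and under a frame-level criterion.

```python
from typing import List, Tuple

# Hypothetical representation: a bout is (start_frame, end_frame), end exclusive.
Bout = Tuple[int, int]


def overlap(a: Bout, b: Bout) -> int:
    """Number of frames shared by two bouts."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def bout_level_precision(pred: List[Bout], truth: List[Bout]) -> float:
    """Bout-level precision: a predicted bout counts as a true positive
    if it overlaps any ground-truth bout by at least one frame."""
    if not pred:
        return 0.0
    tp = sum(1 for p in pred if any(overlap(p, t) >= 1 for t in truth))
    return tp / len(pred)


def frame_level_precision_recall(pred: List[Bout], truth: List[Bout]) -> Tuple[float, float]:
    """Frame-level precision and recall: every frame is scored individually."""
    pred_frames = {f for start, end in pred for f in range(start, end)}
    truth_frames = {f for start, end in truth for f in range(start, end)}
    tp = len(pred_frames & truth_frames)
    precision = tp / len(pred_frames) if pred_frames else 0.0
    recall = tp / len(truth_frames) if truth_frames else 0.0
    return precision, recall


FPS = 30  # assumed frame rate, for illustration only

# A 15-second predicted bout that shares exactly one frame with ground truth.
pred = [(0, 15 * FPS)]   # frames 0..449
truth = [(449, 600)]     # frames 449..599

print(bout_level_precision(pred, truth))          # counted as a full true positive
print(frame_level_precision_recall(pred, truth))  # only 1 of 450 predicted frames is correct
```

      In this constructed case the bout-level precision is 1.0 while the frame-level precision is 1/450, which is the kind of inflation the reviewer describes; whether real data show such a gap is an empirical question that the frame-level evaluation addresses.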

      In summary, this work provides a practical and accessible approach to quantifying Drosophila behavior, reducing the economic barriers to the study of the neural and molecular mechanisms underlying social behavior.

      We thank the reviewer for their encouraging comments and for recognizing the accessibility and practical value of our approach. We appreciate the constructive suggestions, which have helped strengthen the manuscript.

      Reviewer #2 (Public review):

      Summary:

      This manuscript addresses the development of a low-cost behavioural setup and standardised open-source high-performing classifiers for aggression and courtship behaviour. It does so by using readily available laboratory equipment and previously developed software packages. By comparing the performance of the setup and the classifiers to previously developed ones, this study shows the classifiers' superior performance and the reliability of the low-cost setup in recapitulating previously described effects of different manipulations on aggression and courtship.

      Strengths:

      The newly developed classifiers for lunges, wing extension, attempted copulation, copulation, following, and circling perform better than previously developed ones. The behavioural setup developed is low cost and reliably allows analysis of both aggression and courtship behaviour, validated through social experience manipulation (social isolation), gene knockdown (Dsk in Dilp2 neurons) and neuronal inactivation (dopaminergic neurons) known to affect courtship and aggression.

      We thank the reviewer for the clear summary of our work and for highlighting its strengths. We appreciate these positive comments and suggestions, which have helped improve the clarity of the manuscript.

      Weaknesses:

      Aggression encompasses multiple defined behaviours, yet only lunges were analysed. Moreover, the CADABRA software to which DANCE was compared analyses further aggression behaviours, making their comparisons incomplete. In addition, though DANCE performs better than CADABRA and Divider in classifying lunges in the behavioural setup tested, it did not yield very high recall and F1 scores.

      We thank the reviewer for raising this important point. We focused on lunges because they are widely used as a standard proxy for male aggression across multiple laboratories (Agrawal et al., 2020; Asahina et al., 2014; Chiu et al., 2021; Chowdhury et al., 2021; Dierick et al., 2007; Hoyer et al., 2008; Jung et al., 2020; Nilsen et al., 2004; Watanabe et al., 2017). As noted in the Discussion, our study also provides a template for the future development of additional aggression classifiers (fencing, wing flick, tussle, chase, female headbutt) and courtship classifiers (tapping, licking, rejection), which can be trained and shared through the same DANCE framework. Developing and validating these was beyond the scope of the present work.

      To address the concern regarding precision, recall, and F1 scores, we performed additional analyses across all training videos and compiled these results in the new Figure 2—Figure Supplement 2. The performance metrics of our earlier lunge classifier were obtained after training on a total of 11 videos. Our new analysis shows performance metrics for classifiers trained on four independent datasets (Videos 8–11). We found that the classifier trained on nine videos provided the best balance of precision, recall, and F1 (78.73%, 73.07%, and 75.79%, respectively), which was slightly better than the earlier classifier. We therefore updated the main figure, text, and Materials and Methods to use this version and uploaded the corresponding classifier and training details to the GitHub repository.
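
      As a quick consistency check on the three reported values, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below (the function name is illustrative and not taken from the DANCE repository) recomputes F1 from the reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the response, in percent.
precision, recall = 78.73, 73.07
f1 = f1_score(precision, recall)
print(f"{f1:.2f}")  # 75.79
```

      The recomputed value agrees with the reported F1 of 75.79%.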

      DANCE is of limited use for neuronal circuit-level enquiries, since mechanisms for intensity and temporally controlled optogenetic manipulations, which are nowadays possible with open-source software and low-cost hardware, were not embedded in its development.

      We thank the reviewer for this valuable point. The primary aim of DANCE is to provide an accessible, modular, and low-cost behavioural recording and analysis platform. It was designed so that users can readily integrate additional components such as optogenetic control when needed. As a proof of concept, we implemented optogenetic silencing of dopaminergic neurons using the DANCE hardware and confirmed that this manipulation increased aggression (Figure 7R). 

      To facilitate adoption, we now provide schematic diagrams, LED control code, and instructions on our GitHub page and setup photographs in the manuscript (see new Figure 7—Figure Supplement 1). The released code allows programmable timing and intensity control, enabling users to reproduce temporally precise optogenetic protocols or extend the system for other stimulation paradigms.

      Reviewer #3 (Public review):

      The preprint by Yadav et al. describes a new setup to quantify a number of aggression and mating behaviors in Drosophila melanogaster. The investigation of these behaviors requires the analysis of a large number of videos to identify each kind of behavior displayed by a fly. Several approaches to automate this process have been published before, but each of them has its limitations. The authors set out to develop a new setup that includes very low-cost, easy-to-acquire hardware and open-source machine-learning classifiers to identify and quantify the behavior.

      Strengths:

      (1) The study demonstrates that their cheap, simple, and easy-to-obtain hardware works just as well as custom-made, specialized hardware for analyzing aggression and mating behavior. This enables the setup to be used in a wide range of settings, from research with limited resources to classroom teaching.

      (2) The authors used previously published software to train new classifiers for detecting a range of behaviors related to aggression and mating and to make them freely available. The classifiers are very positively benchmarked against a manually acquired ground truth as well as existing algorithms.

      (3) The study demonstrates the applicability of the setup (hardware and classifiers) to common methods in the field by confirming a number of expected phenotypes with their setup.

      We thank the reviewer for the positive assessment of our work and for highlighting its strengths. We appreciate these encouraging comments and suggestions, which have helped improve the clarity and presentation of the manuscript.

      Weaknesses:

      (1) When measuring the performance of the duration-based classifiers, the authors count any bout of behavior as a true positive if it overlaps with a ground-truth positive for only 1 frame, even though the minimal duration of a bout is 10 frames and most bouts are much longer. That way, true positives could contain cases that are almost totally wrong as long as there was an overlap of a single frame. For the mating behaviors that are classified in ongoing bouts, I think performance should be evaluated based on the % of correctly classified frames, not bouts.

      We thank the reviewer for raising this concern. In response to this point, and to Reviewer #1’s similar comment, we performed a frame-level evaluation of all duration-based courtship classifiers. The analysis revealed only minor differences compared with the original bout-level metrics (see new Figure 5—Figure Supplement 3), confirming the robustness of our classifiers. We have also added a description of this analysis in the Materials and Methods section.

      (2) In the methods part, only one of the pre-existing algorithms (MateBook) is described. Given that the comparison with those algorithms is such a central part of the manuscript, each of them should be briefly explained and the settings used in this study should be described.

      We thank the reviewer for this helpful suggestion. In the revised manuscript, we expanded the Materials and Methods to include concise descriptions and parameter settings for all pre-existing algorithms used for comparison. This includes dedicated subsections for CADABRA and the Divider assay, with explicit reference to their rule-based or geometric features. For MateBook, we specified the persistence filters used and the adjustments made for fair benchmarking. These changes ensure transparency and reproducibility.

      Taken together, this work can greatly facilitate research on aggression and mating in Drosophila. The combination of low-cost, off-the-shelf hardware and open-source, robust software enables researchers with very little funding or technical expertise to contribute to the scientific process and also allows large-scale experiments, for example in classroom teaching with many students, or for systematic screenings.

      We thank the reviewer for the encouraging comments and for recognizing the accessibility and broad applicability of DANCE. We believe these revisions have further strengthened the manuscript.

      Reviewer #1 (Recommendations for the authors):

      The following comments highlight areas where additional context, clarification, or further analysis could strengthen the manuscript. I hope these suggestions will be useful in refining your work.

      (1) Lines 71-73: The authors state that Ctrax "leads to frequent identity switches among tracked flies, which is not the case while using FlyTracker." However, Ctrax was specifically designed to minimize identity errors, and Kabra et al. (2013) reported a low frequency of such errors: approximately one per five fly-hours in 10-fly videos. In contrast, Caltech FlyTracker does not correct identity errors automatically, requiring manual corrections, as noted in the Methods section of this study. If this is not an oversight, please provide further context to clarify this distinction.

      We thank the reviewer for raising this clarification. As reported by Bentzur et al. (2021), when groups of flies were tracked simultaneously, Ctrax often generated multiple identities for the same individual, sometimes producing more trajectories than the actual number of flies. To prevent ambiguity, we revised the text to read: “While both Ctrax and FlyTracker (Eyjolfsdottir et al., 2014) may produce identity switches, when groups of flies were tracked simultaneously, Ctrax led to inaccuracies that required manual correction using specialized algorithms such as FixTrax (Bentzur et al., 2021).” We also quantified FlyTracker identity-switch rates in our datasets and report them in new Supplementary File 5, confirming that such events were rare (< 2% of tracked intervals). We believe this updated version provides the necessary context and ensures accuracy in describing each tracker’s limitations.

      (2) Line 85: Providing additional context on how this study builds on previous work using JAABA-based classifiers for fly social behavior and comparing these classifiers to rule-based methods would more accurately situate it within the field. The authors state that "recently, a few JAABA-based classifiers have been developed for measuring aggression and courtship" and cite four related studies. However, this statement seems to underrepresent the use of JAABA-based classifiers for quantifying fly social behavior, which has become common in the field. Several additional studies (as noted in the public review) have developed JAABA-based classifiers for scoring aggression or courtship. Furthermore, other studies have compared the performance of JAABA-based classifiers with rule-based classifiers like CADABRA (e.g., Chowdhury et al., Comm Biology 2021; Leng et al., PlosOne 2020; Kabra et al., Nat Methods 2013). Mentioning the similar findings in those studies and your own helps strengthen the conclusion that machine-learning-based classifiers outperform rule-based classifiers in several experimental contexts.

      We thank the reviewer for this helpful suggestion. We have revised the Introduction to include additional references to studies that applied JAABA-based classifiers for aggression and courtship and made textual edits to reflect this. We further noted that, unlike several previous studies, all DANCE classifiers and analysis code are publicly available.

      Reviewer #2 (Recommendations for the authors):

      (1) Suggestions for improved or additional experiments, data or analyses: As mentioned in the description of the effect of optogenetic inactivation of dopaminergic neurons, in the conclusion and also reported in the literature, there are other important identified aggression behaviours, such as fencing, wing flick, tussle, and chase. Similarly, for courtship, tapping and licking have also been defined. This study, as opposed to proposed future studies, would benefit from creating open-source classifiers for these established behaviours, which are important for the analysis of aggression and courtship.

      We thank the reviewer for this valuable suggestion. As clarified in the Discussion, this manuscript intentionally focuses on six core, well-validated aggression and courtship behaviors to demonstrate DANCE’s modularity and reproducibility. Developing additional classifiers such as fencing, wing flick, tussle, chase, tapping, and licking would require extensive annotation and validation beyond the present scope. To address this point, we explicitly note in the revised text that the DANCE pipeline is readily extendable, allowing the community to build new classifiers within the same framework.

      In terms of observer bias assessment for ground-truthing in courtship, this was only presented for circling and it would be beneficial to have encompassed all behaviours analysed.

      We thank the reviewer for this suggestion. Observer-bias comparisons for all six classifiers are presented in Figure 2—Figure Supplement 1 (panels A–F). We clarified in the Results that annotations from two independent evaluators were compared for all classifiers, with no significant differences observed, confirming their robustness.

      Finally, intensity and temporal optogenetic control are important for neuronal circuit analysis of underlying behaviour. The authors could embed this aspect in DANCE by integrating control of the green light LED strip used in this study using, for example, the open-source visual reactive programming software Bonsai (Lopes et al., 2015) and open-source electronics platform Arduino. This is an important and valuable addition in line with maintaining low cost.

      We thank the reviewer for this valuable suggestion. DANCE was designed to be modular, allowing integration of temporal optogenetic control. To support immediate adoption, we now provide Arduino LED control code, setup schematics, and photographs (new Figure 7—Figure Supplement 1) along with step-by-step instructions on our GitHub page. We also note that Bonsai and Arduino frameworks are compatible with DANCE, enabling future extensions for closed-loop or behavior-triggered stimulation.

      (2) Minor corrections to the text and figures:

      Figure Supplement 1 refers only to Figure 2, yet panels D-F refer to the behaviour circling in courtship and therefore should be assigned to the respective figure.

      Thanks, we have corrected this.

      In lines 315-316, the cumbersome task of fluon coating for aggression assays seems to be ubiquitous across assays which is not the case, and therefore the sentence should include the word 'some'.

      Thanks, we have edited this.

      The cost of the phone and/or tablet should be included in the DANCE setup costs, as presumably these devices will be dedicated to the behavioural studies, for consistency purposes.

      We thank the reviewer for this comment. We intentionally did not include smartphones or tablets in the setup cost because, in our experiments, these devices were not dedicated exclusively to DANCE but were repurposed from routine personal use. Our aim was to leverage readily available consumer electronics so that their cost does not become a barrier to adoption. We confirmed that commonly available Android phones capable of 30 fps at 1080p in H.264 format, as well as tablets or phones running a simple white-screen light app, are sufficient for reliable behavior classification and illumination. Since these devices can be returned to regular use after recordings, including their cost in the setup would not accurately reflect the intended accessibility of DANCE. For consistency, we now clarify in the Materials and Methods that such devices should be placed in airplane mode during recordings.

      Reviewer #3 (Recommendations for the authors):

      (1) For my taste, the authors put too much emphasis on the point that their method outperforms existing methods. I understand the value in comparing to published methods and it is of course fully justified to state the advantages of the new method. But the whole preprint is set up as a competition with the old algorithms, and the conclusion that the new classifier is better is repeated in each figure caption and after each paragraph of the results. This competitive mindset also extends to the selection of which results are presented as main figures and which as supplements - all cases in which the previous methods actually perform well are only presented in the supplement. I think this is simply unnecessary as the authors' results speak for themselves, and do not need the continuous competitive comparison.

      We thank the reviewer for this thoughtful suggestion. Our intention was to benchmark DANCE rigorously against existing methods, not to frame the study competitively. We agree that repeated emphasis on relative performance was unnecessary. In the revised version, we streamlined figure captions and text throughout the manuscript to balance comparisons and removed redundant phrasing. Instances where other methods performed well are now presented with equal clarity to maintain a neutral and informative tone.

      (2) When describing the DANCE hardware, as a reader I would find it interesting to also read about potential issues that the authors encountered. For example, how difficult is it to handle the materials without breaking or deforming them, which could affect the behavioral assays? How critical is it to use specific blister packs - the availability of which will likely vary strongly between countries? Did the authors try different sizes, and products? Such information, even as a supplement, could be very helpful for the widespread use of the hardware.

      We thank the reviewer for this important point. To address this, we conducted additional tests comparing DANCE arenas of different diameters (new Figure 7—Figure Supplement 3A–C and new Figure 7—Figure Supplement 4A–L). We also consulted colleagues in multiple countries and verified that the blister packs used in our assays are readily available. The Materials and Methods now include practical handling notes: blister foils can be reused ~30–40 times for aggression assays and ~10–15 times for courtship assays before deformation. We also describe how to prevent agar surface damage during assembly and how to wash and dry the arenas for optimal reusability.

      (3) I find the arrows pointing to several videos in a number of figures rather distracting and redundant, and suggest omitting them.

      Thanks, we have omitted these arrows from all relevant figures and clarified the figure legends to enhance readability.

      (4) P8, line 169 ff: this is a very long sentence that should be separated into several sentences.

      We have rewritten this as follows: “DANCE scores remained comparable to ground-truth scores across all categories, whereas CADABRA and Divider underestimated the lunge counts (Figure 2B–E). Correlation analysis revealed a strong relationship between DANCE and ground-truth scores (Figure 2F, Supplementary File 2). In comparison, CADABRA and the Divider assay classifier showed a weaker correlation (Figure 2G–H, Supplementary File 2).”

      (5) P10, line 216: please explain, here and in the methods, how these behavioral indices are calculated. I did not find this information anywhere in the paper.

      We thank the reviewer for pointing this out. We now define the behavioral index explicitly in Materials and Methods: “For each assay, a behavioral index was calculated as the proportion of frames in which the male engaged in the specified behavior. This was obtained by dividing the total number of frames annotated for that behavior by the total number of frames in the recording.”
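
      As a minimal sketch of this definition (the function name and the per-frame boolean input are illustrative assumptions, not the DANCE code), the index is simply the annotated frame count divided by the total frame count:

```python
def behavioral_index(frame_labels):
    """Return the proportion of frames annotated for a behavior.

    frame_labels: one truthy/falsy value per video frame, where a truthy
    value means the male was scored as performing the behavior in that
    frame (hypothetical input format, assumed for illustration).
    """
    labels = [bool(x) for x in frame_labels]
    if not labels:
        raise ValueError("recording contains no frames")
    # frames annotated for the behavior / total frames in the recording
    return sum(labels) / len(labels)


if __name__ == "__main__":
    # 3 annotated frames out of 12 total frames
    example = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    print(behavioral_index(example))
```

      Under this definition the index is bounded between 0 and 1 and does not depend on recording length, provided the frame rate is constant within a recording.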

      (6) P11, line 253: I don't understand the modifications to MateBook regarding attempted copulations, neither in the results nor the methods section. I would ask the authors to explain more explicitly what was done.

      We thank the reviewer for this helpful suggestion. We have re-written several parts of the Materials and Methods to clarify these details and streamline the text. To train the attempted copulation classifier, we combined datasets from assays with mated and decapitated virgin females, using manual annotations as ground truth. We also adapted MateBook’s persistence filters (Ribeiro et al., 2018) and defined thresholds explicitly: mounting lasting >45 s (>1350 frames at 30 fps) was defined as copulation, whereas abdominal curling without mounting, or mounting lasting 0.33–45 s, was defined as attempted copulation.
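
      The duration thresholds above can be sketched as follows. This is an illustration only: the function and constant names are hypothetical, it covers only the mounting-duration rule (not abdominal curling without mounting), and the handling of bouts shorter than 0.33 s is an assumption.

```python
FPS = 30                 # frame rate stated in the response
MIN_ATTEMPT_S = 0.33     # lower bound for attempted copulation (seconds)
COPULATION_S = 45.0      # mounting longer than this counts as copulation


def classify_mounting_bout(n_frames, fps=FPS):
    """Label one continuous mounting bout by its duration.

    n_frames: number of consecutive frames scored as mounting.
    Returns "copulation", "attempted copulation", or "unclassified"
    (the last label is an assumption for bouts shorter than 0.33 s).
    """
    duration_s = n_frames / fps
    if duration_s > COPULATION_S:        # >45 s, i.e. >1350 frames at 30 fps
        return "copulation"
    if duration_s >= MIN_ATTEMPT_S:      # 0.33-45 s
        return "attempted copulation"
    return "unclassified"


if __name__ == "__main__":
    for frames in (5, 10, 1350, 1351):
        print(frames, classify_mounting_bout(frames))
```

      Expressing the thresholds in seconds and converting from frames keeps the rule valid if a recording uses a frame rate other than 30 fps.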

      (7) Figure 7F: this is the only case with a significant difference between the two setups. What explanations do the authors have for the discrepancy?

      We thank the reviewer for raising this point. After repeating the experiments, we no longer found a significant difference between the setups. Figure 7 and its legend have been updated to reflect these results.

      (8) Figure 2 - Supplement 1: I do not understand why the boxes for Observer 1 have different colors in different figures. Does this have a meaning?

      Thanks for pointing this out. The color differences had no intended meaning, and we have corrected the figure for consistency across panels.

      (9) P22, line 517ff: It would be interesting to know how frequently identity switches occurred. For large-scale, automatic behavioral screenings that step could be a crucial bottleneck.

      We thank the reviewer for this valuable suggestion. We analyzed identity switches using the FlyTracker “Visualizer” package, which flags frames with possible overlaps or jumps. Flagged intervals were manually verified, and we report these data in new Supplementary File 5. Identity switch rates were very low: 0.66% for high-resolution recordings and 1.9% for smartphone DANCE videos in the most challenging decapitated-virgin dataset. These findings demonstrate robust tracking performance under both setups.

    1. Practicing decolonial allyship within a White settler queer family, also means deepening an understanding of the way colonial narratives may be embedded within “social justice,” “intersectional,” or “critical literacy” discourses and practices despite their claim to do the opposite. For example, it has been important to Cindy that the story her daughter hears (and tells) about Indigenous people in Canada, is not only a story of oppression but also of resistance and resilience.

      This passage really made me think. A lot of the time, when we learn about Indigenous people, we often hear about how they were oppressed, and it focuses on their suffering, but it hardly ever mentions their strength to stand tall despite the oppression they face every single day. I think it's important for both facts to coexist.

      As the text mentions, things like social justice aren't always upheld. This is because, inherently, the structures in Western society benefit White colonizers the most. This raises the question of, "What can we do to change the structure that oppresses Indigenous people and people of colour?"

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Reviewer #1

      1. First, the authors have not convincingly shown that skin cells, or more specifically skin ECs, are a major source of circulating G-CSF in the psoriasis model as stated in the title and abstract. The data in Figure 4 show selective upregulation of Csf3 gene in skin ECs and their ability to secrete G-CSF upon IMQ treatment in vitro. However, the provided data do not address to what degree the skin EC-derived G-CSF contributes to the elevated level of circulating G-CSF. Additional experiments to selectively deplete G-CSF in skin ECs, or at least in skin cells of the affected site, are warranted to support the authors' claim. Does intradermal injection of G-CSF neutralizing antibody into the psoriatic skin reduce circulating levels of G-CSF?

      Author's response:

      Thank you for the reviewer's comment. We agree with Reviewer#1 that it is important to directly block G-CSF in the skin via intradermal injection and then measure the G-CSF level in the serum. Therefore, we will perform intradermal injection of IgG isotype or anti-G-CSF antibody into IMQ-induced psoriatic mice.

      Another concern is insufficient demonstration of G-CSF-mediated emergency granulopoiesis in the psoriasis model. All data in Figure 5 were obtained from experiments with only n=3, and adding more replicates, in particular to those in Figure 5B, which show quite some variation in MPP numbers, is recommended. The relatively small reduction of BM granulocyte numbers (Figure 5C) compared to greater depletion of circulating granulocytes (Figure S5A) raises the possibility that it is the mobilization effect rather than granulopoiesis-stimulating effect that skin-derived G-CSF exerts to promote supply of circulating neutrophils that eventually infiltrate into the affected skin. This could also explain the negligible effect of IL-1 blockade (Figure S4), which selectively shuts off the myelopoiesis-stimulating effect of IL-1 (Pietras et al. Nat Cell Biol 2016, PMID: 27111842). Are the HSPCs in the psoriasis model more cycling? Do they show myeloid-skewed differentiation when cultured ex vivo or upon transplantation?

      Author's response: Thank you for these critical comments. We agree to do the following experiments to address them:

      1) HSPCs quantification in Figure 5 especially the MPPs will be added with more replicates.

      2) We will assess the cycling status of HSPCs by flow cytometric analysis of Ki67 and Propidium Iodide to characterize the G0, G1 and G2/M cell cycle phases.

      3) To test myeloid-skewed differentiation, Lin- c-Kit+ Sca-1+ cells containing HSPCs will be isolated from bone marrow of Vas/IMQ-treated mice and transplanted into lethally irradiated syngeneic mice.

      The authors' claim that skin-derived G-CSF "induces" neutrophil infiltration warrants further clarification. An alternative explanation is that the upregulated neutrophil-attracting chemokines (Figure S1D) could induce infiltration, whereas G-CSF increases the number of neutrophils circulating in the vessels near the psoriatic skin. This notion seems supported elsewhere (Moos et al. J Invest Dermatol. 2019, PMID: 30684554). Can the infiltration be inhibited by systemically injecting neutralizing antibody of their receptor, CXCR2?

      Author's response: The manuscript focuses on the function of skin-derived G-CSF as a long-distance signal for emergency granulopoiesis in the bone marrow upon psoriasis, not on its chemoattractant property. The sentence of interest is "We found that upon psoriasis induction, skin-resident endothelial cells are activated to produce G-CSF which activates emergency granulopoiesis in bone marrow and induces cutaneous infiltration and accumulation of neutrophil that are functionally inflammatory." in lines 28-30. In agreement with point #2 from Reviewer#2, the upregulation of neutrophil recruitment factors (CXCL1, CXCL2, and CXCL5) in psoriatic skin (Figure S1D) suggests CXCL-mediated neutrophil recruitment. The sentence of concern needs to be changed to "We found that upon psoriasis induction, skin-resident endothelial cells are activated to produce G-CSF which activates emergency granulopoiesis in bone marrow, leading to cutaneous accumulation of neutrophil that are functionally inflammatory.". This revised sentence omits the proposal that G-CSF directly dictates neutrophil mobilization to the skin, which is not the key message of the study. We therefore consider that the CXCR2 (CXCL receptor) blockade experiment would be better suited to future studies.

      It remains unclear how skin-derived G-CSF accumulates pathogenic neutrophils. The authors state "pathogenic granulopoiesis," but are the circulating neutrophils in the psoriatic mice already "pathogenic" or do they acquire pathogenic phenotype after cutaneous infiltration? Additional RNA-seq to compare circulating and infiltrated neutrophils would answer this question.

      Author's response: We appreciate this valuable comment. We will perform RNA-seq on peripheral blood-circulating neutrophils (CD45+ CD11b+ Ly6G+ Ly6Cmid) versus skin-infiltrating neutrophils from both Vas- and IMQ-treated mice.

      In addition, how the accumulated pathogenic neutrophils exacerbate the psoriatic changes remains obscure. Although the authors have attempted to correlate Il17a gene expression in infiltrated neutrophils with psoriatic skin changes, the data do not address to what degree it contributes to cutaneous IL-17A protein levels. The data that cutaneous neutrophil depletion leads to subtle decrease in skin IL-17A expression (Figure 2H) rather supports alternative possibilities. For instance, as indicated elsewhere, IL-17A cutaneous tone could be enhanced by neutrophil-mediated augmentation of Th17 or gamma/delta T cell function (Lambert et al. J Invest Dermatol. 2019, PMID: 30528823). Does neutrophil depletion or G-CSF neutralization alter cell numbers or function of cutaneous Th17 and gamma/delta T cells?

      Author's response: Thank you for this insightful comment. To better understand the relative contribution of neutrophils to the cutaneous IL-17A tone in psoriatic skin, we will perform flow cytometric analysis of Th17 and gamma/delta T cells, which are widely known as the major sources of IL-17 in the psoriatic skin of IMQ-induced mice, following injection of isotype-matched or anti-Ly6G antibody.

      Finally, as the above conclusions rely solely on the IMQ-induced acute psoriasis model, it would be informative if they could be derived from another psoriasis model. IMQ is known to induce unintended systemic inflammation due to grooming-associated ingestion (Gangwar et al. J Invest Dermatol. 2022, PMID: 34953514), and "pathological crosstalk between skin and BM in psoriatic inflammation" could be strengthened by an intradermal injection model.

      Author's response: We thank the reviewer for raising this important point. Regarding the systemic inflammation upon psoriasis, the above-cited study reported increased IFN-B expression in the intestines of IMQ-ingested animals (Grine L et al. Sci Rep. 2016, PMID: 26818707 in Gangwar et al. J Invest Dermatol. 2022, PMID: 34953514). We examined several pro-inflammatory cytokines including IFN-b, IFN-g, and IL-6 and, in contrast, found no systemic increase in any of these cytokines; the only change was a downregulation of IFN-g (Explanation Figure 1). This suggests no evidence of grooming-associated ingestion.

      We also examined Csf3 expression across several distinctly located tissues, which showed selective upregulation in the skin (Figure 4C), suggesting a skin-restricted perturbation. In addition, one study showed that IMQ ingestion neither altered the number of gut injury-associated CXCR3+ macrophages nor aggravated skin inflammation (Pinget et al. Cell Reports. 2022, PMID: 35977500). Together, these findings support that IMQ-induced psoriasis by topical cutaneous application, as used in our study, elicits local but not systemic inflammation.

      The authors, however, realize that testing alternative psoriasis model such as intradermal injection of IL-23 (Chan et al. J Exp Med. 2006, PMID: 17074928) will strengthen the skin-local insults within the psoriasis model employed, and should be tested in the future.

      Minor comments

      Figure 1E shows multiple elongated Ly6G+ structures in d0-2 control and d0 IMQ skins that do not appear to be neutrophils.

      Author's response: We appreciate Reviewer#1 pointing out this issue. As mentioned by Reviewer#1, the elongated structures detected in the intravital microscopy are not neutrophils, but autofluorescence from the skin bulge regions (Wun et al. J Invest Dermatol. 2005, PMID: 15816847). We have eliminated these unspecific signals from the transformation and quantification (Figure 1F, S1G, and S1H). We will also add an explanatory sentence in the Materials and Methods section, "Of note, the fluorescent signal with elongated structures resembling hair bulge were autofluorescence and thus removed from further analysis.", to be more precise about our methods.

      In Figure 2C, the bottom GSEA seems to be showing type II IFN response, not type I IFN, according to the text.

      Author's response: Thank you for the comment, we will correct this misspelling.

      Author's response: We appreciate that Reviewer#1 brought up this point. We examined the kinetics of bone marrow cellularity and GMPs across 4 days of psoriasis induction in mice. The bone marrow cell number was reduced over that span, with the lowest count at 2 days. Consistent with the BM cellularity, the GMP number was also lowered about one-third in the first 2 days of psoriasis. These kinetics are consistent with a previous report showing a rapid reduction of GMPs in the bone marrow within 2 days following emergency granulopoiesis driven by systemic G-CSF administration (Hirai et al. Nat. Immunol. 2006, PMID: 16751774). From 2 days to 4 days, the GMP number rapidly increased to slightly above the basal number (Explanation Figure 2). This timely coordinated expansion suggests a significant supply of GMPs from the differentiating upstream myeloid progenitors (Figure 3B).

      When psoriatic mice with elevated G-CSF were injected with anti-G-CSF or IgG-isotype antibody, the bone marrow cellularity and GMP numbers at 4 days were unchanged (Explanation Figure 3). Firstly, as psoriasis reduced bone marrow cellularity (Explanation Figure 2), the unchanged number after anti-G-CSF injection indicates that administration of 10µg/day for 4 days does not significantly affect mobilization of psoriatic bone marrow cells. Secondly, the similar GMP numbers at 4 days of psoriasis are plausibly due to snapshot analysis at a time point already within the numerical recovery period (Explanation Figure 2). Importantly, the finding that anti-G-CSF injection into psoriatic mice reduced granulocytes in the bone marrow, peripheral blood, and skin suggests that G-CSF is a key mediator of psoriasis-driven emergency granulopoiesis and makes ineffective anti-G-CSF treatment unlikely.

      Taken together, these data suggest that G-CSF-mediated emergency granulopoiesis occurs in IMQ-induced psoriasis. We will put these data into a revised Figure.

      In Figure 6B, in which cluster of human skin cells would IL-17A expression be enriched?

      Author's response: Thank you for this important point. IL-17A expression is found in the T-cell cluster (Explanation Figure 4). We also expected to see an IL-17A contribution from other cell subset(s), in particular neutrophils. However, owing to the fragile nature of neutrophils and the resulting technical difficulty in obtaining their sequencing reads, this dataset (GSE173706) doesn't contain neutrophils, but rather monocytes, macrophages, and dendritic cells among the myeloid subset (Explanation Figure 5). This leaves open the question of what the potential contribution of neutrophil-produced IL-17A is in human psoriasis (Reich et al. Exp. Dermatol. 2015, PMID: 25828362).

      Reviewer#2

      1. Interpretation of neutrophil transcriptomic changes (Figure 2)

      The RNA-seq analysis reveals substantial downregulation of several canonical pro-inflammatory pathways in neutrophils from psoriatic skin, including IL-6, IL-1, and type II interferon signaling. The authors should discuss the functional relevance of this unexpected transcriptional repression. For example, does this indicate a shift toward specialized effector functions rather than classical cytokine responsiveness? More importantly, the most striking transcriptional change is the upregulation of NADPH oxidase-related genes (e.g., Nox1, Nox3, Nox4, Enox2). This suggests an oxidative stress-driven pathogenic mechanism, potentially more relevant than IL-17A production. Yet this aspect is not explored in the manuscript. Assessing ROS levels or oxidative neutrophil effector functions in this model would considerably strengthen the mechanistic link. Conversely, although IL-17A is upregulated in neutrophils, neutrophil depletion reduces total Il17a expression in skin only partially. This indicates that neutrophils are unlikely to be the dominant IL-17A source in the lesion. The authors' focus on neutrophil-derived IL-17A therefore seems overstated. A more rigorous assessment (e.g., conditional deletion of Il17a specifically in neutrophils) would be required to establish its true contribution. Taken together, the data suggest that oxidative programs, rather than IL-17A production, may represent the principal pathogenic axis downstream of neutrophils, and this deserves deeper discussion.

      Author's response: Thank you for raising these valuable points. We agree to address these critical points with the following approaches:

      1) To address the changes in NADPH oxidase-related gene signature, we will measure ROS production in the neutrophils from skin and peripheral blood with DHR123.

      2) Regarding the contribution of neutrophils to IL-17A, we will assess by flow cytometry the Th17 and gamma/delta T cell populations in the skin of psoriatic mice treated with anti-Ly6G or isotype-matched antibody, as suggested by Reviewer#1.

      3) We will discuss the downregulation of the canonical pro-inflammatory and IL-17 pathways in psoriatic neutrophils in the Discussion.

      Human data reanalysis (Figure 6):

      The re-analysis of bulk and single-cell RNA-seq datasets is valuable but incomplete. Several mechanistically relevant questions could be addressed with the available data:

      2.1. GM-CSF (CSF2) is also strongly upregulated in psoriatic lesions (bulk RNA-seq). It would be informative to determine whether endothelial cells also express CSF2 in the scRNA-seq dataset, as this would suggest coordinated regulation of myeloid-supporting cytokines.

      2.2. Myeloid cell subsets should be examined more closely. A comparison of human myeloid transcriptomes with the mouse neutrophil RNA-seq would clarify whether similar IL-17A-related or NADPH oxidase-related signatures occur in human disease. In particular, which cell types express IL17A in human lesions?

      2.3. Chemokine production should be attributed to specific cell types. Bulk RNA-seq confirms strong induction of CXCL1, CXCL2, CXCL5, but the scRNA-seq dataset allows determining whether these chemokines originate from endothelial cells or other stromal/immune populations. This information is important for defining whether endothelial cells coordinate both neutrophil recruitment and granulopoiesis.

      Addressing these points would make the human-mouse comparison substantially stronger.

      Author's response: Thank you for pointing out these important issues. By reanalyzing the dataset, we found the following regarding these comments:

      2.1) CSF2 is expressed by the T-cell cluster in the human skin dataset (Explanation Figure 4), in agreement with a previous murine study (Hartwig et al. Cell Reports. 2018, PMID: 30590032). We will add these data to the revised manuscript.

      2.2) In line with point#10 from Reviewer#1, the dataset clearly shows the T-cell cluster as the main IL17A source (Explanation Figure 4 above). The dataset, however, doesn't contain phenotypic neutrophils (CEACAM (CD66b) and PGLYRP1) but monocytes, macrophages, and dendritic cells (Explanation Figure 5 above). This loss was probably due to a technical limitation given the difficulty in capturing sequencing reads from fragile neutrophils. Therefore, it is not possible to reanalyze neutrophil IL-17 expression, as neutrophils are absent from the dataset.

      2.3) Reanalysis of CXCLs in the human scRNAseq dataset (GSE173706) clarified their secretion dynamics and cellular sources under normal and psoriatic conditions. In normal skin, all examined cell subsets show only low CXCLs expression. In contrast, psoriatic skin exhibits significant CXCLs upregulation, with distinct cell subsets showing dramatic upregulation and potentially being the major CXCLs sources. CXCL1 is markedly upregulated in fibroblasts, myeloid cells, and melanocyte and nerve cells. CXCL2 is strikingly upregulated in myeloid cells, while CXCL5 is hugely increased in fibroblasts, myeloid cells, and mast cells (Explanation Figure 7). Taken together, these results suggest that CXCLs upregulation in the psoriatic skin is coordinately executed by both stromal and immune compartments. Of note, the endothelial cells show minimal changes in CXCLs expression, and even downregulate CXCL2 in psoriasis, indicating that they are unlikely to be the major contributor to CXCL-mediated neutrophil recruitment.

      **Referees cross-commenting**

      I agree with Reviewer 1 that the contribution of EC-derived G-CSF to circulating G-CSF levels and to emergency myelopoiesis requires additional genetic or neutralization experiments to be fully established.

      Author's response: We appreciate that Reviewer#2 raised this key point. In addition to examining the serum G-CSF upon intradermal anti-G-CSF administration in point#1 from Reviewer#1 above, we will also examine the emergency myelopoiesis signs in vivo.

      Minor points

      1. Line 319: the text likely refers to Figure S4, not S3.

      Author's response: Thank you, we will correct the nomenclature.

      Line 338: "psoriatic" is misspelled.

      Author's response: Thank you, we will change this to "psoriatic".

      Reviewer #3

      • Place the work in the context of the existing literature (provide references, where appropriate).

      Psoriasis is extensively studied, a good recent reference- https://doi.org/10.1016/j.mam.2024.101306

      Author's response: Thank you for Reviewer#3's suggestion. The referenced study highlights the current paradigm, which largely focuses on skin-restricted mechanisms and overlooks potential cross-organ interactions in psoriatic inflammation. Our findings provide new insight into skin-bone marrow crosstalk in the disease context. In addition, the suggested reference underscores the key roles of diverse innate immune cells including neutrophils, eosinophils, dendritic cells, etc., which is fundamental for our study and might also guide future exploration of additional innate cell subsets beyond neutrophils. We will therefore include the mentioned reference in our revised manuscript.

      • Do you have suggestions that would help the authors improve the presentation of their data and conclusions?

      It is all good. May add graphical-abstract.

      Author's response: We thank the reviewer for this input. We agree that a graphical abstract will help readers grasp the key messages of our manuscript more clearly, and we will include one in the revised manuscript.

      Major comments:

      • Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?

      No. It is very solid.

      Author's response: We appreciate the reviewer's view that the claims in our paper are solid.

      • Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.

      Such a discovery clearly opens many options, and it is fascinating to suggest additional experiments for future studies. It is a complete study, best to publish as-is and let many to read and proceed with this new concept.

      Author's response: We thank the reviewer for noting that the current experimental evidence is complete and that no additional experiments are necessary at this stage. We agree that the discovery opens promising directions for future studies.

      • Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments.

      N/A - I suggest no additional experiments at this point. Get it published and see how many will follow this new direction!

      Author's response: We thank the reviewer for recognizing that the experimental data are sufficient to serve as a foundation for future research.

      • Are the data and the methods presented in such a way that they can be reproduced?

      Yes.

      Author's response: We thank the reviewer for recognizing that our methods are reproducible.

      • Are the experiments adequately replicated, and is the statistical analysis adequate?

      Yes. The data are of very high quality.

      Author's response: We are grateful that the reviewer considers our replication strategy and statistical analysis to be of high quality.

      Minor comments:

      • Specific experimental issues that are easily addressable.

      None. It is good as-is. One may always suggest minor things- but this one is better published so many laboratories may rush for this new direction. I think it will be interesting studying some long-term impacts, and changes not only of neutrophils but also of other innate cells, such as DCs, Macrophages, and Eosinophils - so it is best to let laboratories that focus on these cells know of the discovery and pursue independent studies.

      Author's response: We appreciate the reviewer's assessment that our paper is already well set for the community to explore the newly proposed direction.

      • Are the text and figures clear and accurate?

      Yes.

      Author's response: We thank the reviewer for this evaluation. We have ensured that the text and figures in our manuscript are clear and accurate. Once again, we thank the reviewer for the encouraging and constructive appraisal. We are pleased that the reviewer finds the manuscript strong and suitable for publication.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary The manuscript by Aarts et al. explores the role of GRHL2 as a regulator of the progesterone receptor (PR) in breast cancer cells. The authors show that GRHL2 and PR interact in a hormone-independent manner and based on genomic analyses, propose that they co-regulate target genes via chromatin looping. To support this model, the study integrates both newly generated and previously published datasets, including ChIP-seq, CUT&RUN, RNA-seq, and chromatin interaction assays, in breast cancer cell models (T47DS and T47D).

      Major comments: R1.1 Novelty of GRHL2 in steroid receptor biology The role of GRHL2 as a co-regulator of steroid hormone receptors has previously been described for ER (J Endocr Soc. 2021;5(Suppl 1):A819) and AR (Cancer Res. 2017;77:3417-3430). In the ER study, the authors also employed a GRHL2 ΔTAD T47D cell model. Therefore, while this manuscript extends GRHL2 involvement to PR, the contribution appears incremental rather than conceptual.

      We are fully aware of the previously described role of GRHL2 as a co-regulator of steroid hormone receptors, particularly ER and AR. As acknowledged in our introduction (lines 104-108), we explicitly state: "Grainyhead-like 2 (GRHL2) has recently emerged as a potential pioneer factor in hormone receptor-positive cancers, including breast cancer21. However, nearly all studies to date have focused on GRHL2 in the context of ER and estrogen signaling, leaving its role in PR- and progesterone-mediated regulation unexplored22-26".

      As for the specific publications that the reviewer refers to: The first refers to an abstract from an annual meeting of the Endocrine Society. As we have been unable to assess the original data underpinning the abstract - including the mentioned GRHL2 ΔTAD model - we prefer not to cite this particular reference. We do cite other work by the same authors (Reese et al. 2022, our ref. 25). We also cite the AR study mentioned by the reviewer (our ref. 55) in our discussion. As such, we think we do give credit to prior work done in this area.

      By characterizing GRHL2 as a co-regulator of the progesterone receptor (PR), we expand on the current understanding of GRHL2 as a common transcriptional regulator within the broader context of steroid hormone receptor biology. Given that ER and PR are frequently co-expressed and active within the same breast cancer cells, our findings raise the important possibility that GRHL2 may actively coordinate or modulate the balance between ER- and PR-driven transcriptional programs, as postulated in the discussion paragraph.

      Importantly, we also functionally link PR/GRHL2-bound enhancers to their target genes (Fig5), providing novel insights into the downstream regulatory networks influenced by this interaction. These results not only offer a deeper mechanistic understanding of PR signaling in breast cancer but also lay the groundwork for future comparative analyses between GRHL2's role in ER-, AR-, and PR-mediated gene regulation.

      As such, we respectfully suggest that our work offers more than an incremental advance in our knowledge and understanding of GRHL2 and steroid hormone receptor biology.

      R1.2 Mechanistic depth The study provides limited mechanistic insight into how GRHL2 functions as a PR co-regulator. Key mechanistic questions remain unaddressed, such as whether GRHL2 modulates PR activation, the sequential recruitment of co-activators/co-repressors, engages chromatin remodelers, or alters PR DNA-binding dynamics. Incorporating these analyses would considerably strengthen the mechanistic conclusions.

      Although our RNA-seq data demonstrate that GRHL2 modulates the expression of PR target genes, and our CUT&RUN experiments show that GRHL2 chromatin binding is reshaped upon R5020 exposure, we acknowledge that we have not further dissected the molecular mechanisms by which GRHL2 functions as a PR co-regulator.

      We did consider several follow-up experiments to address this, including PR CUT&RUN in GRHL2 knockdown cells, CUT&RUN for known co-activators such as KMT2C/D and P300, as well as functional studies involving GRHL2 TAD and DBD mutants. However, due to technical and logistical challenges, we were unable to carry out these experiments within the timeframe of this study.

      That said, we fully recognize that such approaches would provide deeper mechanistic insight into the interplay between PR and GRHL2. We have therefore explicitly acknowledged this limitation in the limitations of the study section (lines 502-507) and mention it as an important avenue for future investigation.

      R1.3 Definition of GRHL2-PR regulatory regions (Figure 2) The 6,335 loci defined as GRHL2-PR co-regulatory regions are derived from a PR ChIP-seq performed in the presence of hormone and a GRHL2 ChIP-seq performed in its absence. This approach raises doubts about whether GRHL2 and PR actually co-occupy these regions under ligand stimulation. GRHL2 ChIP-seq experiments in both hormone-treated and untreated conditions are necessary to provide stronger support for this conclusion.

      We agree that the ChIP-seq experiments we present do not provide a definitive answer as to whether GRHL2 and PR co-occupy these regions under ligand stimulation; indeed, bulk ChIP-seq cannot definitively demonstrate simultaneous binding of PR and GRHL2 at the same genomic regions. As a first step to address this, we performed CUT&RUN experiments for both GRHL2 and PR under untreated and R5020-treated conditions. These experiments revealed a subset of overlapping PR and GRHL2 binding sites (approximately 5% of the identified PR peaks under ligand stimulation).
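To make explicit what "overlapping binding sites" can mean in practice, the following is a minimal sketch of a peak-overlap count. It is not the pipeline used in the manuscript: the one-base-pair overlap criterion, the function name `count_overlapping`, and the toy coordinates are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# A peak is (chromosome, start, end) with a half-open interval [start, end).
Peak = Tuple[str, int, int]


def count_overlapping(query: List[Peak], reference: List[Peak]) -> int:
    """Count query peaks that share at least 1 bp with any reference peak."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = 0
    for chrom, start, end in query:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < r_end and r_start < end
               for r_start, r_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits


# Toy coordinates, invented for illustration only (not real PR or GRHL2 peaks).
pr_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
grhl2_peaks = [("chr1", 150, 250), ("chr2", 300, 400)]

shared = count_overlapping(pr_peaks, grhl2_peaks)
fraction_of_pr = shared / len(pr_peaks)
print(f"{shared} of {len(pr_peaks)} PR peaks overlap a GRHL2 peak "
      f"({100 * fraction_of_pr:.1f}%)")
```

The reported fraction depends on the choice of denominator (here, PR peaks) and on the overlap criterion, so percentages obtained with different settings are not directly comparable.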

      We specifically chose CUT&RUN to minimize artifacts from crosslinking and sonication, thereby reducing background and enabling the mapping of high-confidence direct DNA-binding events: Given that a fraction of GRHL2 physically interacts with PR (Fig1D), it is possible that ChIP-seq detects indirect binding of GRHL2 at PR-bound sites and vice versa. CUT&RUN, by contrast, allows us to identify direct binding sites with higher confidence.

      Nonetheless, although outside the scope of the current manuscript, we agree that a dedicated GRHL2 ChIP with and without ligand stimulation would provide additional insight, and we have accordingly added this suggestion to the discussion (lines 502-507).

      R1.4 Cell model considerations The manuscript relies heavily on the T47DS subclone, which expresses markedly higher PR levels than parental T47D cells (Aarts et al., J Mammary Gland Biol Neoplasia 2023; Kalkhoven et al., Int J Cancer 1995). This raises concerns about physiological relevance. Key findings, including co-IP and qPCR-ChIP experiments, should be validated in additional breast cancer models such as parental T47D, BT474, and MCF-7 cells to generalize the conclusions. Furthermore, data obtained from T47D (PR ChIP-seq, HiChIP, CTCF and Rad21 ChIP-seq) and T47DS (RNA-seq, CUT&RUN) are combined along the manuscript. Given the substantial differences in PR expression between these cell lines, this approach is problematic and should be reconsidered.

      We agree that physiological relevance is important to consider. Here, all existing model systems have some limitations. In our experience, it is technically challenging to robustly measure gene expression changes in parental T47D cells (or MCF7 cells, for that matter) in response to progesterone stimulation (Aarts et al., J Mammary Gland Biol Neoplasia 2023). As we set out to link PR and GRHL2 binding to downstream target gene induction, we therefore opted for the most progesterone-responsive model system (T47DS cells). We agree that observations made in T47D and T47DS cells should not be overinterpreted and require further validation. We have now explicitly acknowledged this and added it to the discussion (lines 507-509).

      As for the reviewer's suggestion to use MCF7 cells: apart from its suboptimal PR-responsiveness, this cell line is also known to harbor a GRHL2 amplification, resulting in elevated GRHL2 levels (Reese et al., Endocrinology 2019). By that line of reasoning, the use of MCF7 cells would also introduce concerns about physiological relevance. That being said, and as noted in the discussion (lines 390-391), the study by Mohammed et al., which identified GRHL2 as a PR interactor using RIME, was performed in both MCF7 and T47D cells. This further supports the notion that the PR-GRHL2 interaction is not limited to a single cell line.

      R1.5 CUT&RUN vs ChIP-seq data The CUT&RUN experiments identify fewer than 10% of the PR binding sites reported in the ChIP-seq datasets. This discrepancy likely results from methodological differences (e.g., absence of crosslinking, potential loss of weaker binding events). The overlap of only 158 sites between PR and GRHL2 under hormone treatment (Figure 3B) provides limited support for the proposed model and should be interpreted with greater caution.

      We acknowledge the discrepancy in the number of binding sites between ChIP-seq and CUT&RUN. Indeed, methodological differences likely contribute to the differences in PR binding sites reported between the ChIP-seq and CUT&RUN datasets. As the reviewer correctly notes, the absence of crosslinking and sonication in CUT&RUN reduces detection of weaker binding events. However, it also reduces the detection of indirect binding events, which could increase the reported number of peaks in ChIP-seq data (e.g. the common presence of "shadow peaks").

      As also discussed in our response to R1.3, we deliberately chose the CUT&RUN approach to enable the identification of high-confidence direct DNA-binding events. Since GRHL2 physically interacts with PR, ChIP-seq could potentially capture indirect binding of GRHL2 at PR-bound sites, and vice versa. By contrast, CUT&RUN primarily captures direct DNA-protein interactions, offering a more specific binding profile. Thus, while the number of CUT&RUN binding sites is much smaller than previously reported by ChIP-seq, we are confident that they represent true, direct binding events.

      We would also like to emphasize that the model presented in figure 6 does not represent a generic or random gene, but rather a specific gene that is co-regulated by both GRHL2 and PR. In this specific case, regulation is proposed to occur via looping interactions from either individual TF-bound sites (e.g., PR-only or GRHL2-only) or shared GRHL2/PR sites. We do not propose that only shared sites are functionally relevant, nor do we assume that GRHL2 and PR must both be directly bound to DNA at these shared sites. Therefore, overlapping sites identified by ChIP-seq (potentially reflecting indirect binding events) could indeed be missed by CUT&RUN, yet still contribute to gene regulation. To clarify this, we have revised the main text (lines 331-334) and the legend of Figure 6 to explicitly state that the model refers to a gene with established co-regulation by both GRHL2 and PR.

      R1.6 Gene expression analyses (Figure 4) The RNA-seq analysis after 24 hours of hormone treatment likely captures indirect or secondary effects rather than the direct PR-GRHL2 regulatory program. Including earlier time points (e.g., 4-hour induction) in the analysis would better capture primary transcriptional responses. The criteria used to define PR-GRHL2 co-regulated genes are not convincing and may not reflect the regulatory interactions proposed in the model. Strong basal expression changes in GRHL2-depleted cells suggest that much of the transcriptional response is PR-independent, conflicting with the model (Figure 6). A more straightforward approach would be to define hormone-regulated genes in shControl cells and then examine their response in GRHL2-depleted cells. Finally, integrating chromatin accessibility and histone modification datasets (e.g., ATAC-seq, H3K27ac ChIP-seq) would help establish whether PR-GRHL2-bound regions correspond to active enhancers, providing stronger functional support for the proposed regulatory model.

      We thank the reviewer for pointing this out. We now recognize that our criteria for selecting PR/GRHL2 co-regulated genes were not clearly described. To address this, we have revised our approach as per the reviewer's suggestion: we first identified early and sustained PR target genes based on their response at 4 and 24 hours of induction and subsequently overlaid this list with the gene expression changes observed in GRHL2-depleted cells. This revised approach reduced the number of PR-responsive, GRHL2-regulated target genes from 549 to 298 (46% reduction). We consequently updated all subsequent analyses, resulting in revised figures 4 and 5 and supplementary figures 2, 3 and 4. As a result of this revised approach, the number of genes that are transcriptionally regulated by GRHL2 and PR (RNA-seq data) and that also harbor a PR loop anchor at or near their TSS after 30 minutes of progesterone stimulation (PR HiChIP data) dropped from 114 to 79 (30% reduction). We thank the reviewer for suggesting this more straightforward approach and want to emphasize that our overall conclusions remain unaltered.
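The revised selection logic can be sketched as a pair of set operations. This is an illustration only: reading "early and sustained" as "responsive at both 4 and 24 hours" is an assumption, the gene identifiers are placeholders, and the helper names `co_regulated` and `percent_reduction` are hypothetical. Only the counts 549, 298, 114 and 79 come from the response above.

```python
from typing import Iterable, Set


def co_regulated(pr_4h: Iterable[str],
                 pr_24h: Iterable[str],
                 grhl2_dependent: Iterable[str]) -> Set[str]:
    """Genes responsive to hormone at both time points (assumed definition of
    'early and sustained') that also change upon GRHL2 depletion."""
    pr_targets = set(pr_4h) & set(pr_24h)
    return pr_targets & set(grhl2_dependent)


def percent_reduction(before: int, after: int) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Placeholder gene identifiers, invented for illustration.
responsive_4h = {"GENE_A", "GENE_B", "GENE_C"}
responsive_24h = {"GENE_B", "GENE_C", "GENE_D"}
changed_in_shGRHL2 = {"GENE_C", "GENE_D", "GENE_E"}

selected = co_regulated(responsive_4h, responsive_24h, changed_in_shGRHL2)
print(sorted(selected))  # only GENE_C passes all three filters

# Counts quoted in the response above.
print(f"{percent_reduction(549, 298):.1f}%")  # co-regulated genes
print(f"{percent_reduction(114, 79):.1f}%")   # genes with a PR loop anchor
```

The two printed percentages come out at about 45.7% and 30.7%, in line with the rounded figures quoted in the response.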

      As in our response to R1.5 above, we want to emphasize that the model presented in figure 6 does not depict a generic or randomly chosen gene, but a gene that is specifically co-regulated by both GRHL2 and PR. We also want to emphasize that the majority of GRHL2's transcriptional activity is PR-independent. This is consistent with the limited fraction of GRHL2 that co-immunoprecipitated with PR (Figure 1D), and with the well-established roles of GRHL2 beyond steroid receptor signaling. In fact, the overall importance of GRHL2 for cell viability in T47D(S) cells is underscored by our inability to generate a full knockout (multiple failed attempts at CRISPR/Cas-mediated GRHL2 deletion in T47D(S) and MCF7 cells), and by the strong selection we observed against high-level GRHL2 knockdown using shRNA.

      As for the reviewer's suggestion to assess whether GRHL2/PR co-bound regions correspond to active enhancers by integrating H3K27ac and ATAC-seq data: We have re-analyzed publicly available H3K27ac and ATAC-seq datasets from T47D cells (references 42 and 43). These analyses have now been added to figure 2 (F and G). The H3K27ac profile suggests that GRHL2-PR overlapping sites indeed correspond to more active enhancers (Figure 2F), with a proposed role for GRHL2, since siGRHL2 affects the accessibility of these sites (Figure 2G).

      Minor comments Page 19: The statement that "PR and GRHL2 trigger extensive chromatin reorganization" is not experimentally supported. ATAC-seq would be an appropriate method to test this directly.

      We agree with the reviewer and have removed this sentence, as it does not contribute meaningfully to the flow of the manuscript.

      Prior literature on GRHL2 as a steroid receptor co-regulator should be discussed more thoroughly.

      We have now added additional literature on GRHL2 as a steroid hormone receptor co-regulator to the discussion (lines 397-401) and we cite the papers suggested by R1 in R1.1 (references 25 and 54).

      Reviewer #1 (Significance (Required)):

      The identification of novel PR co-regulators is an important objective, as the mechanistic basis of PR signaling in breast cancer remains incompletely understood. The main strength of this study lies in highlighting GRHL2 as a factor influencing PR genomic binding and transcriptional regulation, thereby expanding the repertoire of regulators implicated in PR biology.

      That said, the novelty is limited, given the established roles of GRHL2 in ER and AR regulation. Mechanistic insight is underdeveloped, and the reliance on an engineered T47DS model with supra-physiological PR levels reduces the general impact. Without validation in physiologically relevant breast cancer models and clearer separation of direct versus indirect effects, the overall advance remains modest.

      The manuscript will be of interest to a specialized audience in the fields of nuclear receptor signaling, breast cancer genomics, and transcriptional regulation. Broader appeal, including translational or clinical relevance, is limited in its current form.

      We have addressed all of these points in our response above and agree that with our implemented changes, this study should reach (and appeal to) an audience interested in transcriptional regulation, chromatin biology, hormone receptor signaling and breast cancer.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      The authors present a study investigating the role of GRHL2 in hormone receptor signaling. Previous research has primarily focused on GRHL2 interaction with estrogen receptor (ER) and androgen receptor (AR). In breast cancer, GRHL2 has been extensively studied in relation to ER, while its potential involvement with the progesterone receptor (PR) remains largely unexplored. This is the rationale of this study to investigate the relation between PR and GRHL2. The authors demonstrate an interaction between GRHL2 and PR and further explore this relationship at the level of genomic binding sites. They also perform GRHL2 knockdown experiments to identify target genes and link these transcriptional changes back to GRHL2-PR chromatin occupancy. However, several conceptual and technical aspects of the study require clarification to fully support the authors' conclusions.

      R2.1 Given the high sequence similarity among GRHL family members, this raises questions about the specificity of the antibody used for GRHL2 RIME. The authors should address whether the antibody cross-reacts with GRHL1 or GRHL3. For example, GRHL1 shows a higher log fold change than GRHL2 in the RIME data.

      Indeed, GRHL1, GRHL2, and GRHL3 are structurally related. They share a similar domain organization and are all ±70 kDa in size. Sequence similarity is primarily confined to the DNA-binding domain, with GRHL2 and GRHL3 showing 81% similarity in this region, and GRHL1 showing 63% similarity to GRHL2/3 (Ming, Nucleic Acids Res 2018).

      The antibody used, sourced from the Human Protein Atlas, is widely used in the field. It targets an epitope within the transactivation domain (TAD) of GRHL2, a region with relatively low sequence similarity to the corresponding domains in GRHL1 and GRHL3.

      We assessed the specificity of the antibody using western blotting (Supplementary Figure 2A) in T47DS wild-type and GRHL2 knockdown cells. As expected, GRHL2 protein levels were reduced in the knockdown cells, providing convincing evidence that the antibody recognizes GRHL2. The remaining signal in shGRHL2 knockdown cells could be due either to residual GRHL2 protein or to the antibody detecting GRHL1/3. Furthermore, the observed high log-fold enrichment of GRHL1 in our RIME may reflect known heterodimer formation between GRHL1 and GRHL2, rather than antibody cross-reactivity. As such, we cannot formally rule out cross-reactivity and have mentioned this in the limitations section (lines 497-501).

      R2.2 In addition, in RIME experiments, one would typically expect the bait protein to be among the most highly enriched proteins compared to control samples. If this is not the case, it raises questions about the efficiency of the pulldown, antibody specificity, or potential technical issues. The authors should comment on the enrichment level of the bait protein in their data to reassure readers about the quality of the experiment.

      We agree with the reviewer that this information is crucial for assessing the quality of the experiment. We have therefore added the enrichment levels (log₂ fold change of pulldown over IgG control) to the methods section (line 592).

      As the reviewer notes, GRHL2 was not among the top enriched proteins in our dataset. This is due to unexpectedly high background binding of GRHL2 to the IgG control antibody/beads, for which we currently have no explanation. As a result, although we detected many unique GRHL2 peptides, observed high sequence coverage (>70%), and GRHL2 ranked among the highest in both iBAQ and LFQ values, its relative enrichment was reduced due to the elevated background. During our RIME optimization, Coomassie blue staining of input and IP samples revealed a band at the expected molecular weight of GRHL2 in the pulldown samples that was absent in the IgG control (see figure 1 for the reviewer below, 4 right lanes), supporting the conclusion that GRHL2 is specifically enriched in our GRHL2 RIME samples. Combined with the enrichment of some of the expected interacting proteins (e.g. KMT2C and KMT2D), we are convinced that the experiment is of sufficient quality to support our conclusions.

      Figure 1 for reviewer: Coomassie blue staining of input and IP GRHL2 and IgG RIME samples. NT = non-treated, T = treated.

      R2.3 The authors report log2 fold changes calculated using iBAQ values for the bait versus IgG control pulldown. While iBAQ provides an estimate of protein abundance within samples, it is not specifically designed for quantitative comparison between samples without appropriate normalization. It would be helpful to clarify the normalization strategy applied and consider using LFQ intensities.

      We understand the reviewer's concern. Due to the high background observed in the IgG control sample (see R2.2), the LFQ-based normalization did not accurately reflect the enrichment of GRHL2, which was clearly supported by other parameters such as the number of unique peptides (see rebuttal Table 1). After discussions with our Mass Spectrometry facility, we decided to consider the iBAQ values (which reflect the absolute protein abundance within each sample) as a valid and informative measure of enrichment. In the context of elevated background levels, iBAQ provides an alternative and reliable metric for assessing protein enrichment and was therefore used for our interactor analysis.

      | Protein | Unique peptides | iBAQ GRHL2 | iBAQ IgG | LFQ GRHL2 | LFQ IgG |
      |---|---|---|---|---|---|
      | GRHL2 | 52 | 1753400.00 | 155355.67 | 5948666.67 | 3085700.00 |
      | GRHL1 | 23 | 56988.33 | 199.03 | 334373.33 | 847.23 |

      *Table 1. Unique peptides, iBAQ and LFQ values of the GRHL2 and IgG pulldowns for GRHL2 and GRHL1.*
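As a worked illustration of why the choice of metric matters here, the sketch below computes log2 fold changes (pulldown over IgG) directly from the values in Table 1. Treating the tabulated numbers as per-condition summaries and dividing them directly is an assumption; the authors' actual calculation may involve per-replicate values and additional normalization. The helper name `log2_fc` is hypothetical.

```python
import math

# Values copied from rebuttal Table 1: (GRHL2 pulldown, IgG control).
table1 = {
    "GRHL2": {"iBAQ": (1753400.00, 155355.67), "LFQ": (5948666.67, 3085700.00)},
    "GRHL1": {"iBAQ": (56988.33, 199.03), "LFQ": (334373.33, 847.23)},
}


def log2_fc(pulldown: float, control: float) -> float:
    """log2 ratio of pulldown intensity over control intensity."""
    return math.log2(pulldown / control)


for protein, metrics in table1.items():
    for metric, (pulldown, control) in metrics.items():
        print(f"{protein} {metric}: log2FC = {log2_fc(pulldown, control):.2f}")
# GRHL2 iBAQ: about 3.50   GRHL2 LFQ: about 0.95
# GRHL1 iBAQ: about 8.16   GRHL1 LFQ: about 8.62
```

Under this simple calculation the GRHL2 signal in the IgG control compresses its fold change (roughly 11-fold by iBAQ, under 2-fold by LFQ), whereas GRHL1, with almost no IgG background, shows a much larger ratio by either metric.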

      R2.4 Other studies have reported PR RIME, which could be a valuable source to investigate whether GRHL proteins were detected.

      We thank the reviewer for pointing this out. We are aware of the PR RIME, generated by Mohammed et al., which we refer to in the discussion (lines 390-391). This study indeed identified GRHL2 as a PR-interacting protein in MCF7 and T47D cells. Although they do not mention this interaction in the text, the interaction is clearly indicated in one of the figures from their paper, which supports our findings. To our knowledge, no other PR RIME datasets in MCF7 or T47D cells have been published to date.

      R2.5 In line 137, the term "protein score" is mentioned. Could the authors please clarify what this means and how it was calculated.

      We agree that this point was not clearly explained in the original text. The scores presented reflect the MaxQuant protein identification confidence, specifically the sum of peptide-level scores (from Andromeda), which indicates the relative confidence in protein detection. We have now added this clarification to line 137 and to the legend of Figure 1.

      R2.6 In line 140-141. The fact that GRHL2 interacts with chromatin remodelers does not by itself prove that GRHL2 acts as a pioneer factor or chromatin modulator. Demonstrating pioneer function typically requires direct evidence of chromatin opening or binding to closed chromatin regions (e.g., ATAC-seq, nucleosome occupancy assays). I recommend revising this statement or providing supporting evidence.

      We agree that the fact that GRHL2 interacts with chromatin remodelers does not by itself prove that GRHL2 acts as a pioneer factor or chromatin modulator. However, a previous study (Jacobs et al, Nature genetics, 2018) has shown directly that the GRHL family members (including GRHL2) have pioneering function and regulate the accessibility of enhancers. We adapted line 140-141 to state this more clearly. In addition, our newly added data in Figure 2G also support the fact that GRHL2 has a role in regulating chromatin accessibility in T47D cells.

      R2.7 The pulldown Western blot lacks an IgG control in panel D.

      This is correct. As the co-IP in Figure 1D served as a validation of the RIME and was specifically aimed at determining the effect of hormone treatment on the observed PR/GRHL2 interaction, we did not perform this control given the scale of the experiment. However, during RIME optimization, we performed GRHL2 staining of the IgG controls by western blot, shown in figure 2 for the reviewer below. As stated above, some background GRHL2 signal was observed in the IgG samples, but a clear enrichment is seen in the GRHL2 IP.

      Taken together, we believe that the well-controlled RIME, combined with the co-IP presented, provides strong evidence that the observed signal reflects a genuine GRHL2-PR interaction.

      Figure 2 for reviewer: WB of input and IP GRHL2 and IgG RIME samples stained for GRHL2. NT = non-treated, T = treated

      R2.8 Depending on the journal and target audience, it may be helpful to briefly explain what R5020 is at its first mention (line 146).

      Thank you. We have adapted this accordingly.

      R2.9 The authors state that three technical replicates were performed for each experimental condition. It would be helpful to clarify the expected level of overlap between biological replicates of RIME experiments. This clarification is necessary, especially given the focus on uniquely enriched proteins in untreated versus treated cells, and the observation that some identified proteins in specific conditions are not chromatin-associated. Replicates or validations would strengthen the findings.

      We use the term technical rather than biological replicates because, for cell lines, defining true biological replicates is challenging, as most variability arises from experimental rather than biological differences. To introduce some variation, we split our T47DS cells into three parallel dishes 5 days prior to starting the treatment. We did this deliberately, to minimize the likelihood that proteins identified as uniquely enriched are artifacts. Each of the three technical replicates comes from one of these three parallel splits (so equal passage numbers, but propagated in parallel dishes for 5 days before the start of the experiment).

      To generate the three technical replicates for our RIME, we plated cells from the splits grown in parallel. Treatments were performed separately for each replicate. Samples were crosslinked, harvested and lysed for subsequent RIME analysis; for technical and logistical reasons, the three replicates were processed in parallel. To clarify the experimental setup, we have updated the methods section accordingly (lines 566-568).

      As for the detection of non-chromatin-associated proteins: We cannot rule out that these are artifacts, as they may arise from residual cytosolic lysate during nuclear extraction. Alternatively, they could reflect a more dynamic subcellular localization of these proteins than currently annotated or appreciated.

      R2.10 The volcano plot for the RIME experiment appears to show three distinct clusters of proteins on the right, which is unusual for this type of analysis. The presence of these apparent groupings may suggest an artifact from the data processing, such as imputation. Can the authors clarify the origin of these groupings? If it is due to imputation or missing values, I recommend applying a stricter threshold, such as requiring detection in all three replicates (3/3) to improve the robustness of the enrichment analysis and increase confidence in the identified interactors.

      We thank the reviewer for pointing this out. As suggested, we re-evaluated the imputation and applied a stricter threshold, requiring detection in all three replicates. Indeed, the separate clusters were due to missing values; we have therefore revised the imputation method and now impute values based on the normal distribution. Using this revised analysis, we identify 2352 GRHL2 interactors instead of 1140, but the number of interacting proteins annotated as transcription factors or chromatin-associated/modifying proteins was still 103. Figure 1B, 1E, and Supplementary Figure 4A have been updated accordingly. We also revised the methods section to reflect this change. We think this suggestion has improved our analysis of the data and we thank the reviewer for pointing this out.
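The filtering and imputation steps described above can be sketched as follows. This is a hedged illustration, not the authors' code: applying the 3/3 requirement to at least one condition, drawing replacements per sample from a down-shifted and narrowed normal distribution, and the parameter values `shift=1.8` and `width=0.3` (common defaults in proteomics software such as Perseus) are all assumptions. The protein names and intensities are invented.

```python
import random
import statistics
from typing import Dict, List, Optional


def filter_and_impute(rows: Dict[str, List[Optional[float]]],
                      groups: Dict[str, List[int]],
                      min_valid: int = 3,
                      shift: float = 1.8,
                      width: float = 0.3,
                      seed: int = 0) -> Dict[str, List[Optional[float]]]:
    """Keep proteins detected in >= min_valid replicates of at least one
    condition, then fill missing log2 intensities per sample (column) from
    a normal distribution shifted towards low abundance."""
    rng = random.Random(seed)
    kept = {
        name: list(values)
        for name, values in rows.items()
        if any(sum(values[i] is not None for i in cols) >= min_valid
               for cols in groups.values())
    }
    n_cols = len(next(iter(kept.values()))) if kept else 0
    for j in range(n_cols):
        observed = [v[j] for v in kept.values() if v[j] is not None]
        if len(observed) < 2:
            continue  # too few values to estimate a spread; leave as missing
        mu = statistics.mean(observed)
        sd = statistics.stdev(observed)
        for values in kept.values():
            if values[j] is None:
                values[j] = rng.gauss(mu - shift * sd, width * sd)
    return kept


# Invented log2 intensities: columns 0-2 are the pulldown, 3-5 the IgG control.
groups = {"GRHL2": [0, 1, 2], "IgG": [3, 4, 5]}
rows = {
    "P1": [25.0, 25.2, 24.9, None, None, None],   # 3/3 in pulldown: kept
    "P2": [22.0, None, 22.3, None, 21.0, None],   # never 3/3: removed
    "P3": [27.1, 27.0, 27.3, 24.0, 24.2, 23.9],
    "P4": [23.0, 23.4, 23.1, 22.8, 22.5, None],
    "P5": [26.0, 26.1, 25.8, 23.5, 23.6, 23.4],
}
result = filter_and_impute(rows, groups)
print(sorted(result))
```

Because imputed values are drawn at random, a fixed seed is used so that the output is reproducible; the observed values themselves are left untouched.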

      R2.11 The statement that "PR and GRHL2 frequently overlap" may be overstated given that only ~700 overlapping sites are reported (cut&run).

      We have replaced "frequently overlap" by "can overlap" (lines 229-230).

      R2.12 The model in Figure 6 suggests limited chromatin occupancy of PR and GRHL2 in hormone-depleted conditions, consistent with the known requirement of ligand for stable PR-DNA binding. However, Figure 1 shows no major difference in GRHL2-PR interaction between untreated and hormone-treated cells. This raises questions about where and how this interaction occurs in the absence of hormone. Since PR binding to chromatin is typically minimal without ligand, can the authors clarify this, given that RIME data reflect chromatin-bound interactions?

      Indeed, the model in Figure 6 suggests limited chromatin occupancy of PR and GRHL2 under hormone-depleted conditions. It is, however, important to note that the locus shown represents a gene regulated by both PR and GRHL2, and not just any gene. We recognize that this was not sufficiently clear in the original version, and we have now clarified this in both the main text (lines 331-334) and the figure legend.

      We propose that PR and GRHL2 bind or become enriched at enhancer sites associated with their target genes upon ligand stimulation. This is consistent with the known requirement of ligand for stable PR-DNA binding and with our observation that PR/GRHL2 overlapping peaks are detected only in the ligand-treated condition of the CUT&RUN experiment. Given the broader role of GRHL2, it also binds chromatin independently of progesterone and the progesterone receptor, which is why we included, but did not focus on, GRHL2-only binding events in our model.

      We would also like to clarify that, although RIME includes a nuclear enrichment step that enriches for chromatin-associated proteins, the pulldown is performed on nuclear lysates. Therefore, it captures both chromatin-bound protein complexes and freely soluble nuclear complexes, which unfortunately cannot be distinguished. GRHL2 is well established as a nuclear protein (Zeng et al., Cancers 2024; Riethdorf et al., International Journal of Cancer 2015), and although PR is classically described as translocating to the nucleus upon hormone stimulation, several studies, including our own, have shown that PR is continuously present in the nucleus (Aarts et al., J Mammary Gland Biol Neoplasia 2023; Frigo et al., Essays Biochem. 2021).

      We therefore propose that PR and GRHL2 may already interact in the nucleus without directly binding to chromatin. Given our observation that GRHL2 binding sites on the chromatin are redistributed upon R5020-mediated signaling activation, we hypothesize that such pre-formed PR-GRHL2 nuclear complexes may assist the rapid recruitment of GRHL2 to progesterone-responsive chromatin regions.

      We have expanded the discussion to include a dedicated section addressing this point (lines 376-388).

      R2.13 It would be of interest to assess the overlap between the proteins identified in the RIME experiment and the motif analysis results.

      In the discussion section of our original manuscript, we highlighted some overlapping proteins in the RIME and motif analysis, including STAT6 and FOXA1. However, we had not yet systematically analyzed overlap in both analyses. To address this, we now compared all enriched motifs (so not only the top 5 as displayed in our figures) under GRHL2, PR, and GRHL2/PR shared sites from both the CUT&RUN and ChIP-seq datasets with the proteins identified as GRHL2 interactors in our RIME. Although we identified numerous GRHL2-associated proteins, relatively few of them were transcription factors whose binding motifs were also enriched under GRHL2 peaks.

      In our revised manuscript, we have added a section to the discussion that describes this systematic comparison of the RIME results with the motif enrichment from the ChIP-seq and CUT&RUN analyses (lines 415-436).

      R2.14 The authors chose CUT&RUN to assess chromatin binding of PR and GRHL2. Given that RIME is also based on chromatin immunoprecipitation - ChIP protocol, it would be helpful to clarify why CUT&RUN was selected over ChIP-seq for the DNA-binding assays. What is the overlap with published data?

      As also mentioned in our response to R1.3 and R1.5, we deliberately chose the CUT&RUN approach to minimize artifacts introduced by crosslinking and sonication, thereby reducing background and allowing the identification of high-confidence, direct DNA-binding events. Since GRHL2 physically interacts with PR, ChIP-seq could potentially capture indirect binding of GRHL2 at PR-bound sites (and vice versa). In contrast, CUT&RUN primarily detects direct DNA-protein interactions, providing a more specific and accurate binding profile. Additionally, CUT&RUN serves as an independent validation method for data obtained using ChIP-like protocols.

      Since CUT&RUN, similar to ChIP, can show limited reproducibility (Nordin et al., Nucleic Acids Research, 2024), and to our knowledge few PR CUT&RUN and no GRHL2 CUT&RUN datasets are currently available, it is challenging to directly compare our data with published datasets. Nevertheless, studies performing PR or ER CUT&RUN (Gillis et al., Cancer Research, 2024; Reese et al., Molecular and Cellular Biology, 2022) report a comparable number of peaks (in the same range of thousands) as observed in our data. This suggests that a single CUT&RUN experiment in general may detect fewer events than a single ChIP-seq experiment, but that the peaks that are found are likely to reflect direct binding events.

      Reviewer #2 (Significance (Required)):

      General Assessment: This study investigates the role of the transcription factor GRHL2 in modulating PR function, using RIME and CUT&RUN to explore protein-protein and protein-chromatin interactions. GRHL2 has been implicated in epithelial biology and transcriptional regulation, and interactions with steroid hormone receptors have been reported. This study extends the field by showing a functional link between GRHL2 and PR, which has implications for understanding hormone-dependent gene regulation.

      The research will primarily interest a specialized audience in transcriptional regulation, chromatin biology, and hormone receptor signaling.

      Key words for this reviewer: chromatin biology, transcription factor function, epigenomics, and proteomics.

      We agree that with our implemented changes, this study should reach (and appeal to) an audience interested in transcriptional regulation, chromatin biology, hormone receptor signaling and breast cancer.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      This study explores the important transcriptional coordination role of Grainyhead-like 2 (GRHL2) on the transcriptional regulatory function of progesterone receptor (PR). In this paper, the authors start with their recruitment characteristics, take into account their regulatory effects on downstream genes and their effects on the occurrence and development of breast cancer, and further clarify the coordination between them in three-dimensional space. The interaction between GRHL2 and PR, and the subsequent important influence on the co-regulated genes by GRHL2 and PR are analyzed. The overall framework of this study is mainly by RNA seq and CUT-TAG analysis, the molecular mechanism underlying the association between GRHL2 and PR and regulation function of two proteins in breast cancer is not clearly clarified. Some details need to be further improved:

      Major comments: R3.1 For Fig.1D, the molecular weight of each protein should be marked in the diagram, and the expression of GRHL2 in the input group should be supplemented.

      We apologize for not including molecular weights in our initial submission. We are not entirely clear what the reviewer means by their statement that "the expression of GRHL2 in the input group should be supplemented". The blot depicted in Figure 1D shows both the input signal and the IP. For the reviewer's information, the full Western blot is depicted below.

      Figure 3 for reviewer: Full WBs of input and IP GRHL2 samples stained for GRHL2 or PR. NT = non-treated, T = treated

      R3.2 In Fig. 2B and Fig. 5C, it should be described clearly whether GRHL2 recruitment is in the absence or presence of R5020. Is the region co-occupied by PR and GRHL2 a promoter or an enhancer region? It would be better to show histone marks such as H3K27ac and H3K4me1 to annotate the enhancer region.

      As also stated in our response to R1.3, we acknowledge that the ChIP-seq experiments cannot definitively determine whether GRHL2 and PR co-occupy genomic regions under ligand-stimulated conditions, since the GRHL2 dataset was generated in the absence of progesterone stimulation (as indicated in lines 167-169). To clarify this, we have now specified this detail in the legend of Figure 2 by noting "untreated GRHL2 ChIP." To directly assess GRHL2 chromatin binding under progesterone-stimulated conditions, we performed CUT&RUN experiments for both GRHL2 and PR under untreated and R5020-treated conditions. These experiments revealed a subset of overlapping PR and GRHL2 binding sites (approximately 5% of all identified PR peaks).

      In our original manuscript, we performed genomic annotation of the GRHL2, PR, and GRHL2/PR overlapping peaks (Figure 2E) and found that most of these sites were located in intergenic regions, where enhancers are typically found, with ~20% located in promoter regions. We appreciate the reviewer's suggestion to further overlap the ChIP-seq peaks with histone marks such as H3K27ac and H3K4me1. We have now incorporated publicly available ATAC-seq and H3K27ac ChIP datasets in our revised manuscript (as also suggested by Reviewer 1) and find that shared GRHL2/PR sites are indeed located in active enhancer regions marked by H3K27ac (see Figure 2F). Additionally, as expected, we find that GRHL2/PR overlapping sites are enriched at open chromatin (Figure 2G).

      R3.3 What is the biological function analysis by KEGG or GO analysis for the overlapping genes from VN plots of RNA-seq with CUT-TAG peaks? The genes co-regulated by GRHL2 and PR should be further determined.

      For us, it is not entirely clear what reviewer 3 is asking here, but we can explain the following: as it is challenging to integrate HiChIP with CUT&RUN, due to the fundamentally different nature of the two techniques, we chose not to directly assign genes to CUT&RUN peaks. However, we did carefully link the GRHL2, PR, and GRHL2/PR ChIP-seq peaks to their target genes by integrating chromatin looping data from a PR HiChIP analysis. The result from this analysis is depicted in Figure 4B.

      As suggested by this reviewer, we also performed a GO-term analysis on the 79 genes that are regulated by both GRHL2 and PR (we now have 79 genes after the re-analysis as suggested in R1.6). The corresponding results are provided for the reviewer in Figure 4 of this rebuttal (below). As this additional analysis does not provide further biological insight beyond what is already presented in Figure 4C, we decided not to include this figure in the manuscript.

      Figure 4 for reviewer: GO-term analysis on the 79 GRHL2-PR co-regulated genes that are transcriptionally regulated by GRHL2 and PR and that also harbor a PR HiChIP loop anchor at or near their TSS

      R3.4 Western blotting should be performed to determine the protein levels of downstream genes co-regulated by GRHL2 and PR in the absence or presence of R5020.

      We agree that determining the response of co-regulated genes is important. Therefore, in Figure 4D, we present three representative examples of genes that are directly co-regulated by GRHL2 and PR, specifically genes that are differentially expressed after 4 hours of R5020 exposure. Although protein levels of the targets are of functional importance, GRHL2 and PR are transcription factors whose immediate effects are primarily exerted at the level of gene transcription. Therefore, in our opinion, changes in mRNA abundance provide the most direct and mechanistically relevant readout of their regulatory activity.

      R3.5 The authors mention that this study positions GRHL2 as a crucial modulator of steroid hormone receptor function, but they do not provide evidence for how GRHL2 regulates PR-mediated transactivation, or for the subcellular distribution of these two proteins in breast cancer cells.

      We agree that while our RNA-seq data demonstrate that GRHL2 modulates the expression of PR target genes, and our CUT&RUN experiments show that GRHL2 chromatin binding is reshaped upon R5020 exposure, we have not yet further dissected the molecular mechanism by which GRHL2 functions as a PR co-regulator.

      As also mentioned in our response to R1.2, we did consider several follow-up experiments to address this, including PR CUT&RUN in GRHL2 knockdown cells, CUT&RUN for known co-activators such as KMT2C/D and P300, as well as functional studies involving GRHL2 TAD and DBD mutants. However, due to technical and logistical challenges, we were unable to carry out these experiments within the timeframe of this study.

      That said, we fully recognize that such approaches would provide deeper mechanistic insight into the interplay between PR and GRHL2. We have therefore explicitly acknowledged this limitation in our limitations of the study section (lines 502-507) and consider it an important avenue for future investigation.

      Regarding the subcellular distribution in breast cancer cells: As also mentioned in our response to R2.12, GRHL2 is well established as a nuclear protein (Zeng et al., Cancers 2024; Riethdorf et al., International Journal of Cancer 2015), and although PR is classically described as translocating to the nucleus upon hormone stimulation, several studies, including our own, have shown that PR is continuously present in the nucleus (Aarts et al., J Mammary Gland Biol Neoplasia 2023; Frigo et al., Essays Biochem. 2021). Thus, both proteins mostly reside in the nucleus in breast (cancer) cells both in the absence and presence of hormone stimulation, but dynamic subcellular shuttling is likely to occur.

      Minor comments: Please describe in more detail the relationship between PR and GRHL2 binding independent of the hormone in the discussion section.

      As also mentioned in our response to R2.12, we have expanded the discussion to include a dedicated section addressing this point (lines 376-388).

      Reviewer #3 (Significance (Required)):

      Advance: Compare the study to existing published knowledge, it fills a gap. The authors provide RNA seq and CUT-TAG sequence analysis to show the recruitment of GRHL2 and PR and the co-regulated genes in the absence or presence of progesterone.

      Audience: breast surgery will be interested, and the audiences will cover clinical and basic research.

      My expertise is focused on the epigenetic modulation of steroid hormone receptors in the related cancers, such as breast cancer, prostate cancer, and endometrial carcinoma.

      We agree that with our implemented changes, this study should reach (and appeal to) an audience interested in transcriptional regulation, chromatin biology, hormone receptor signaling and breast cancer.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review): 

      The medicinal leech preparation is an amenable system in which to understand how the underlying cellular networks for locomotion function. A previously identified non-spiking neuron (NS) was studied and found to alter the mean firing frequency of a crawl-related motoneuron (DE-3), which fires during the contraction phase of crawling. The data are mostly solid. Identifying upstream neurons responsible for crawl motor patterning is essential for understanding how rhythmic behavior is controlled.

      Review of Revision: 

      On a positive note, the rationale for the study is clearer to me now after reading the authors' responses to both reviewers, but that information, as described in the authors' responses, is minimally incorporated into the current revised paper. Incorporating a discussion of previous work on the NS cell has, indeed, improved the paper. 

      I suggested earlier that the paper be edited for clarity but not much text has been changed since the first draft. I will provide an example of the types of sentences that are confusing. The title of the paper is: "Phase-specific premotor inhibition modulates leech rhythmic motor output". Are the authors referring to the inhibition created by premotor neurons (e.g., on to the motoneurons) or the inhibition that the premotor neurons receive? 

      In this case, this is an interesting ambiguity: NS is inhibited and that inhibition is directly transmitted to the motoneurons because both cells are electrically coupled. We believe that the title does not disguise the findings conveyed by the manuscript.

      I also find the paper still confusing with regard to the suggested "functional homology" with the vertebrate Renshaw cells. When the authors set up this expectation of homology (should be analogy) in the introduction and other sections of the paper, one would assume that the NS cell would be directly receiving excitation from a motoneuron (like DE-3) and, in turn, the motoneuron would then receive some sort of inhibitory input to regulate its firing frequency. Essentially, I have always viewed the Renshaw cells as nature's clever way to monitor the ongoing activity of a motoneuron while also providing recurrent feedback or "recurrent inhibition" to modify that cell's excitatory state. The authors present their initial idea below on line 62. Authors write: "These neurons are present as bilateral pairs in each segmental ganglion and are functional homologs of the mammalian Renshaw cells (Szczupak, 2014). These spinal cord cells receive excitatory inputs from motoneurons and, in turn, transmit inhibitory signals to the motoneurons (Alvarez and Fyffe, 2007)." 

      We agree with Reviewer #2: the correct term is "analogous," not "homologous." Thanks for pointing this out. We changed the term throughout the text.

      The Reviewer is also right in their appreciation of the role of Renshaw cells. NS plays exactly the role that the Reviewer describes. The ONLY difference is that NS is inhibited by the motoneurons, and in turn transmits this inhibition to the motoneurons via the rectifying electrical junctions. To address the confusion that our description caused, we have modified the cited sentence accordingly, now in lines 65-67.

      Minor note:

      I suggest re-writing this last sentence as "these" is confusing. Change to: 'In the spinal cord, Renshaw interneurons receive excitatory inputs from motoneurons and, in turn, transmit inhibitory signals to them (Alvarez and Fyffe, 2007).'

      Please, see the changes mentioned above.

      Furthermore, the authors note that (line 69 on): "In the context of this circuit the activity of excitatory motoneurons evokes chemically mediated inhibitory synaptic potentials in NS. Additionally, the NS neurons are electrically coupled......In physiological conditions this coupling favors the transmission of inhibitory signals from NS to motoneurons." Based on what is being conveyed here, I see a disconnect with the "functional homology" being presented earlier. I may be missing something, but the Renshaw analogy seems to be quite different compared to what looks like reciprocal inhibition in the leech. If the authors want to make the analogy to Renshaw cells clearer, then they should make a simple ball and stick diagram of the leech system and visually compare it to the Renshaw/motoneuron circuit with regard to functionality. This simple addition would help many readers. 

      We have simplified the description regarding the Renshaw cell (lines 65-67) to avoid the “details” of the connectivity between the two circuits.

      This report focuses on NS neurons and their role in crawling; we mention the analogy with Renshaw cells to widen the interest of the results. We do not think that making a special diagram to compare how the two neurons play a similar role via different connections among the players is useful in the context of this manuscript.

      The Abstract, Authors write (line 19), "Specifically, we analyzed how electrophysiological manipulation of a premotor nonspiking (NS) neuron, that forms a recurrent inhibitory circuit (homologous to vertebrate Renshaw cells)...."

      First, a circuit would not be homologous to a cell, and the term homology implies a strict developmental/evolutionary commonality. At best, I would use the term functionally analogous but even then I am still not sure that they are functionally that similar (see comments above). 

      Reviewer #2 is right. We changed the sentence in line 20.

      Line 22: "The study included a quantitative analysis of motor units active throughout the fictive crawling cycle that shows that the rhythmic motor output in isolated ganglia mirrors the phase relationships observed in vivo." This sentence must be revised to indicate that not all of the extracellular units were demonstrated to be motor units. Revise to: "The study included a quantitative analysis of identified and putative motor units active throughout the fictive crawling cycle that shows.....' 

      Line 187 regarding identifying units as motoneurons: Authors write, "While multiple extracellular recordings have been performed previously (Eisenhart et al., 2000), these results (Figure 4) present the first quantitative analysis of motor units activated throughout the crawling cycle in this type of recordings." The authors cannot assume that the units in the recorded nerves belong only to motoneurons. Based on their first rebuttal, the authors seem to be reluctant to accept the idea that the extracellularly recorded units might represent a different class of neurons. They admit that some sensory neurons (with somata located centrally) do, indeed, travel out the same nerves recorded, but go on to explain why they would not be active. 

      The leech has a variety of sensory organs that are located in the periphery, and some of these sensory neurons do show rhythmic activity correlated with locomotor activity (see Blackshaw's early work). The numerous stretch receptors, in fact, have very large axons that pass through all the nerves recorded in the current paper. 

      In Fig. 4, it is interesting that the waveforms of all the units recorded in the PP nerve exhibit a reversal in waveform as compared to those in the DP nerve, which might indicate (based on bipolar differential recording) that the units in the PP nerve are being propagated in the opposite direction (i.e., are perhaps afferent). Rhythmic presynaptic inhibition and excitation is commonly seen for stretch receptors within the CNS (see the work of Burrows) and many such cells are under modulatory control. 

      Most likely, the majority of the units are from motoneurons, but we do not really know at this point. The authors should reframe their statements throughout the paper as: 'While multiple extracellular recordings have been performed previously (Eisenhart et al., 2000), these results (Figure 4) present the first quantitative analysis of multiple extracellular units, using spike sorting methods, which are activated throughout the crawling cycle.' In cases where the identity of the unit is known, then it is fine to state that, but when the identity of the unit is not known, then there should be some qualification and stated as 'putative motor units' 

      We understand the concern of Reviewer #2 regarding the type of neurons active during dopamine-induced crawling in isolated ganglia. However, we believe there is sufficient evidence to support that the recorded spikes originate from motoneurons. As readers may share the same concern, we have added a paragraph explaining why spikes from somatic sensory neurons such as P or T cells, or from stretch receptors, are unlikely to contribute (lines 206-214). We included the term putative in the abstract.

      The Methods section:

      Needs to include the full parameters that were used to assess whether bursting activity was qualified in ways to be considered crawling activity or not. Typically, crawl-like burst periods of no more than 25 seconds have been the limit for their qualification as crawling activity. In Fig 2F, for example, the inter-burst period is over 35 seconds; that coupled with an average 5 second burst duration would bring the burst period to 40 seconds, which is substantially out of range for there to be bursting relevant to crawl activity. Simply put, long DE-3 burst periods are often observed but may not be indicative of a crawl state as the CV motoneurons are no longer out of phase with DE-3. A number of papers have adopted this criterion. 

      We now indicate in the methods the range of period values measured in our experiments. For the reviewer's information, we show here histograms depicting the variability of period and duty cycle values recorded in our experiments (control conditions). The Reviewer can see that the bursting activity of DE-3 falls within what has been published.

      Author response image 1.

      Crawling in isolated ganglia. A. Histogram of end-to-end periods during crawling in isolated ganglia. The dotted line indicates the mean obtained from the averages of all experiments. The solid black line represents the mean of all cycles across all experiments. B. As in A, for the duty cycle calculated using end-to-end periods. (n = 210 cycles from 45 ganglia obtained from 32 leeches in all cases).
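
      The period arithmetic discussed above (a 35-second inter-burst interval plus a 5-second burst gives a 40-second period, beyond the 25-second limit the Reviewer cites) can be made explicit with a small sketch. It is an illustration only and assumes that each burst is given as an (onset, offset) pair in seconds, that the end-to-end period runs from the end of one burst to the end of the next, and that the duty cycle is the duration of the later burst divided by that period. The function name and the exact duty-cycle convention are assumptions, not taken from the manuscript.

```python
def burst_cycles(bursts, max_period=25.0):
    """Compute end-to-end period and duty cycle for consecutive bursts.

    bursts:     list of (onset, offset) times in seconds, ordered in time
    max_period: hypothetical cutoff (s) above which a cycle is not counted
                as crawl-like, following the 25 s limit cited by the Reviewer
    """
    cycles = []
    for (_, prev_off), (on, off) in zip(bursts, bursts[1:]):
        period = off - prev_off            # end of one burst to end of the next
        duty_cycle = (off - on) / period   # fraction of the period spent bursting
        cycles.append(
            {
                "period": period,
                "duty_cycle": duty_cycle,
                "crawl_like": period <= max_period,
            }
        )
    return cycles
```

      For the Reviewer's example, `burst_cycles([(0.0, 5.0), (40.0, 45.0)])` yields a single cycle with a 40-second period, which the 25-second cutoff would exclude.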

      Reviewer #1 (Recommendations for the authors): 

      Minor comments-

      Line 100: "In the frame of the recurrent inhibitory circuit, NS is the target of inhibitory signals". Suggestion: 'Within the framework of the recurrent inhibitory circuit, NS is the target of inhibitory signals.' 

      Changed as suggested (line 107).

      Line 163: "This series of experiments proves that, as predicted based on the known circuit (Figure 1C), inhibitory signals onto NS premotor neurons were transmitted to DE-3 motoneurons and counteracted their excitatory drive during crawling, limiting their firing frequency". I think this sentence is too strong plus needs some editing. Suggestion: 'As predicted based on the known circuit (Figure 1C), this series of experiments indicates that inhibitory signals onto NS premotor neurons are transmitted to DE-3 motoneurons, thus limiting their firing frequency and counteracting their excitatory drive during crawling.'

      Changed as suggested.

      Lines 86, 292 and 304 and Fig 4 legend: "Different from DE-3, In-Phase units showed a marked decrease in the maximum bFF along time." Suggestion: Replace the word "along" with 'across' time. Also replace those words in the Fig 4 legend and Line 80...."along" (replace with 'across') the different stages of crawling. 

      Changed as suggested.

      Line 311: "bursts and a concurrent inhibitory input via NS (Figure 7). Coherent with this interpretation, the activity level of the Anti-Phase units was not influenced by these inhibitory signals". Suggestion: Replace the word "coherent" with 'consistent'. 

      Changed as suggested.

      Line 332: "...offer the particular advantage of allowing electrical manipulation of individual neurons in wildtype adults," I am unsure what the authors are attempting to convey. Not sure what they mean by "wildtype" in this context and why that would matter. 

      The term “wildtype” was removed.

      We thank Reviewer #2 for the suggested edits to the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:

      This study advances the lab's growing body of evidence exploring higher-order learning and its neural mechanisms. They recently found that NMDA receptor activity in the perirhinal cortex was necessary for integrating stimulus-stimulus associations with stimulus-shock associations (mediated learning) to produce preconditioned fear, but it was not necessary for forming stimulus-shock associations. On the other hand, basolateral amygdala NMDA receptor activity is required for forming stimulus-shock memories. Based on these facts, the authors assessed: (1) why the perirhinal cortex is necessary for mediated learning but not direct fear learning, and (2) the determinants of perirhinal cortex versus basolateral amygdala necessity for forming direct versus indirect fear memories. The authors used standard sensory preconditioning and variants designed to manipulate the novelty and temporal relationship between stimuli and shock and, therefore, the attentional state under which associative information might be processed. Under experimental conditions where information would presumably be processed primarily in the periphery of attention (temporal distance between stimulus/shock or stimulus pre-exposure), perirhinal cortex NMDA receptor activation was required for learning indirect associations. On the other hand, when information would likely be processed in focal attention (novel stimulus contiguous with shock), basolateral amygdala NMDA activity was required for learning direct associations. Together, the findings indicate that the perirhinal cortex and basolateral amygdala subserve peripheral and focal attention, respectively. The authors provide support for their conclusions using careful, hypothesis-driven experimental design, rigorous methods, and integrating their findings with the relevant literature on learning theory, information processing, and neurobiology. Therefore, this work will be highly interesting to several fields.

      Strengths:

      (1) The experiments were carefully constructed and designed to test hypotheses that were rooted in the lab's previous work, in addition to established learning theory and information processing background literature.

      (2) There are clear predictions and alternative outcomes. The provided table does an excellent job of condensing and enhancing the readability of a large amount of data.

      (3) In a broad sense, attention states are a component of nearly every behavioral experiment. Therefore, identifying their engagement by dissociable brain areas and under different learning conditions is an important area of research.

      (4) The authors clearly note where they replicated their own findings, report full statistical measures, effect sizes, and confidence intervals, indicating the level of scientific rigor.

      (5) The findings raise questions for future experiments that will further test the authors' hypotheses; this is well discussed.

      Weaknesses:

      As a reader, it is difficult to interpret how first-order fear could be impaired while preconditioned fear is intact; it requires a bit of "reading between the lines".

      We appreciate the Reviewer’s point and have attempted to address it on lines 55-63 of the revised paper: “In a recent pair of studies, we extended these findings in two ways. First, we showed that S1 does not just form an association with shock in stage 2; it also mediates an association between S2 and the shock. Thus, S2 enters testing in stage 3 already conditioned, able to elicit fear responses (Wong et al., 2019). Second, we showed that this mediated S2-shock association requires NMDAR-activation in the PRh, as well as communication between the PRh and BLA (Wong et al., 2025). These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      Reviewer #2 (Public review):

      Summary:

      This paper continues the authors' research on the roles of the basolateral amygdala (BLA) and the perirhinal cortex (PRh) in sensory preconditioning (SPC) and second-order conditioning (SOC). In this manuscript, the authors explore how prior exposure to stimuli may influence which regions are necessary for conditioning to the second-order cue (S2). The authors perform a series of experiments which first confirm prior results shown by the authors - that NMDA receptors in the PRh are necessary in SPC during conditioning of the first-order cue (S1) with shock to allow for freezing to S2 at test; and that NMDA receptors in the BLA are necessary for S1 conditioning during the S1-shock pairings. The authors then set out to test the hypothesis that the PRh encodes associations in a peripheral state of attention, whereas the BLA encodes associations in a focal state of attention, similar to the A1 and A2 states in Wagner's theory of SOP. To do this, they show that BLA is necessary for conditioning to S2 when the S2 is first exposed during a serial compound procedure - S2-S1-shock. To determine whether pre-exposure of S2 will shift S2 to a peripheral focal state, the authors run a design in which S2-S1 presentations are given prior to the serial compound phase. The authors show that this restores NMDA receptor activity within the PRh as necessary for the fear response to S2 at test. They then test whether the presence of S1 during the serial compound conditioning allows the PRh to support the fear responses to S2 by introducing a delay conditioning paradigm in which S1 is no longer present. The authors find that PRh is no longer required and suggest that this is due to S2 remaining in the primary focal state.

      Strengths:

      As with their earlier work, the authors have performed a rigorous series of experiments to better understand the roles of the BLA and PRh in the learning of first- and second-order stimuli. The experiments are well-designed and clearly presented, and the results show definitive differences in functionality between the PRh and BLA. The first experiment confirms earlier findings from the lab (and others), and the authors then build on their previous work to more deeply reveal how these regions differ in how they encode associations between stimuli. The authors have done a commendable job of pursuing these questions.

      Table 1 is an excellent way to highlight the results and provide the reader with a quick look-up table of the findings.

      Weaknesses:

      The authors have attempted to resolve the question of the roles of the PRh and BLA in SPC and SOC, which the authors have explored in previous papers. Laudably, the authors have produced substantial results indicating how these two regions function in the learning of first- and second-order cues, providing an opportunity to narrow in on possible theories for their functionality. Yet the authors have framed this experiment in terms of an attentional framework and have argued that the results support this particular framework and hypothesis - that the PRh encodes peripheral and the BLA encodes focal states of learning. This certainly seems like a viable and exciting hypothesis, yet I don't see why the results have been completely framed and interpreted this way. It seems to me that there are still some alternative interpretations that are plausible and should be included in the paper.

      We appreciate the Reviewer’s point and have attempted to address it on lines 566-594 of the Discussion: “An additional point to consider in relation to Experiments 3A, 3B, 4A and 4B is the level of surprise that rats experienced following presentations of the familiar S2 in stage 2. Specifically, in Experiments 3A and 3B, S2 was followed by the expected S1 (low surprise) and its conditioning required activation of NMDA receptors in the PRh and not the BLA. By contrast, in Experiments 4A and 4B, S2 was followed by omission of the expected S1 (high surprise) and its conditioning required activation of NMDA receptors in the BLA and not the PRh. This raises the possibility that surprise, or prediction error, also influences the way that S2 is processed in focal and peripheral states of attention. When prediction error is low, S2 is processed in the peripheral state of attention: hence, learning under these circumstances requires NMDA receptor activation in the PRh and not the BLA. By contrast, when prediction error is high, S2 is preserved in the focal state of attention: hence, learning under these circumstances requires NMDA receptor activation in the BLA and not the PRh. The impact of prediction error on the processing of S2 could be assessed using two types of designs. In the first design, rats are pre-exposed to S2-S1 pairings in stage 1 and this is followed by S2-S3-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is followed by surprise in omission of S1 and presentation of S3. Thus, if a large prediction error maintains processing of the familiar S2 in the BLA, we might expect that its conditioning in this design would require NMDA receptor activation in the BLA (in contrast to the results of Experiment 3B) and no longer require NMDA receptor activation in the PRh (in contrast to the results of Experiment 3A). 
In the second design, rats are pre-exposed to S2 alone in stage 1 and this is followed by S2-[trace]-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is not followed by the surprising omission of any stimulus. Thus, if a small prediction error shifts processing of the familiar S2 to the PRh, we might expect that its conditioning in this design would no longer require NMDA receptor activation in the BLA (in contrast to the results of Experiment 4B) but, instead, require NMDA receptor activation in the PRh (in contrast to the results of Experiment 4A). Future studies will use both designs to determine whether prediction error influences the processing of S2 in the focus versus periphery of attention and, thereby, whether learning about this stimulus requires NMDA receptor activation in the BLA or PRh.”

      Reviewer #3 (Public review):

      Summary:

      This manuscript presents a series of experiments that further investigate the roles of the BLA and PRh in sensory preconditioning, with a particular focus on understanding their differential involvement in the association of S1 and S2 with shock.

      Strengths:

      The motivation for the study is clearly articulated, and the experimental designs are thoughtfully constructed. I especially appreciate the inclusion of Table 1, which makes the designs easy to follow. The results are clearly presented, and the statistical analyses are rigorous. My comments below mainly concern areas where the writing could be improved to help readers more easily grasp the logic behind the experiments.

      Weaknesses:

      (1) Lines 56-58: The two previous findings should be more clearly summarized. Specifically, it's unclear whether the "mediated S2-shock" association occurred during Stage 2 or Stage 3. I assume the authors mean Stage 2, but Stage 2 alone would not yet involve "fear of S2," making this expression a bit confusing.

      We apologise for the confusion and have revised the summary of our previous findings on lines 55-63. The revised text now states: “In a recent pair of studies, we extended these findings in two ways. First, we showed that S1 does not just form an association with shock in stage 2; it also mediates an association between S2 and the shock. Thus, S2 enters testing in stage 3 already conditioned, able to elicit fear responses (Wong et al., 2019). Second, we showed that this mediated S2-shock association requires NMDAR-activation in the PRh, as well as communication between the PRh and BLA (Wong et al., 2025). These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      (2) Line 61: The phrase "Pavlovian fear conditioning" is ambiguous in this context. I assume it refers to S1-shock or S2-shock conditioning. If so, it would be clearer to state this explicitly.

      Apologies for the ambiguity - we have omitted the term “Pavlovian” which may have been the source of confusion: The revised text on lines 60-63 now states: “These findings raise two critical questions: 1) why is the PRh engaged for mediated conditioning of S2 but not for direct conditioning of S1; and 2) more generally, what determines whether the BLA and/or PRh is engaged for conditioning of the S1 and/or S2?”

      (3) Regarding the distinction between having or not having Stage 1 S2-S1 pairings, is "novel vs. familiar" the most accurate way to frame this? This terminology could be misleading, especially since one might wonder why S2 couldn't just be presented alone on Stage 1 if novelty is the critical factor. Would "outcome relevance" or "predictability" be more appropriate descriptors? If the authors choose to retain the "novel vs. familiar" framing, I suggest providing a clear explanation of this rationale before introducing the predictions around Line 118.

      We have incorporated the suggestion regarding “predictability” while also retaining “novelty” as follows. 

      L76-85: “For example, different types of arrangements may influence the substrates of conditioning to S2 by influencing its novelty and/or its predictive value at the time of the shock, on the supposition that familiar stimuli are processed in the periphery of attention and, thereby, the PRh (Bogacz & Brown, 2003; Brown & Banks, 2015; Brown & Bashir, 2002; Martin et al., 2013; McClelland et al., 2014; Morillas et al., 2017; Murray & Wise, 2012; Robinson et al., 2010; Suzuki & Naya, 2014; Voss et al., 2009; Yang et al., 2023) whereas novel stimuli are processed in the focus of attention and, thereby, the amygdala (Holmes et al., 2018; Qureshi et al., 2023; Roozendaal et al., 2006; Rutishauser et al., 2006; Schomaker & Meeter, 2015; Wright et al., 2003).”

      L116-120: “Subsequent experiments then used variations of this protocol to examine whether the engagement of NMDAR in the PRh or BLA for Pavlovian fear conditioning is influenced by the novelty/predictive value of the stimuli at the time of the shock (second implication of theory) as well as their distance or separation from the shock (third implication of theory; Table 1).”

      (4) Line 121: This statement should refer to S1, not S2.

      (5) Line 124: This one should refer to S2, not S1.

      We have checked the text on these lines for errors and confirmed that the statements are correct. The lines encompassing this text (L121-130) are reproduced here for convenience:

      (1) When rats are exposed to novel S2-S1-shock sequences, conditioning of S2 and S1 will be disrupted by a DAP5 infusion into the BLA but not into the PRh (Experiments 2A and 2B);

      (2) When rats are exposed to S2-S1 pairings and then to S2-S1-shock sequences, conditioning of S2 will be disrupted by a DAP5 infusion into the PRh but not the BLA whereas conditioning of S1 will be disrupted by a DAP5 infusion into the BLA not the PRh (Experiments 3A and 3B);

      (3) When rats are exposed to S2-S1 pairings and then to S2 (trace)-shock pairings, conditioning of S2 will be disrupted by a DAP5 infusion into the BLA but not the PRh (Experiments 4A and 4B).

      (6) Additionally, the rationale for Experiment 4 is not introduced before the Results section. While it is understandable that Experiment 4 functions as a follow-up to Experiment 3, it would be helpful to briefly explain the reasoning behind its inclusion.

      Experiment 4 follows from the results obtained in Experiment 3; and, as noted, the reasoning for its inclusion is provided locally in its introduction. We attempted to flag this experiment earlier in the general introduction to the paper; but this came at the cost of clarity to the overall story. As such, our revised paper retains the local introduction to this experiment. It is reproduced here for convenience:

      “In Experiments 3A and 3B, conditioning of the pre-exposed S1 required NMDAR-activation in the BLA and not the PRh; whereas conditioning of the pre-exposed S2 required NMDAR-activation in the PRh and not the BLA. We attributed these findings to the fact that the pre-exposed S2 was separated from the shock by S1 during conditioning of the S2-S1-shock sequences in stage 2: hence, at the time of the shock, S2 was no longer processed in the focal state of attention supported by the BLA; instead, it was processed in the peripheral state of attention supported by the PRh.

      “Experiments 4A and 4B employed a modification of the protocol used in Experiments 3A and 3B to examine whether a pre-exposed S1 influences the processing of a pre-exposed S2 across conditioning with S2-S1-shock sequences. The design of these experiments is shown in Figure 4A. Briefly, in each experiment, two groups of rats were exposed to a session of S2-S1 pairings in stage 1 and, 24 hours later, a session of S2-[trace]-shock pairings in stage 2, where the duration of the trace interval was equivalent to that of S1 in the preceding experiments. Immediately prior to the trace conditioning session in stage 2, one group in each experiment received an infusion of DAP5 or vehicle only into either the PRh (Experiment 4A) or BLA (Experiment 4B). Finally, all rats were tested with presentations of the S2 alone in stage 3. If the substrates of conditioning to S2 are determined only by the amount of time between presentations of this stimulus and foot shock in stage 2, the results obtained in Experiments 4A and 4B should be the same as those obtained in Experiments 3A and 3B: acquisition of freezing to S2 will require activation of NMDARs in the PRh and not the BLA. If, however, the presence of S1 in the preceding experiments (Experiments 3A and 3B) accelerated the rate at which processing of S2 transitioned from the focus of attention to its periphery, the results obtained in Experiments 4A and 4B will differ from those obtained in Experiments 3A and 3B. That is, in contrast to the preceding experiments where acquisition of freezing to S2 required NMDAR-activation in the PRh and not the BLA, here acquisition of freezing to S2 should require NMDAR-activation in the BLA but not the PRh.”

      Reviewer #1 (Recommendations for the authors):

      I greatly enjoyed reading and reviewing this manuscript, and so I only have boilerplate recommendations.

      (1) I might add a couple of sentences discussing how/why preconditioned fear could be intact while first-order fear is impaired. Of course, if I am interpreting the provided interpretation correctly, the reason is that peripheral processing is still intact even when BLA NMDA receptors are blocked, and so mediated conditioning still occurs. Does this mean that mediated conditioning does not require learning the first-order relationship, and that they occur in parallel? Perhaps I just missed this, but I cannot help but wonder whether/how the psychological processes at play might change when first-order learning is impaired, so this would be greatly appreciated.

      As noted above, we have revised the general introduction (around lines 55-59) to clarify that the direct S1-shock and mediated S2-shock associations form in parallel. Hence, manipulations that disrupt first-order fear to the S1 (such as a BLA infusion of the NMDA receptor antagonist, DAP5) do not automatically disrupt the expression of sensory preconditioned fear to the S2.

      (2) Adding to the above - does the SOP or another theory predict serial vs parallel information flow from focal state to peripheral, or perhaps it is both to some extent?

      SOP predicts both serial and parallel processing of information in its focal and peripheral states. That is, some proportion of the elements that comprise a stimulus may decay from the focal state of attention to the periphery (serial processing); hence, at any given moment, the elements that comprise a stimulus can be represented in both focal and peripheral states (parallel processing).

      Given the nature of the designs and tools used in the present study (between-subject assessment of a DAP5 effect in the BLA or PRh), we selected parameters that would maximize the processing of the S2 and S1 stimuli in one or the other state of activation; hence the results of the present study. We are currently examining the joint processing of stimulus elements across focal and peripheral states using simultaneous recordings of activity in the BLA and PRh. These recordings are collected from rats trained in the different stages of a within-subject sensory preconditioning protocol. The present study created the basis for this work, which will be published separately in due course.
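The serial and parallel processing described in this response can be illustrated with a minimal sketch of SOP-style element dynamics. It assumes the usual three-state scheme (inactive, A1, A2), with A1 read as the focal state and A2 as the peripheral state, as suggested in the reviews above. The function name and the transition values `p1`, `pd1`, and `pd2` are hypothetical, chosen for illustration only; they are not parameters from the study.

```python
def simulate_sop(n_steps, stim_on, p1=0.6, pd1=0.1, pd2=0.02):
    """Track the proportion of a stimulus's elements in each SOP state.

    stim_on: function mapping a time step to True while the stimulus is present.
    p1:  per-step probability that an inactive element enters A1 (focal).
    pd1: per-step probability that an A1 element decays to A2 (peripheral).
    pd2: per-step probability that an A2 element returns to inactive.
    All three values are illustrative assumptions.
    """
    inactive, a1, a2 = 1.0, 0.0, 0.0
    history = []
    for t in range(n_steps):
        to_a1 = p1 * inactive if stim_on(t) else 0.0  # recruitment into focus
        to_a2 = pd1 * a1                              # serial decay: focal -> peripheral
        to_inactive = pd2 * a2                        # peripheral -> inactive
        inactive += to_inactive - to_a1
        a1 += to_a1 - to_a2
        a2 += to_a2 - to_inactive
        history.append((inactive, a1, a2))
    return history


# Stimulus present for the first 10 steps, then absent.
history = simulate_sop(30, lambda t: t < 10)
for t in (0, 5, 9, 20, 29):
    inactive, a1, a2 = history[t]
    print(f"t={t:2d}  inactive={inactive:.3f}  A1={a1:.3f}  A2={a2:.3f}")
```

Each element moves through the states in a fixed order (serial), yet at most time steps some proportion of elements sits in A1 while another proportion sits in A2 (parallel), which is the point made in the response.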

      (3) The organization of PRh vs BLA is nice and consistent across each figure, but I would suggest adding any kind of additional demarcation beyond the colors and text, maybe just more space between AB / CD. The figure text indicating PRh/BLA is a bit small.

      Thank you for the suggestion – we have added more space between the top and bottom panels of the figure.

      (4) Line 496 typo ..."in the BLA but not the BLA".

      Apologies for the typo - this has been corrected.

      Reviewer #2 (Recommendations for the authors):

      I found the experiments to be extremely well-designed and the results convincing and exciting. The hypothesis of the focal and peripheral states of attention being encoded by BLA and PRh respectively, is enticing, yet as indicated in the public review, this does not seem to be the only possible interpretation. This is my only serious comment for the authors.

      (1) I think it would be worth reframing the article slightly to give credence to alternative hypotheses. Not to say that the authors' intriguing hypothesis shouldn't be an integral part of the introduction, but no alternatives are mentioned. In experiment 2, could the fact that S2 is already a predictor of S1 not block new learning to S2? In the framework of stimulus-stimulus associations, there would be no surprise in the serial-compound stage of conditioning at the onset of S1. This may prevent direct learning of the S2-shock association within the BLA. This type of association may as well fall under the peripheral/focal theory, but I don't think it's necessary to frame this possibility in terms of a peripheral/focal theory. To build on this alternative interpretation, the absence of S1 in experiment 4 may induce a prediction error (S2 predicts S1, but it's omitted), which could support learning by S2. The peripheral and focal states appear to correspond to A2 and A1 in SOP extremely well, and I think it would potentially add interest and support. If the authors do intend to make the paper a strong argument for their hypothesis, perhaps a few additional experiments may be introduced. If the novelty of S2 is critical for S2 not to be processed in a focal state during the serial compound stage, could pre-exposure of S2 alone allow for dependence of S2-shock on the PRh? Assuming this is what the authors would predict, this might disentangle the S-S theory mentioned above from the peripheral/focal theory. Or perhaps run an experiment S2-X in stage 1 and S2-S1-shock in stage 2? This said, I think the experiments are more than sufficient for an exciting paper as is, and I don't think running additional experiments is necessary. I would only argue for this if the authors make a hard claim about the peripheral/focal theory, as is the case for the way the paper is currently written.

      We appreciate the reviewer’s excellent point and suggestions. We have included an additional paragraph in the Discussion on page 24 (lines 566-594).  “An additional point to consider in relation to Experiments 3A, 3B, 4A and 4B is the level of surprise that rats experienced following presentations of the familiar S2 in stage 2. Specifically, in Experiments 3A and 3B, S2 was followed by the expected S1 (low surprise) and its conditioning required activation of NMDA receptors in the PRh and not the BLA. By contrast, in Experiments 4A and 4B, S2 was followed by omission of the expected S1 (high surprise) and its conditioning required activation of NMDA receptors in the BLA and not the PRh. This raises the possibility that surprise, or prediction error, also influences the way that S2 is processed in focal and peripheral states of attention. When prediction error is low, S2 is processed in the peripheral state of attention: hence, learning under these circumstances requires NMDA receptor activation in the PRh and not the BLA. By contrast, when prediction error is high, S2 is preserved in the focal state of attention: hence, learning under these circumstances requires NMDA receptor activation in the BLA and not the PRh. The impact of prediction error on the processing of S2 could be assessed using two types of designs. In the first design, rats are pre-exposed to S2-S1 pairings in stage 1 and this is followed by S2-S3-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is followed by surprise in omission of S1 and presentation of S3. Thus, if a large prediction error maintains processing of the familiar S2 in the BLA, we might expect that its conditioning in this design would require NMDA receptor activation in the BLA (in contrast to the results of Experiment 3B) and no longer require NMDA receptor activation in the PRh (in contrast to the results of Experiment 3A). 
In the second design, rats are pre-exposed to S2 alone in stage 1 and this is followed by S2-[trace]-shock pairings in stage 2. The important feature of this design is that, in stage 2, the S2 is not followed by the surprising omission of any stimulus. Thus, if a small prediction error shifts processing of the familiar S2 to the PRh, we might expect that its conditioning in this design would no longer require NMDA receptor activation in the BLA (in contrast to the results of Experiment 4B) but, instead, require NMDA receptor activation in the PRh (in contrast to the results of Experiment 4A). Future studies will use both designs to determine whether prediction error influences the processing of S2 in the focus versus periphery of attention and, thereby, whether learning about this stimulus requires NMDA receptor activation in the BLA or PRh.”

      (3) I was surprised the authors didn't frame their hypothesis more in terms of Wagner's SOP model. It was only minimally mentioned, and I think the authors' theory would benefit if it were included more in the introduction. I was wondering whether the authors may have avoided this framing to avoid an expectation for modeling SOP in their design. If this were the case, I think the paper stands on its own without modeling, and at least for myself, a comparison to SOP would not require modeling of SOP. If this was the authors' concern for avoiding it, I would suggest to the authors that they need not be concerned about it.

      We appreciate the endorsement of Wagner’s SOP theory as a nice way of framing our results. We are currently working on a paper in which we use simulations to show how Wagner’s theory can accommodate the present findings as well as others in the literature on sensory preconditioning. For this reason, we have not changed the current paper in relation to this point.

    1. Author response:

      Reviewer #1 (Public review)

      I have to preface my evaluation with a disclosure that I lack the mathematical expertise to fully assess what seems to be the authors' main theoretical contribution. I am providing this assessment to the best of my ability, but I cannot substitute for a reviewer with more advanced mathematical/physical training.

      Summary:

      This paper describes a new theoretical framework for measuring parsimony preferences in human judgments. The authors derive four metrics that they associate with parsimony (dimensionality, boundary, volume, and robustness) and measure whether human adults are sensitive to these metrics. In two tasks, adults had to choose one of two flower beds which a statistical sample was generated from, with or without explicit instruction to choose the flower bed perceptually closest to the sample. The authors conduct extensive statistical analyses showing that humans are sensitive to most of the derived quantities, even when the instructions encouraged participants to choose only based on perceptual distance. The authors complement their study with a computational neural network model that learns to make judgments about the same stimuli with feedback. They show that the computational model is sensitive to the tasks communicated by feedback and only uses the parsimony-associated metrics when feedback trains it to do so.

      Strengths:

      (1)  The paper derives and applies new mathematical quantities associated with parsimony. The mathematical rigor is very impressive and is much more extensive than in most other work in the field, where studies often adopt only one metric (such as the number of causes or parameters). These formal metrics can be very useful for the field.

      (2)  The studies are preregistered, and the statistical analyses are strong.

      (3)  The computational model complements the behavioral findings, showing that the derived quantities are not simply equivalent to maximum-likelihood inference in the task.

      (4)  The speculations in the discussion section (e.g., the idea that human sensitivity is driven by the computational demands each metric requires) are intriguing and could usefully guide future work.

      Weaknesses:

      (1) The paper is very hard to understand. Many of the key details of the derived metrics are in the appendix, with very little accessible explanation in the main text. The figures helped me understand the metrics somewhat, although I am still not sure how some of them (such as boundary or robustness as measured here) are linked to parsimony. I understand that this is addressed by the derivations in the appendix, but as a computational cognitive scientist, I would have benefited from more accessible explanations. Important aspects of the human studies are also missing from the main text, such as the sample size for Experiment 2.

      (2) It is not fully clear whether the sensitivity of human participants to some of the quantities convincingly reported here actually means that participants preferred shapes according to the corresponding aspect of parsimony. The title and framing suggest that parsimony "guides" human decision-making, which may lead readers to conclude that humans prefer more parsimonious shapes. I am not sure the sensitivity findings alone support this framing, but it might just be my misunderstanding of the analyses.

      (3) The stimulus set included only four combinations of shapes, each designed to diagnostically target one of the theoretical quantities. It is unclear whether the results are robust or specific to these particular 4 stimuli.

      (4) The study is framed as measuring "decision-making," but the task resembles statistical inference (e.g., which shape generated the data) or perceptual judgment. This is a minor point since "decision-making" is not well defined in the literature, yet the current framing in the title gave me the initial impression that humans would be making preference choices and learning about them over time with feedback.

      We are grateful for the supportive comments highlighting the rigor of our experimental design and data analysis. The Reviewer lists four points under “weaknesses”, to which we reply below. 

      (1)  The paper is very hard to understand

      In the revised version of the paper, we will expand the main text to include a more detailed and intuitive description of the terms of the Fisher Information Approximation, in particular clarifying the interpretation of robustness and boundary as parsimony. We will also include more details that are now given only in Methods, such as the sample size for the second experiment.
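For orientation, the Fisher Information Approximation is commonly written in the following form. This is the textbook expansion, given here as an assumption about notation rather than as the paper's own equation, and it omits the additional boundary term that the authors derive. The symbols $x$ (data), $N$ (sample size), $d$ (number of parameters), $\hat\theta$ (maximum-likelihood parameters), $I$ (expected Fisher information), and $J$ (observed Fisher information) are introduced for illustration.

```latex
-\ln p(x \mid M) \;\approx\;
  \underbrace{-\ln p(x \mid \hat\theta)}_{\text{goodness of fit}}
  \;+\; \underbrace{\frac{d}{2}\,\ln\frac{N}{2\pi}}_{\text{dimensionality}}
  \;+\; \underbrace{\ln \int \sqrt{\det I(\theta)}\;\mathrm{d}\theta}_{\text{volume}}
  \;+\; \underbrace{\frac{1}{2}\,\ln\frac{\det J(\hat\theta)}{\det I(\hat\theta)}}_{\text{robustness}}
```

Read this way, every term after the first is a complexity penalty: a model with more parameters, a larger parameter-space volume, or a more fragile fit at $\hat\theta$ receives lower approximate evidence for the same goodness of fit.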

      (2) Sensitivity of human participants 

      We do argue, and believe, that our data show that people tend to prefer simpler shapes. However, giving a well-posed definition of "preference" in this context turns out to be nontrivial.

      At the very least, any statement such as "people prefer shape A over B" should be qualified with something like “when the distance of the data from both shapes is the same.” In other words, one should control for goodness-of-fit. Even before making any reference to our behavioral model, this phenomenon (a preference for the simpler model when goodness of fit is matched between models) is visible in Figure 3a, where the effective decision boundary used by human participants is closer to the more complex model than the cyan line representing the locus of points with equal goodness of fit under the two models (or equivalently, with the same Euclidean distance from the two shapes). The goal of our theory and our behavioral model is precisely to systematize this sort of control, extending it beyond just goodness-of-fit and allowing us to control simultaneously for multiple features of model complexity that may affect human behavior in different ways. In other words, it allows us not only to ask whether people prefer shape A over B after controlling for the distance of the data to the shapes, but also to understand to what extent this preference is driven by important geometrical features such as dimensionality, volume, curvature, and boundaries of the shapes. More specifically, and importantly, our theory makes it possible to measure the strength of the preference, rather than merely asserting its existence. In our modeling framework, the existence of a preference for simpler shapes is captured by the fact that the estimated sensitivities to the complexity penalties are positive (and although they differ in magnitude, all are statistically reliable).
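One way to make the idea of "measuring the strength of the preference" concrete is a small choice-model sketch. It assumes a logistic link and a linear combination of a goodness-of-fit term and the four complexity penalties. The function name, the dictionary layout, and all numerical values are hypothetical, and the behavioral model actually fitted in the paper may differ in its details.

```python
import math

PENALTIES = ("dimensionality", "boundary", "volume", "robustness")


def choice_probability(model_a, model_b, sensitivities):
    """Probability of choosing model A over model B.

    Each model is a dict holding its log-likelihood for the observed data
    and its value on each complexity penalty. `sensitivities` holds one
    non-negative weight per term. The logistic link is an assumption.
    """
    def score(model):
        s = sensitivities["log_likelihood"] * model["log_likelihood"]
        for name in PENALTIES:
            s -= sensitivities[name] * model[name]  # penalties lower the score
        return s

    return 1.0 / (1.0 + math.exp(-(score(model_a) - score(model_b))))


# Two hypothetical models with matched goodness of fit.
simple = {"log_likelihood": -3.0, "dimensionality": 0.5, "boundary": 0.1,
          "volume": 0.2, "robustness": 0.0}
complex_ = {"log_likelihood": -3.0, "dimensionality": 1.0, "boundary": 0.3,
            "volume": 0.8, "robustness": 0.1}

weights = {"log_likelihood": 1.0, "dimensionality": 1.0, "boundary": 1.0,
           "volume": 1.0, "robustness": 1.0}
print(choice_probability(simple, complex_, weights))
```

With goodness of fit matched, any positive sensitivity to a penalty pushes the choice probability above one half in favor of the simpler model, and setting all penalty sensitivities to zero returns exactly one half. The magnitude of each weight is what quantifies how strongly that feature drives the preference.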

      (3) Generalization to different shapes  

      Thank you for bringing up this important topic. First, note that while dimensionality and volume are global properties of models and only take two possible values in our human tasks, the boundary and robustness penalties depend on the model and on the data and therefore assume a continuum of values through the tasks (note also that the boundary penalty is relevant for all task types, not just the one designed specifically to study it, because all models except the zero-dimensional dot have boundaries). Therefore, our experimental setting is less restrictive than it may seem, because it explores a range of possible values for two of the four model features. However, we agree that it would be interesting to repeat our experiment with a broader range of models, perhaps allowing their dimensionality and volume to vary more. In the same spirit, it would be interesting to study the dependence of human behavior on the amount of available data. We believe that these are all excellent ideas for further study that exceed the scope of the present paper. We will include these important points in a revised Discussion.

      (4) Usage of “decision making” vs “perceptual judgment”

      Thank you. We will clarify in the text that our usage of “decision making” overlaps with the idea of a perceptual judgment and that our experiments do not tackle sequential aspects of repeated decisions.

      Reviewer #2 (Public review):

      This manuscript presents a sophisticated investigation into the computational mechanisms underlying human decision-making, and it presents evidence for a preference for simpler explanations (Occam's razor). The authors dissect the simplicity bias into four different components, and they design experiments to target each of them by presenting choices whose underlying models differ only in one of these components. In the learning tasks, participants must infer a "law" (a logical rule) from observed data in a way that operationalizes the process of scientific reasoning in a controlled laboratory setting. The tasks are complex enough to be engaging but simple enough to allow for precise computational modeling.

      As a further novel feature, the authors derive an additional term in the expansion of the log evidence, which arises from boundary terms. This is combined with a choice model, which is the one tested in the experiments. Experiments are run both with humans and with artificial intelligence agents, showing that humans have an enhanced preference for simplicity as compared to artificial neural networks.

      Overall, the work is well written, interesting, and timely, bridging concepts in statistical inference and human decision making. Although technical details are rather elaborate, my understanding is that they represent the state of the art.

      I have only one main comment that I think deserves further discussion. Computing the complexity penalty of models may be hard. It is unlikely that humans can perform such a calculation on the fly. As the authors discuss in the final section, while the dimensionality term may be easier to compute, others (e.g., the volume term, which requires an integral) may be considerably harder to compute (it is true that they should be computed once and for all for each task, but still...). I wonder whether the sensitivity of human decision making with reference to the different terms is so different, and in particular whether it aligns with computational simplicity, or with the possibility of approximating each term by simple heuristics. Indeed, the sensitivity to the volume term is significantly and systematically lower than that of other terms. I wonder whether this relation could be made more quantitative using neural networks, using as a proxy of computational hardness the number of samples needed to reach a given error level in learning each of these terms.

      Thank you. The computational complexity associated with calculating the different terms and its potential connection to human sensitivity to the terms is an intriguing topic. As we hinted at in the discussion, we agree with the reviewer that this is a natural candidate for further research, which likely deserves its own study and exceeds the scope of the present paper. 

      As a minor aside, at least for the present task the volume term may not be that hard to compute, because it can be expressed with the number of distinguishable probability distributions in the model (Balasubramanian 1996). Given the nature of our task, where noise is Gaussian, isotropic and with known variance, the geometry of the model is actually the Euclidean geometry of the plane, and the volume is simply the (log of the) length of the line that represents the one-dimensional models, measured in units of the standard deviation of the noise.
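As a rough illustration of the remark above, the following sketch computes the log of the length of a one-dimensional model, measured in units of the noise standard deviation. The function name, the piecewise-linear representation of the model, and the example numbers are assumptions made for illustration; any additive constants in the actual volume term are not reproduced here.

```python
import math

def log_volume_1d(points, sigma):
    """Log of the length of a piecewise-linear curve, in units of sigma.

    points: list of (x, y) vertices describing the one-dimensional model
            (hypothetical representation, chosen for this sketch).
    sigma:  standard deviation of the isotropic Gaussian noise.
    """
    # Euclidean length of the curve: sum of the segment lengths.
    length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    return math.log(length / sigma)

# Hypothetical example: a straight segment of length 5 with noise SD 0.5.
segment = [(0.0, 0.0), (3.0, 4.0)]
print(log_volume_1d(segment, sigma=0.5))
```

Under these assumptions, halving the noise standard deviation or doubling the length of the line both raise the term by log 2, which is why it can be read off the stimulus geometry without evaluating an integral.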

      Reviewer #3 (Public review):

      Summary:

      This is a very interesting paper that documents how humans use a variety of factors that penalize model complexity and integrate over a possible set of parameters within each model. By comparison, trained neural networks also use these biases, but only on tasks where model selection was part of the reward structure. In the situation where training emphasizes maximum-likelihood decisions, only neural networks, but not humans, were able to adapt their decision-making. Humans continue to use model integration simplicity biases.

      Strengths:

      This study used a pre-registered plan for analyzing human data, which exceeds the standard of most current studies.

      The results are technically correct.

      Weaknesses:

      The presentation of the results could be improved.

      We thank the reviewer for their appreciation of our experimental design and methodology, and for pointing out (in the separate "recommendations to authors") a few passages of the paper where the presentation could be improved. We will clarify these passages in the revision.

    1. Reviewer #1 (Public review):

      Summary:

      This is a careful, well-powered treatment of age effects in resting-state MEG. Rather than extracting (say) complex connectivity measures, the authors look at the 'simplest possible thing': changes in the overall power spectrum across age.

      Strengths:

      They find significant age-related changes at different frequency bands: broadly, attenuation at low-frequency (alpha) and increased beta. These patterns are identified in a large dataset (CamCAN) and then verified in other public data.

      Weaknesses:

      Some secondary interpretations (what is "unique" to age vs global anatomy) may go beyond what the statistics strictly warrant in the current form, but these can be tightened with (I think, fairly quick) additions already foreshadowed by the authors' own analyses.

      Aims:

      The authors set out to replace piecemeal, band-by-band ageing claims with t-maps, and Cohen's f2 over sensors×frequency ("GLM-Spectrum").

      On CamCAN, six spatio-spectral peaks survive relatively strict statistical controls. The larger effects are in low-frequency and upper-alpha/beta ranges (f2 approx 0.2-0.3), while lower-alpha and gamma reach significance but with small practical impact (f2 < 0.075). A nice finding is that the same qualitative profile appears in three additional independent datasets.

      Two analyses are especially interesting. First, the authors show a difference between absolute and relative spectral magnitude (basically, within-subject normalization). Relative scaling sharpens the spectral specificity of the spatial maps, while absolute magnitude is dominated by a broad spatial mode that correlates positively across frequencies, likely reflecting head-position/field-spread factors. The replication of the main age profile is robust to preprocessing decisions (e.g., SSS movement compensation choices) - the bigger determinant of the effect is whether they apply sensor normalization (relative vs absolute).

      Second, lots of brain-related things might be related to age, and the authors spend some time trying to back out confounds/covariates. This section is handled transparently (in general, I found the writing style very clear throughout) - they examine single covariates (sex, BP, GGMV, etc.) and compare simple vs partial age effects. For example, aging is correlated with reductions in global grey-matter volume (GGMV), but it would be nice to find a measure that is independent of this: controlling for GGMV (via a linear model) reduces age-related effect sizes heterogeneously across space/frequency but does not eliminate them, a nuance the authors treat carefully.

      This is a nice paper, and I have only a few concrete suggestions:

      (1) High-gamma:

      There can be a lot of EMG / eye movement contamination (I know these were RS eyes closed data, but still..) above 30-40 Hz, and these effects are the weakest anyway. Could you add an analysis (e.g., ICA/label-based muscle component removal) and show the gamma band's sensitivity to that step? Or just note this point more clearly?

      (2) GGMV confound control:

      Controlling for GGMV reduces, but does not eliminate, age effects. I have a few questions about this: a) Could we see the residuals as a function of age? I wonder if there are non-linear effects or something else that the regression is not accounting for. Also, b) GGMV and age are highly collinear - is this an issue? Can regression really split them apart robustly? I think by some cunning orthogonalisation, you can compute the effect of age independent of GGMV. I don't think this is the same as the effect 'adjusted' for GGMV (which is what is shown here if I'm reading it correctly). Finally, of course, GGMV might actually be the thing you want to look at (because it might more accurately reflect clinical issues) - so strong correlations are not really a problem: I think really the focus might even be on using MEG to predict GGMV and controlling for age.
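The orthogonalisation question in (2) can be explored on synthetic data. The sketch below is not the authors' pipeline: the variable names, effect sizes, and the single "power" outcome are hypothetical. It compares the age coefficient from a simple regression, from a joint regression with GGMV, and from two orthogonalised designs, relying only on ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 88, n)                      # synthetic ages in years
z_age = (age - age.mean()) / age.std()
ggmv = -0.8 * z_age + 0.6 * rng.standard_normal(n)  # GGMV falls with age
power = 0.5 * z_age + 0.3 * ggmv + rng.standard_normal(n)

def ols(y, *regressors):
    """Least-squares coefficients; index 0 is the intercept."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def residualise(y, x):
    """Part of y that is linearly unexplained by x (and an intercept)."""
    b = ols(y, x)
    return y - (b[0] + b[1] * x)

beta_simple = ols(power, age)[1]                    # age alone
beta_partial = ols(power, age, ggmv)[1]             # age adjusted for GGMV
beta_age_orth = ols(power, residualise(age, ggmv), ggmv)[1]   # age made orthogonal to GGMV
beta_ggmv_orth = ols(power, age, residualise(ggmv, age))[1]   # GGMV made orthogonal to age

print(beta_simple, beta_partial, beta_age_orth, beta_ggmv_orth)
```

In this linear setting, orthogonalising age with respect to GGMV reproduces the adjusted (partial) age coefficient, whereas orthogonalising GGMV with respect to age reproduces the simple age coefficient. Which variable is orthogonalised therefore decides which of the two effects is reported; non-linear dependencies would need a different treatment.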

    1. experiments and hardware design have a certain “latency” and need to be iterated upon a certain “irreducible” number of times in order to learn things that can’t be deduced logically. But massive parallelism may be possible on top of that

      If it ends up developing this far, I think what could be discovered is endless, but the driving factor will be the parallelism. As mentioned, experiments take so long that, in the time one is being done, many other things could be run and tested as well; we just don't have the people or resources to do so right now.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      (1) How is this simplified model representative of what is observed biologically? A bump model does not naturally produce oscillations. How would the dynamics of a rhythm generator interact with this simplistic model?

      Bump models naturally produce sequential activity, and can be engineered to repeat this sequential activity periodically (Zhang, 1996; Samsonovich and McNaughton, 1997; Murray and Escola, 2017). This is the basis for the oscillatory behavior in the model presented here. As we describe in our paper, such a model is consistent with numerous neurobiological observations about cell-type-specific connectivity patterns. The reviewer is, however, correct to point out that our model does not incorporate other key neurobiological features--in particular, intracellular dynamical properties--that have been shown to play important roles in rhythm generation. Our aim in this work is to establish a circuit-level mechanism for rhythm generation, complementary to classical models that rely on intracellular dynamics for rhythm generation. Whether and how these mechanisms work together is something that we plan to explore in future work, and we have added a sentence to the Discussion to this effect.

      (2) Would this theoretical construct survive being expressed in a biophysical model? It seems that it should, but even a simple biological model with the basic patterns of connectivity shown here would greatly increase confidence in the biological plausibility of the theory.

      We thank the reviewer for pointing out this way to strengthen our paper. We implemented the connectivity developed in the rate models in a spiking neuron model which used EI-balanced Poisson noise as input drive. We found that we could reproduce all the main results of our analysis. In particular, with a realistic number of neurons, we observed swimming activity characterized by (i) left-right alternation, (ii) rostral-caudal propagation, and (iii) variable speed control with constant phase lag. The spiking model demonstrates that the connectivity-motif based mechanisms for rhythmogenesis that we propose are robust in a biophysical setting.

      We included these results in the updated manuscript in a new Results subsection titled “Robustness in a biophysical model.”

      (3) How stable is this model in its output patterns? Is it robust to noise? Does noise, in fact, smooth out the abrupt transitions in frequency in the middle range?

      The newly added spiking model implementation of the network demonstrates that the core mechanisms of our models are robust to noise, since the connectivity is randomly chosen and the input drive is Poisson noise.

      To test the effect of noise as it is parametrically varied, we also added noise directly to the rate models in the form of white noise input to each unit. Namely, the rate model was adapted to obey the stochastic differential equation

      \[

      \tau_i \frac{dr_i(t)}{dt} = -r_i(t) + \left[ \sum_j W_{ij} r_j(t - \Delta_{ij}) + D_i + \sigma\xi_t \right]_+

      \]

      Here $\xi_t$ is a standard Gaussian white noise and $\sigma$ sets the strength of the noise. We found that the swimming patterns were robust at all frequencies up to $\sigma =  0.05$. Above this level, coherent oscillations started to break down for some swim frequencies. To investigate whether the noise smoothed out abrupt transitions, we swept through different values of noise and modularity of excitatory connections. The results showed very minor improvement in controllability (see figure below), but this was not significant enough to include in the manuscript.
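The stochastic rate equation above can be stepped forward with a simple Euler scheme. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name, the zero initial condition, the integer-step delays, and the choice to draw one independent standard normal sample per unit per time step (without any rescaling by the step size) are all assumptions, since the response does not say how the noise was discretised.

```python
import numpy as np

def simulate_rate_model(W, D, tau, delay_steps, sigma,
                        dt=1e-3, n_steps=2000, seed=0):
    """Euler integration of tau * dr/dt = -r + [W r(t - Delta) + D + sigma * xi]_+.

    W:           (n, n) connectivity matrix W_ij.
    D:           (n,) constant drive to each unit.
    tau:         scalar or (n,) time constants.
    delay_steps: (n, n) integer delays Delta_ij, in units of dt (assumption).
    sigma:       noise strength.
    Returns an (n_steps + 1, n) array of rates, starting from zero.
    """
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    max_delay = int(delay_steps.max())
    # Rows before index max_delay are the (zero) history needed by the delays.
    r = np.zeros((n_steps + max_delay + 1, n))
    cols = np.arange(n)
    for t in range(max_delay, n_steps + max_delay):
        delayed = r[t - delay_steps, cols]        # delayed[i, j] = r_j(t - Delta_ij)
        recurrent = (W * delayed).sum(axis=1)     # sum_j W_ij r_j(t - Delta_ij)
        noise = sigma * rng.standard_normal(n)    # assumed per-step discretisation
        drive = np.maximum(recurrent + D + noise, 0.0)  # rectification [.]_+
        r[t + 1] = r[t] + dt / tau * (-r[t] + drive)
    return r[max_delay:]

# Hypothetical two-unit example with mutual inhibition and a 5-step delay.
W = np.array([[0.0, -1.0], [-1.0, 0.0]])
D = np.array([1.0, 0.8])
delays = np.full((2, 2), 5)
rates = simulate_rate_model(W, D, tau=0.02, delay_steps=delays, sigma=0.05)
print(rates[-1])
```

Because the noise enters inside the rectification, rates stay non-negative as long as the step size does not exceed the time constant. Sweeping `sigma` in a loop around this function would mirror the kind of parametric noise test described in the response.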

      Author response image 1.

      (4) All figure captions are inadequate. They should have enough information for the reader to understand the figure and the point that was meant to be conveyed. For example, Figure 1 does not explain what the red dot is, what is black, what is white, or what the gradations of gray are. Or even if this is a representative connectivity of one node, or if this shows all the connections? The authors should not leave the reader guessing.

      All figure captions have been updated to enhance clarity and address these concerns.

      Reviewer #2 (Public review):

      (1) Figure 1A, if I interpret Figure 1B correctly, should there not be long descending projections as well that don't seem to be illustrated?

      Thank you for highlighting this potential point of confusion. The diagram in question was only intended to be a rough schematic of the types of connections present in the model. We have added additional descending connections as requested.

      (2) Page 5: It would be good to define what is meant by slow and fast here, as this definition changes with age in zebrafish (what developmental age)?

      We have updated the manuscript to include the sentence: “These values were chosen to coincide with observed ranges from larval zebrafish.” with appropriate citation.

      Reviewer #3 (Public review):

      (1) The authors describe a single unit as a neuron, be it excitatory or inhibitory, and the output of the simulation is the firing rate of these neurons. Experimentally and in other modeling studies, motor neurons are incorporated in the model, and the output of the network is based on motor neuron firing rate, not the interneurons themselves. Why did the authors choose to build the model this way?

      We chose to leave out the motor neurons from our models for a few reasons. While motor neurons read out the rhythmic activity generated by the interneurons and may provide some feedback, they are not required for rhythmogenesis. In fact, interneuron activity (especially in the excitatory V2a neurons (Agha et al., 2024)) is highly correlated with the ventral root bursts within the same segment. This suggests that motor neurons are primarily a local readout of the rhythmic activity of interneurons; therefore, the rhythmic swimming activity can be deduced directly from the interneurons themselves.

      Moreover, there is a lack of experimental observation of the connectivity between all the cell types considered in our model and motor neurons. Hence, it was unclear how we should include them in the model. To address this, we are currently developing a data-driven approach that will determine the proper connectivity between the motor neurons and the interneurons, including intrasegmental connections.

      (2) In the single population model (Figure 1), the authors use ipsilateral inhibitory connections that are long-range in an ascending direction. Experimentally, these connections have been shown to be local, while long-range ipsilateral connections have been shown to be descending. What were the reasons the authors chose this connectivity? Do the authors think local ascending inhibitions contribute to rostrocaudal propagation, and how?

      The long-range ascending ipsilateral inhibitory connections arise from a limitation of our modeling framework. The V1 neurons that provide these connections have been shown experimentally to fire later than other neurons (especially descending V2a neurons) within the same hemisegment (Jay et al., J Neurosci, 2023); however, our model can only produce synchronized local activity. Hence, we replace local phase offsets with spatial offsets to produce correctly structured recurrent phasic inputs. We are currently investigating a data-driven method for determining intrasegmental connectivity which should be able to produce the local phase offset and address this concern; however, this is beyond the scope of the current paper.

      (3) In the two-population model, the authors show independent control of frequency and rhythm, as has been reported experimentally. However, in these previous experimental studies, frequency and amplitude are regulated by different neurons, suggesting different networks dedicated to frequency and amplitude control. However, in the current model, the same population with the same connections can contribute to frequency or amplitude depending on relative tonic drive. Can the authors please address these differences either by changes in the model or by adding to the Discussion?

      Our prior experimental results that suggested a separation of frequency and amplitude control circuits focused on motor neuron recruitment, rather than interneuron activity (Jay et al., J Neurosci 2023; Menelaou and McLean, Nat Commun 2019). To avoid potential confusion about amplitudes of interneurons vs. of motor neurons, we have removed the results from Figure 3 about control of amplitude in the 2-population model, instead focusing this figure on the control of frequency via speed-module recruitment. For the same reason, we have removed the panel showing the effects of targeted ablations on interneuron amplitudes in Figure 7. We have kept the result about amplitude control in our Supplemental Figure S2 for the 8-population model, but we try to make it clear in the text that any relationship between interneuron amplitude and motor neuron amplitude would depend on how motor neurons are modeled, which we do not pursue in this work.

      (4) It would be helpful to add a paragraph in the Discussion on how these results could be applicable to other model systems beyond zebrafish. Cell intrinsic rhythmogenesis is a popular concept in the field, and these results show an interesting and novel alternative. It would help to know if there is any experimental evidence suggesting such network-based propagation in other systems, invertebrates, or vertebrates.

      We have expanded a paragraph in the Discussion to address these questions. In particular, we highlight how a recent study of mouse locomotor circuits produced a model with similar key features (Komi et al., 2024). These authors made direct use of experimentally determined connectivity structure and cell-type distributions, which informed a model that produced purely network-based rhythmogenesis. We also point out that inhibition-dominated connectivity has been used for understanding oscillatory behavior in neural circuits outside the context of motor control (Zhang, 1996; Samsonovich and McNaughton, 1997; Murray and Escola, 2017). Finally, we address a study that used the cell-type specific connectivity within the C. elegans locomotor circuit as the architecture for an artificial motor control system and found that the resulting system could more efficiently learn motor control tasks than general machine learning architectures (Bhattasali et al. 2022). Like our model, the Komi et al. and Bhattasali et al. models generate rhythm via structured connectivity motifs rather than via intracellular dynamical properties, suggesting that these may be a key mechanism underlying locomotion across species.

      Reviewer #1 (Recommendations for the authors):

      (1) Express this modeling construct in a simple biophysical model.

      See the new Results subsection titled “Robustness in a biophysical model.”

      (2) Please cite the classic models of Kopell, Ermentrout, Williams, Sigvardt etc., especially where you say "classic models".

      We have added relevant citations including the mentioned authors.

      (3) "Rhythmogenesis remain incompletely understood" changed to "Rhythmogenesis remains incompletely understood".

      We chose not to make this change since the ‘remain’ refers to the plural ‘core mechanisms’ not the singular ‘rhythmogenesis’.

      Reviewer #3 (Recommendations for the authors):

      (1) The figures are well made; however, it would help to add more details to the figure legends. For example, what neuron's firing rate is shown in Figure 1C? What is the red dot in 1B? Figures 3E,F,G: what is being plotted? Mean and SD? Blue dot in Figure 5C?

      All figure captions have been updated to enhance clarity and address these concerns.

      (2) A, B text missing in Figure 7.

      We have revised this figure and its caption; please see our response to Comment 3 above.

      (3) It would be nice to see the tonic drive pattern that is fed to the model for each case, along with the different firing rates in the figures. It would help understand how the tonic drive is changed to rhythmic activity.

      The tonic drive in the rate models is implemented as a constant excitatory input that is uniform across all units within the same speed-population. There is no patterning in time or location to this drive.

      References

      (1) Moneeza A Agha, Sandeep Kishore, and David L McLean. Cell-type-specific origins of locomotor rhythmicity at different speeds in larval zebrafish. eLife, July 2024

      (2) Nikhil Bhattasali, Anthony M Zador, and Tatiana Engel. Neural circuit architectural priors for embodied control. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 12744–12759. Curran Associates, Inc., 2022.

      (3) Salif Komi, August Winther, Grace A. Houser, Roar Jakob Sørensen, Silas Dalum Larsen, Madelaine C. Adamssom Bonfils, Guanghui Li, and Rune W. Berg. Spatial and network principles behind neural generation of locomotion. bioRxiv, 2024

      (4) James M Murray and G Sean Escola. Learning multiple variable-speed sequences in striatum via cortical tutoring. eLife, 6:e26084, May 2017.

      (5) Alexei Samsonovich and Bruce L McNaughton. Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15):5900–5920, 1997.

      (6) K Zhang. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. Journal of Neuroscience, 16(6):2112–2126, 1996.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Public Reviews:

      We thank the Reviewers for their thorough attention to our paper and the interesting discussion about the findings. Before responding to more specific comments, here are some general points we would like to clarify:

      (1) Ecological niche models are indeed correlative models, and we used them to highlight environmental factors associated with HPAI outbreaks within two host groups. We will further revise the terminology that could still unintentionally suggest causal inference. The few remaining ambiguities were mainly in the Discussion section, where our intent was to interpret the results in light of the broader scientific literature. Particularly, we will change the following expressions:

      -  “Which factors can explain…” to  “Which factors are associated with…” (line 75);

      -  “the environmental and anthropogenic factors influencing” to “the environmental and anthropogenic factors that are correlated with” (line 273);

      -  “underscoring the influence” to “underscoring the strong association” (line 282).

      (2) We respectfully disagree with the suggestion that an ecological niche modelling (ENM) approach is not appropriate for this work and the research question addressed therein. Ecological niche models are specifically designed to estimate the spatial distribution of the environmental suitability of species and pathogens, making them well suited to our research questions. In our study, we have also explicitly detailed the known limitations of ecological niche models in the Discussion section, in line with prior literature, to ensure their appropriate interpretation in the context of HPAI.

      (3) The environmental layers used in our models were restricted to those available at a global scale, as listed in Supplementary Information Resources S1 (https://github.com/sdellicour/h5nx_risk_mapping/blob/master/Scripts_%26_data/SI_Resource_S1.xlsx). Naturally, not all potentially relevant environmental factors could be included, but the selected layers are explicitly documented and only these were assessed for their importance. Despite this limitation, the performance metrics indicate that the models performed well, suggesting that the chosen covariates capture meaningful associations with HPAI occurrence at a global scale.

      Reviewer #1 (Public review):

      The authors aim to predict ecological suitability for transmission of highly pathogenic avian influenza (HPAI) using ecological niche models. This class of models identifies correlations between the locations of species or disease detections and the environment. These correlations are then used to predict habitat suitability (in this work, ecological suitability for disease transmission) in locations where surveillance of the species or disease has not been conducted. The authors fit separate models for HPAI detections in wild birds and farmed birds, for two strains of HPAI (H5N1 and H5Nx) and for two time periods, pre- and post-2020. The authors also validate models fitted to disease occurrence data from pre-2020 using post-2020 occurrence data. I thank the authors for taking the time to respond to my initial review and I provide some follow-up below.

      Detailed comments:

      In my review, I asked the authors to clarify the meaning of "spillover" within the HPAI transmission cycle. This term is still not entirely clear: at lines 409-410, the authors use the term with reference to transmission between wild birds and farmed birds, as distinct to transmission between farmed birds. It is implied but not explicitly stated that "spillover" is relevant to the transmission cycle in farmed birds only. The sentence, "we developed separate ecological niche models for wild and domestic bird HPAI occurrences ..." could have been supported by a clear sentence describing the transmission cycle, to prime the reader for why two separate models were necessary.

      We respectfully disagree that the term “spillover” is unclear in the manuscript. In both the Methods and Discussion sections (lines 387-391 and 409-414), we explicitly define “spillover” as the introduction of HPAI viruses from wild birds into domestic poultry, and we distinguish this from secondary farm-to-farm transmission. Our use of separate ecological niche models for wild and domestic outbreaks reflects not only the distinction between primary spillover and secondary transmission, but also the fundamentally different ecological processes, surveillance systems, and management implications that shape outbreaks in these two groups. We will clarify this choice in the revised manuscript when introducing the separate models. Furthermore, on line 83, we will add “as these two groups are influenced by different ecological processes, surveillance biases, and management contexts”.

      I also queried the importance of (dead-end) mammalian infections to a model of the HPAI transmission risk, to which the authors responded: "While spillover events of HPAI into mammals have been documented, these detections are generally considered dead-end infections and do not currently represent sustained transmission chains. As such, they fall outside the scope of our study, which focuses on avian hosts and models ecological suitability for outbreaks in wild and domestic birds." I would argue that any infections, whether they are in dead-end or competent hosts, represent the presence of environmental conditions to support transmission so are certainly relevant to a niche model and therefore within scope. It is certainly understandable if the authors have not been able to access data of mammalian infections, but it is an oversight to dismiss these infections as irrelevant.

      We understand the Reviewer’s point, but our study was designed to model HPAI occurrence in avian hosts only. We therefore restricted our analysis to wild birds and domestic poultry, which represent the primary hosts for HPAI circulation and the focus of surveillance and control measures. While mammalian detections have been reported, they are outside the scope of this work.

      Correlative ecological niche models, including BRTs, learn relationships between occurrence data and covariate data to make predictions, irrespective of correlations between covariates. I am not convinced that the authors can make any "interpretation" (line 298) that the covariates that are most informative to their models have any "influence" (line 282) on their response variable. Indeed, the observation that "land-use and climatic predictors do not play an important role in the niche ecological models" (line 286), while "intensive chicken population density emerges as a significant predictor" (line 282) begs the question: from an operational perspective, is the best (e.g., most interpretable and quickest to generate) model of HPAI risk a map of poultry farming intensity?

      We agree that poultry density may partly reflect reporting bias, but we also assumed it to be a meaningful predictor of HPAI risk. Its importance in our models is therefore expected. Importantly, our BRT framework does more than reproduce poultry distribution: it captures non-linear relationships and interactions with other covariates, allowing a more nuanced characterisation of risk than a simple poultry density map. Note also that our models distinguish intensive chicken density, extensive chicken density, and duck density. Therefore, it is not a “map of poultry farming intensity”.

      At line 282, we used the word “influence” while fully recognising that correlative models cannot establish causality. Indeed, in our analyses, “relative influence” refers to the importance metric produced by the BRT algorithm (Ridgeway, 2020), which measures correlative associations between environmental factors and outbreak occurrences. These scores are interpreted in light of the broader scientific literature; our interpretations therefore build on both our results and existing evidence, rather than on our models alone. However, in the next version of the paper, we will revise the sentence as: “underscoring the strong association of poultry farming practices with HPAI spread (Dhingra et al., 2016)”.

      I have more significant concerns about the authors' treatment of sampling bias: "We agree with the Reviewer's comment that poultry density could have potentially been considered to guide the sampling effort of the pseudo-absences to consider when training domestic bird models. We however prefer to keep using a human population density layer as a proxy for surveillance bias to define the relative probability to sample pseudo-absence points in the different pixels of the background area considered when training our ecological niche models. Indeed, given that poultry density is precisely one of the predictors that we aim to test, considering this environmental layer for defining the relative probability to sample pseudo-absences would introduce a certain level of circularity in our analytical procedure, e.g. by artificially increasing the influence of that particular variable in our models." The authors have elected to ignore a fundamental feature of distribution modelling with occurrence-only data: if we include a source of sampling bias as a covariate and do not include it when we sample background data, then that covariate would appear to be correlated with presence. They acknowledge this later in their response to my review: "...assuming a sampling bias correlated with poultry density would result in reducing its effect as a risk factor." In other words, the apparent predictive capacity of poultry density is a function of how the authors have constructed the sampling bias for their models. A reader of the manuscript can reasonably ask the question: to what degree is the model a model of HPAI transmission risk, and to what degree is the model a model of the observation process?
The sentence at lines 474-477 is a helpful addition, however the preceding sentence, "Another approach to sampling pseudo-absences would have been to distribute them according to the density of domestic poultry," (line 474) is included without acknowledgement of the flow-on consequence to one of the key findings of the manuscript, that "...intensive chicken population density emerges as a significant predictor..." (line 282). The additional context on the EMPRES-i dataset at line 475-476 ("the locations of outbreaks ... are often georeferenced using place name nomenclatures") is in conflict with the description of the dataset at line 407 ("precise location coordinates"). Ultimately, the choices that the authors have made are entirely defensible through a clear, concise description of model features and assumptions, and precise language to guide the reader through interpretation of results. I am not satisfied that this is provided in the revised manuscript.

      We thank the Reviewer for this important point. To address it, we compared model predictive performance and covariate relative influences obtained when pseudo-absences were weighted by poultry density versus human population density (Author response table 1). The results show that differences between the two approaches are marginal, both in predictive performance (ΔAUC ranging from -0.013 to +0.002) and in the ranking of key predictors (see below Author response images 1 and 2). For instance, intensive chicken density consistently emerged as an important predictor regardless of the bias layer used.

      Note: the comparison was conducted using a simplified BRT configuration for computational efficiency (fewer trees, fixed 5-fold random cross-validation, and standardised parameters). Therefore, absolute values of AUC and variable importance may differ slightly from those in the manuscript, but the relative ranking of predictors and the overall conclusions remain consistent.

      Given these small differences, we retained the approach using human population density. We agree that poultry density partly reflects surveillance bias as well as true epidemiological risk, and we will clarify this in the revised manuscript by noting that the predictive role of poultry density reflects both biological processes and surveillance systems. Furthermore, on line 289, we will add “We note, however, that intensive poultry density may reflect both surveillance intensity and epidemiological risk, and its predictive role in our models should be interpreted in light of both processes”.
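
      As an illustration of the weighting discussed above, the following is a minimal sketch of how pseudo-absence pixels could be drawn with probability proportional to a bias layer (human population density or poultry density). It is not the code used in the study: the function name `sample_pseudo_absences`, the toy raster, and the choice to sample with replacement are all assumptions made for the example.

```python
import numpy as np

def sample_pseudo_absences(bias_layer, n_points, rng):
    """Draw pixel (row, col) indices with probability proportional to a bias layer.

    Hypothetical helper: `bias_layer` is a 2-D array (e.g. human population
    density or poultry density per pixel); missing or non-positive pixels
    receive zero sampling weight.
    """
    layer = np.asarray(bias_layer, dtype=float)
    weights = layer.ravel().copy()
    # Pixels with NaN, infinite, or non-positive values are never sampled.
    weights[~np.isfinite(weights) | (weights <= 0)] = 0.0
    probs = weights / weights.sum()
    # Sampling with replacement is an assumption of this sketch.
    flat_idx = rng.choice(weights.size, size=n_points, replace=True, p=probs)
    rows, cols = np.unravel_index(flat_idx, layer.shape)
    return np.column_stack([rows, cols])

# Toy 2 x 3 bias raster: only three pixels can be drawn.
toy_layer = np.array([[0.0, 1.0, 3.0],
                      [np.nan, 0.0, 6.0]])
points = sample_pseudo_absences(toy_layer, n_points=1000, rng=np.random.default_rng(1))
```

      Swapping the array passed as `bias_layer` is the only change needed to move from one weighting scheme to the other, which is why the choice of layer directly shapes the contrast between presences and pseudo-absences for any covariate correlated with it.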

      Author response table 1.

      Comparison of model predictive performance (AUC) between models trained with pseudo-absences weighted by poultry density and by human population density, across host groups, virus types, and time periods. Differences in AUC values are shown as the value for poultry-weighted minus human-weighted pseudo-absences.

      Author response image 1.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for domestic bird outbreaks. Results are shown for four datasets: H5N1 (<2020), H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      Author response image 2.

      Comparison of variable relative influence (%) between models trained with pseudo-absences weighted by poultry density (red) and human population density (blue) for wild bird outbreaks. Results are shown for three datasets: H5N1 (>2020), H5Nx (<2020), and H5Nx (>2020).

      The authors have slightly misunderstood my comment on "extrapolation": I referred to "environmental extrapolation" in my review without being particularly explicit about my meaning. By "environmental extrapolation", I meant to ask whether the models were predicting to environments that are outside the extent of environments included in the occurrence data used in the manuscript. The authors appear to have understood this to be a comment on geographic extrapolation, or predicting to areas outside the geographic extent included in occurrence data, e.g.: "For H5Nx post-2020, areas of high predicted ecological suitability, such as Brazil, Bolivia, the Caribbean islands, and Jilin province in China, likely result from extrapolations, as these regions reported few or no outbreaks in the training data" (lines 195-197). Is the model extrapolating in environmental space in these regions? This is unclear. I do not suggest that the authors should carry out further analysis, but the multivariate environmental similarity surface (MESS; see Elith et al., 2010) is a useful tool to visualise environmental extrapolation and aid model interpretation.

      On the subject of "extrapolation", I am also concerned by the additions at lines 362-370: "...our models extrapolate environmental suitability for H5Nx in wild birds in areas where few or no outbreaks have been reported. This discrepancy may be explained by limited surveillance or underreporting in those regions." The "discrepancy" cited here is a feature of the input dataset, a function of the observation distribution that should be captured in pseudo-absence data. The authors state that Kazakhstan and Central Asia are areas of interest, and that the environments in this region are outside the extent of environments captured in the occurrence dataset, although it is unclear whether "extrapolation" is informed by a quantitative tool like a MESS or judged by some other qualitative test. The authors then cite Australia as an example of a region with some predicted suitability but no HPAI outbreaks to date, however this discussion point is not linked to the idea that the presence of environmental conditions to support transmission need not imply the occurrence of transmission (as in the addition, "...spatial isolation may imply a lower risk of actual occurrences..." at line 214). Ultimately, the authors have not added any clear comment on model uncertainty (e.g., variation between replicated BRTs) as I suggested might be helpful to support their description of model predictions.

      Many thanks for the clarification. Indeed, we interpreted your previous comments in terms of geographic extrapolations. We thank the Reviewer for these observations. We will adjust the wording to further clarify that predictions of ecological suitability in areas with few or no reported outbreaks (e.g., Central Asia, Australia) are not model errors but expected extrapolations, since ecological suitability does not imply confirmed transmission (for instance, on Line 362: “our models extrapolate environmental suitability” will be changed to “Interestingly, our models extrapolate geographical”). These predictions indicate potential environments favorable to circulation if the virus were introduced.

      In our study, model uncertainty is formally assessed when comparing, across replicates, the predictive performances of our models (Fig. S3, Table S1), the relative influence (Table S3) and the response curves (Fig. 2) associated with each environmental factor (Table S2). All of these results confirm a good convergence between replicates. Finally, we indeed did not use a quantitative tool such as a MESS to assess extrapolation but relied on a qualitative interpretation of model outputs.
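
      For readers unfamiliar with the MESS mentioned in this exchange, the sketch below shows one way the index described by Elith et al. (2010) can be computed. It is an illustrative re-implementation, not an analysis from the manuscript: the function name `mess`, the array layout, and the toy values are assumptions, and each covariate is assumed to vary within the reference set.

```python
import numpy as np

def mess(reference, targets):
    """Multivariate environmental similarity surface, after Elith et al. (2010).

    reference: (n_ref, n_vars) covariate values at the training points.
    targets:   (n_tgt, n_vars) covariate values at the prediction locations.
    Returns one score per target; negative scores flag environmental
    extrapolation in at least one covariate.
    Assumes every covariate has a non-zero range in `reference`.
    """
    reference = np.asarray(reference, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mins = reference.min(axis=0)
    maxs = reference.max(axis=0)
    ranges = maxs - mins
    # f: percentage of reference values strictly below each target value.
    f = 100.0 * (reference[None, :, :] < targets[:, None, :]).mean(axis=1)
    similarity = np.where(
        f == 0, 100.0 * (targets - mins) / ranges,
        np.where(f <= 50, 2.0 * f,
                 np.where(f < 100, 2.0 * (100.0 - f),
                          100.0 * (maxs - targets) / ranges)))
    # The MESS is the similarity of the most dissimilar covariate.
    return similarity.min(axis=1)

# Toy example with one covariate observed at 0, 1, ..., 10.
reference = np.arange(11, dtype=float).reshape(-1, 1)
scores = mess(reference, np.array([[5.0], [-5.0], [20.0]]))
```

      In this toy case the first target lies inside the sampled range and receives a positive score, while the other two lie outside it and receive negative scores, which is the pattern a MESS map would use to flag regions where predictions rest on environmental extrapolation.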

      All of my criticisms are, of course, applied with the understanding that niche modelling is imperfect for a disease like HPAI, and that data may be biased/incomplete, etc.: these caveats are common across the niche modelling literature. However, if language around the transmission cycle, the niche, and the interpretation of any of the models is imprecise, which I find it to be in the revised manuscript, it undermines all of the science that is presented in this work.

      We respectfully disagree with this comment. The scope of our study and the methods employed are clearly defined in the manuscript, and the limitations of ecological niche modelling in this context are explicitly acknowledged in the Discussion section. While we appreciate the Reviewer’s concern, the comment does not provide specific examples of unclear or imprecise language regarding the transmission cycle, niche, or interpretation of the models. Without such examples, it is difficult to identify further revisions that would improve clarity.

      Reviewer #2 (Public review):

      The geographic range of highly pathogenic avian influenza cases changed substantially around the period 2020, and there is much interest in understanding why. Since 2020 the pathogen irrupted in the Americas and the distribution in Asia changed dramatically. This study aimed to determine which spatial factors (environmental, agronomic and socio-economic) explain the change in numbers and locations of cases reported since 2020 (2020--2023). That's a causal question which they address by applying a correlative environmental niche modelling (ENM) approach to the avian influenza case data before (2015--2020) and after 2020 (2020--2023) and separately for confirmed cases in wild and domestic birds. To address their questions they compare the outputs of the respective models, and those of the first global model of the HPAI niche published by Dhingra et al. (2016).

      We do not agree with this comment. The manuscript makes clear that we quantitatively assess factors that are associated with occurrence data before and after 2020. We do not claim to determine causality. One sentence of the Introduction section (lines 75-76) could be confusing, so we intend to modify it in the final revision of our manuscript. 

      ENM is a correlative approach useful for extrapolating understandings based on sparse geographically referenced observational data over un- or under-sampled areas with similar environmental characteristics in the form of a continuous map. In this case, because the selected covariates about land cover, use, population and environment are broadly available over the entire world, modelled associations between the response and those covariates can be projected (predicted) back to space in the form of a continuous map of the HPAI niche for the entire world.

      We fully agree with this assessment of ENM approaches.

      Strengths:

      The authors are clear about expected bias in the detection of cases, such as geographic variation in surveillance effort (testing of symptomatic or dead wildlife, testing domestic flocks) and in general more detections near areas of higher human population density (because if a tree falls in a forest and there is no-one there, etc), and take steps to ameliorate those. The authors use boosted regression trees to implement the ENM, which typically feature among the best performing models for this application (also known as habitat suitability models). They ran replicate sets of the analysis for each of their model targets (wild/domestic x pathogen variant), which can help produce stable predictions. Their code and data are provided, though I did not verify that the work was reproducible.

      The paper can be read as a partial update to the first global model of H5Nx transmission by Dhingra and others published in 2016 and explicitly follows many methodological elements. Because they use the same covariate sets as used by Dhingra et al 2016 (including the comparisons of the performance of the sets in spatial cross-validation) and for both time periods of interest in the current work, comparison of model outputs is possible. The authors further facilitate those comparisons with clear graphics and supplementary analyses and presentation. The models can also be explored interactively at a weblink provided in text, though it would be good to see the model training data there too.

      The authors' comparison of ENM model outputs generated from the distinct HPAI case datasets is interesting and worthwhile, though for me, only as a response to differently framed research questions.

      Weaknesses:

      This well-presented and technically well-executed paper has one major weakness to my mind. I don't believe that ENM models were an appropriate tool to address their stated goal, which was to identify the factors that "explain" changing HPAI epidemiology.

      Here is how I understand and unpack that weakness:

      (1) Because of their fundamentally correlative nature, ENMs are not a strong candidate for exploring or inferring causal relationships.

      (2) Generating ENMs for a species whose distribution is undergoing broad scale range change is complicated and requires particular caution and nuance in interpretation (e.g., Elith et al., 2010). An important general assumption of environmental niche models is that the target species is at some kind of distributional equilibrium (at time scales relevant to the model application). In practice, that means the species has had an opportunity to reach all suitable habitats, and therefore its absence from some can be interpreted as reflecting either an unfavourable environment or interactions with other species. Here, data sets for the response (H5N1 or H5Nx case data in domestic or wild birds) were divided into two periods, 2015--2020 and 2020--2023, based on the rationale that the geographic locations and host-species profile of cases detected in the latter period were suggestive of changed epidemiology. In comparing outputs from multiple ENMs for the same target from distinct time periods the authors are expertly working in, or even dancing around, what is a known grey area, and they need to make the necessary assumptions and caveats obvious to readers.

      We thank the Reviewer for this observation. First, we constrained pseudo-absence sampling to countries and regions where outbreaks had been reported, reducing the risk of interpreting non-affected areas as environmentally unsuitable. Second, we deliberately split the outbreak data into two periods (2015-2020 and 2020-2023) because we do not assume a single stable equilibrium across the full study timeframe. This division reflects known epidemiological changes around 2020 and allows each period to be modeled independently. Within each period, ENM outputs are interpreted as associations between outbreaks and covariates, not as equilibrium distributions. Finally, by testing prediction across periods, we assessed both niche stability and potential niche shifts. These clarifications will be added to the manuscript to make our assumptions and limitations explicit.

      Line 66, we will add: “Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution. To account for this, we analysed two distinct time periods (2015-2020 and 2020-2023).”

      Line 123, we will revise “These findings underscore the ability of pre-2020 models in forecasting the recent geographic distribution of ecological suitability for H5Nx and H5N1 occurrences” to “These results suggest that pre-2020 models captured broad patterns of suitability for H5Nx and H5N1 outbreaks, while post-2020 models provided a closer fit to the more recent epidemiological situation”.

      (3) To generate global prediction maps via ENM, only variables that exist at appropriate resolution over the desired area can be supplied as covariates. What processes could influence changing epidemiology of a pathogen and are there covariates that represent them? Introduction to a new geographic area (continent) with naive population, immunity in previously exposed populations, control measures to limit spread such as vaccination or destruction of vulnerable populations or flocks? Might those control measures be more or less likely depending on the country as a function of its resources and governance? There aren't globally available datasets that speak to those factors, so the question is not why they were omitted but rather whether the authors' decision to choose ENMs, given their question, was justified. How valuable are insights based on patterns of correlation change when considering different temporal sets of HPAI cases in relation to a common and somewhat anachronistic set of covariates?

      We agree that the ecological niche models trained in our study are limited to environmental and host factors, as described in the Methods section with the selection of predictors. While such models cannot capture causality or represent processes such as immunity, control measures, or governance, they remain a useful tool for identifying broad associations between outbreak occurrence and environmental context. Our study cannot infer the full mechanisms driving changes in HPAI epidemiology, but it does provide a globally consistent framework to examine how associations with available covariates vary across time periods.

      (4) In general the study is somewhat incoherent with respect to time. Though the case data come from different time periods, each response dataset was modelled separately using exactly the same covariate dataset that predated both sets. That decision should be understood as a strong assumption on the part of the authors that conditions the interpretation: the world (as represented by the covariate set) is immutable, so the model has to return different correlative associations between the case data and the covariates to explain the new data. While the world represented by the selected covariates *may* be relatively stable (could be statistically confirmed), what about the world not represented by the covariates (see point 3)?

      We used the same covariate layers for both periods, which indeed assumes that these environmental and host factors are relatively stable at the global scale over the short timeframe considered. We believe this assumption is reasonable, as poultry density, land cover, and climate baselines do not change drastically between 2015 and 2023 at the resolution of our analysis. We agree, however, that unmeasured processes such as control measures, immunity, or governance may have changed during this time and are not captured by our covariates.

      Recommendations for the Authors:

      Reviewer #1 (Recommendations for the authors):

      - Line 400-401: "over the 2003-2016 periods" has an extra "s"; "two host species" (with reference to wild and domestic birds) would be more precise as "two host groups".

      - Remove comma line 404

      Many thanks for these comments, we have modified the text accordingly.

      Reviewer #2 (Recommendations for the authors):

      Most of my work this round is encapsulated in the public part of the review.

      The authors responded positively to the review efforts from the previous round, but I was underwhelmed with the changes to the text that resulted. Particularly in regard to limiting assumptions - the way that they augmented the text to refer to limitations raised in review downplayed the importance of the assumptions they've made. So they acknowledge the significance of the limitation in their rejoinder, but in the amended text merely note the limitation without giving any sense of what it means for their interpretation of the findings of this study.

      The abstract and findings are essentially unchanged from the previous draft.

      I still feel the near causal statements of interpretation about the covariates are concerning. These models really are not a good candidate for supporting the inference that they are making and there seem to be very strong arguments in favour of adding covariates that are not globally available.

      We never claimed causal interpretation, and we have consistently framed our analyses in terms of associations rather than mechanisms. We acknowledge that one phrasing in the research questions (“Which factors can explain…”) could be misinterpreted, and we are correcting this in the revised version to read “Which factors are associated with…”. Our approach follows standard ecological niche modelling practice, which identifies statistical associations between occurrence data and covariates. As noted in the Discussion section, these associations should not be interpreted as direct causal mechanisms. Finally, all interpretive points in the manuscript are supported by published literature, and we consider this framing both appropriate and consistent with best practice in ecological niche modelling (ENM) studies.

      We assessed predictor contributions using the “relative influence” metric, the terminology reported by the R package “gbm” (Ridgeway, 2020). This metric quantifies the contribution of each variable to model fit across all trees, rescaled to sum to 100%, and should be interpreted as an association rather than a causal effect.
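
      To make the rescaling concrete, here is a minimal sketch of how per-variable contributions can be normalised so that they sum to 100%. It mirrors the description above rather than the internal code of the “gbm” package; the function name `relative_influence`, the variable names, and the raw values are hypothetical.

```python
def relative_influence(raw_improvement):
    """Rescale raw per-variable contributions so that they sum to 100%.

    Hypothetical helper: `raw_improvement` maps each covariate name to its
    summed contribution to model fit across all trees (any non-negative unit).
    """
    total = sum(raw_improvement.values())
    if total <= 0:
        raise ValueError("total contribution must be positive")
    return {name: 100.0 * value / total for name, value in raw_improvement.items()}

# Hypothetical raw contributions for three covariates.
scores = relative_influence({
    "intensive_chicken_density": 6.0,
    "human_population_density": 3.0,
    "open_water": 1.0,
})
```

      Because the scores are shares of a fixed total, a covariate's relative influence can change between two models simply because another covariate contributes more or less, which is one more reason to read these values as descriptive associations.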

      L65-66 The general difficulty of interpreting ENM output with range-shifting species should be cited here to alert readers that they should not blithely attempt what follows at home.

      I believe that their analysis is interesting and technically very well executed, so it has been a disappointment and hard work to write this assessment. My rough-cut last paragraph of a reframed intro would go something like - there are many reasons in the literature not to do what we are about to do, but here's why we think it can be instructive and informative, within certain guardrails.

      To acknowledge this comment and the previous one, we revised lines 65-66 to: “However, recent outbreaks raise questions about whether earlier ecological niche models still accurately predict the current distribution of areas ecologically suitable for the local circulation of HPAI H5 viruses. Ecological niche model outputs for range-shifting pathogens must therefore be interpreted with caution (Elith et al., 2010). Despite this limitation, correlative ecological niche models remain useful for identifying broad-scale associations and potential shifts in distribution.”

      We respectfully disagree with the Reviewer’s statement that “there are many reasons in the literature not to do what we are about to do”. All modeling approaches, including mechanistic ones, have limitations, and the literature is clear on both the strengths and constraints of ecological niche models. Our manuscript openly acknowledges these limits and frames our findings accordingly. We therefore believe that our use of an ENM approach is justified and contributes valuable insights within these well-defined boundaries.

      Reference: Ridgeway, G. (2007). Generalized Boosted Models: A guide to the gbm package. Update, 1(1), 2007.

    1. Author response:

      The following is the authors’ response to the current reviews.

      Reviewer #2 (Public review):

      Summary:

      Using a gerbil model, the authors tested the hypothesis that loss of synapses between sensory hair cells and auditory nerve fibers (which may occur due to noise exposure or aging) affects behavioral discrimination of the rapid temporal fluctuations of sounds. In contrast to previous suggestions in the literature, their results do not support this hypothesis; young animals treated with a compound that reduces the number of synapses did not show impaired discrimination compared to controls. Additionally, their results from older animals showing impaired discrimination suggest that age-related changes aside from synaptopathy are responsible for the age-related decline in discrimination.

      Strengths:

      (1) The rationale and hypothesis are well-motivated and clearly presented.

      (2) The study was well conducted with strong methodology for the most part, and good experimental control. The combination of physiological and behavioral techniques is powerful and informative. Reducing synapse counts fairly directly using ouabain is a cleaner design than using noise exposure or age (as in other studies), since these latter modifiers have additional effects on auditory function.

      (3) The study may have a considerable impact on the field. The findings could have important implications for our understanding of cochlear synaptopathy, one of the most highly researched and potentially impactful developments in hearing science in the past fifteen years.

      Weaknesses:

      (1) I have concerns that the gerbils may not have been performing the behavioral task using temporal fine structure information.

      Human studies using the same task employed a filter center frequency that was (at least) 11 times the fundamental frequency (Marmel et al., 2015; Moore and Sek, 2009). Moore and Sek wrote: "the default (recommended) value of the centre frequency is 11F0." Here, the center frequency was only 4 or 8 times the fundamental frequency (4F0 or 8F0). Hence, relative to harmonic frequency, the harmonic spacing was considerably greater in the present study. However, gerbil auditory filters are thought to be broader than those in human. In the revised version of the manuscript, the authors provide modelling results suggesting that the excitation patterns were discriminable for the 4F0 conditions, but may not have been for the 8F0 conditions. These results provide some reassurance that the 8F0 discriminations were dependent on temporal cues, but the description of the model lacks detail. Also, the authors state that "thus, for these two conditions with harmonic number N of 8 the gerbils cannot rely on differences in the excitation patterns but must solve the task by comparing the temporal fine structure." This is too strong. Pulsed tone intensity difference limens (the reference used for establishing whether or not the excitation pattern cues were usable) may not be directly comparable to profile-analysis-like conditions, and it has been argued that frequency discrimination may be more sensitive to excitation pattern cues than predicted from a simple comparison to intensity difference limens (Micheyl et al. 2013, https://doi.org/10.1371/journal.pcbi.1003336).

      We can assume that our conclusions based on the excitation patterns are adequate when gerbil auditory filter data, frequency difference limens and intensity difference limens are put into perspective together. Kittel et al. (2002) observed an auditory-filter bandwidth about a factor of 2 larger in the gerbil than in humans, which reduces the number of independent frequency channels in the analysis of excitation patterns. The gerbil frequency-difference limen for pure tones, an indicator of the sensitivity to excitation-pattern cues, is more than an order of magnitude larger than the corresponding human frequency difference limen (Klinge and Klump 2009, https://doi.org/10.1121/1.3021315). Finally, the gerbil intensity-difference limen of 2.8 dB observed for 1-kHz pure tones is considerably larger than the 0.75 dB observed for humans in the same study (Sinnott et al. 1992). Thus, taken together these lines of evidence indicate that our conclusions regarding the potential use of excitation patterns are not too strong.

      I'm also somewhat concerned that the masking noise used in the present study was too low in level to mask cochlear distortion products. Based on their excitation pattern modelling, the authors state (without citation) that "since the level of excitation produced by the pink noise is less than 30 dB below that produced by the complex tones, distortion products will be masked." The basis for this claim is not clear. In human, distortion products may be only ~20 dB below the levels of the primaries (referenced to an external sound masker / canceller, which is appropriate, assuming that the modelling reported in the present paper did not include middle-ear effects; see Norman-Haignere and McDermott, 2016, doi: 10.1016/j.neuroimage.2016.01.050). Oxenham et al. (2009, doi: 10.1121/1.3089220) provide further cautionary evidence on the potential use of distortion product cues when the background noise level is too low (in their case the relative level of the noise in the compromised condition was only a little below that used in the present study). The masking level used in the present study may have been sufficient, but it would be useful to have some further reassurance on this point.

      In the method section, we provide the citation for estimating the size of the distortion products and the estimated signal-to-noise ratio making the basis for our estimates clear.

      We consulted Oxenham et al. (2009, doi: 10.1121/1.3089220), who suggested that distortion products may have been used by human subjects. However, in Fig. 1 of their paper, they convincingly demonstrate that even for humans, who have narrower auditory filters than gerbils, spectral cues cannot be used to evaluate the frequency shift in harmonic complex tones. We are confident that the same limitation applies to gerbils, which have wider auditory filters than humans and a lower ability to use spectral cues, as indicated by their higher frequency-difference limens and intensity-difference limens compared to humans.

      (2) The synapse reductions in the high ouabain and old groups were relatively small (mean of 19 synapses per hair cell compared to 23 in the young untreated group). In contrast, in some mouse models of the effects of noise exposure or age, a 50% reduction in synapses is observed, and in the human temporal bone study of Wu et al. (2021, https://doi.org/10.1523/JNEUROSCI.3238-20.2021) the age-related reduction in auditory nerve fibres was ~50% or greater for the highest age group across cochlear location. It could be simply that the synapse loss in the present study was too small to produce significant behavioral effects. Hence, although the authors provide evidence that in the gerbil model the age-related behavioral effects are not due to synaptopathy, this may not translate to other species (including human).

      (3) The study was not pre-registered, and there was no a priori power calculation, so there is less confidence in replicability than could have been the case. Only three old animals were used in the behavioral study, which raises concerns about the reliability of comparisons involving this group.

      Reviewer #3 (Public review):

      This study is a part of the ongoing series of rigorous work from this group exploring neural coding deficits in the auditory nerve, and dissociating the effects of cochlear synaptopathy from other age-related deficits. They have previously shown no evidence of phase-locking deficits in the remaining auditory nerve fibers in quiet-aged gerbils. Here, they study the effects of aging on the perception and neural coding of temporal fine structure cues in the same Mongolian gerbil model.

      They measure TFS coding in the auditory nerve using the TFS1 task which uses a combination of harmonic and tone-shifted inharmonic tones which differ primarily in their TFS cues (and not the envelope). They then follow this up with a behavioral paradigm using the TFS1 task in these gerbils. They test young normal hearing gerbils, aged gerbils, and young gerbils with cochlear synaptopathy induced using the neurotoxin ouabain to mimic synapse losses seen with age.

      In the behavioral paradigm, they find that aging is associated with decreased performance compared to the young gerbils, whereas young gerbils with similar levels of synapse loss do not show these deficits. When looking at the auditory nerve responses, they find no differences in neural coding of TFS cues across any of the groups. However, aged gerbils show an increase in the representation of periodicity envelope cues (around f0) compared to young gerbils or those with induced synapse loss. The authors hence conclude that synapse loss by itself doesn't seem to be important for distinguishing TFS cues, and that the behavioral deficits with age likely have to do with the misrepresented envelope cues instead.

      The manuscript is well written, and the data presented are robust. Some of the points below will need to be considered while interpreting the results of the study, in its current form. These considerations are addressable if deemed necessary, with some additional analysis in future versions of the manuscript.

      Spontaneous rates - Figure S2 shows no differences in median spontaneous rates across groups. But taking the median glosses over some of the nuances there. Ouabain (in the Bourien study) famously affects low spont rates first, and to a higher degree than medium or high spont rates. This seems to be the case (qualitatively) in Figure S2 as well, with almost no units in the low spont region in the ouabain group, compared to the other groups. Looking at distributions within each spont rate category and comparing differences across the groups might reveal some of the underlying causes for these changes. Given that overall, the study reports that low-SR fibers had a higher ENV/TFS log-z-ratio, the distribution of these fibers across groups may reveal specific effects of TFS coding by group.

      [Update: The revised manuscript has addressed these issues]

      Threshold shifts - It is unclear from the current version if the older gerbils have changes in hearing thresholds, and whether those changes may be affecting behavioral thresholds. The behavioral stimuli appear to have been presented at a fixed sound level for both young and aged gerbils, similar to the single unit recordings. Hence, age-related differences in behavior may have been due to changes in relative sensation level. Approaches such as using hearing thresholds as covariates in the analysis will help explore if older gerbils still show behavioral deficits.

      [Update: The issue of threshold shifts in aging gerbils is still unresolved in my opinion. From the revised manuscript, it appears that aged gerbils have a 36 dB shift in thresholds. While the revised manuscript provides convincing evidence that these threshold shifts do not affect the auditory nerve tuning properties, the behavioral stimuli were still presented at the same sound level for young and aged animals, and a potential 36 dB change in sensation level may affect behavioral results. The authors may consider adding thresholds as covariates in analyses or presenting any evidence that behavioral thresholds plateau along that 30 dB range.]

      Since we do not have behavioural detection thresholds from our individual animals, only CAP thresholds that represent the auditory-nerve data and cannot be translated to behavioural thresholds directly, we want to refrain from using these indirect measures as covariates in the present analysis. In addition, the study by Hamann et al. (2002, https://doi.org/10.1016/S0378-5955(02)00454-9) indicates that age-related behavioural threshold increases are smaller than threshold increases obtained from auditory brainstem response measurements. Finally, statistical analyses on very small samples can be unreliable due to problems of power, generalisability, and susceptibility to outliers.

      Moore and Sek (2009) in their paper on the TFS1 test pointed out that the effect of signal level on the TFS1 threshold in normal hearing human subjects was small when the signal-to-noise ratio between the broadband masking noise and the complex tone was kept constant. Furthermore, the masking noise will raise the thresholds of normal hearing gerbils and old gerbils with an audibility threshold increase to about the same signal-to-noise ratio. Thus, as long as the signal remains audible to the behaviourally tested gerbil which can be expected at an overall signal level of 68 dB SPL, we expect little effect of raised audibility thresholds on the TFS1 threshold. The lack of temporal processing deficits in the auditory-nerve fibers of old, mildly hearing impaired gerbils compared to those in normal hearing young adult gerbils further strengthens this argument.
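
The arithmetic behind this exchange can be sketched as follows. This is a minimal illustration, not part of the reviewed analysis: the 68 dB SPL level and the 36 dB shift come from the text above, while the young-animal quiet threshold (`assumed_young_threshold`) is a hypothetical placeholder. Note also that the 36 dB figure is a CAP threshold shift, which the authors argue cannot be translated directly into a behavioural threshold shift.

```python
def sensation_level(level_db_spl: float, threshold_db_spl: float) -> float:
    """Sensation level (dB SL): presentation level minus the listener's threshold."""
    return level_db_spl - threshold_db_spl


STIMULUS_LEVEL = 68.0       # dB SPL, overall stimulus level stated in the response
AGE_THRESHOLD_SHIFT = 36.0  # dB, CAP threshold shift quoted by the reviewer

# Hypothetical quiet threshold for a young gerbil; NOT reported in the text.
assumed_young_threshold = 10.0

young_sl = sensation_level(STIMULUS_LEVEL, assumed_young_threshold)
aged_sl = sensation_level(STIMULUS_LEVEL, assumed_young_threshold + AGE_THRESHOLD_SHIFT)

# The difference equals the threshold shift, whatever young threshold is assumed.
sl_difference = young_sl - aged_sl
print(f"young: {young_sl} dB SL, aged: {aged_sl} dB SL, difference: {sl_difference} dB")
```

The difference in sensation level in quiet is fixed by the threshold shift alone. The authors' counter-argument is that, in the presence of the masking noise, the masked thresholds (and hence the signal-to-noise ratio) are about the same for both groups, which this quiet-threshold calculation does not capture.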

      Task learning in aged gerbils - It is unclear if the aged gerbils really learn the task well in two of the three TFS1 test conditions. The d' of 1, which is usually used as the criterion for learning, was not reached by the aged gerbils even for the easiest stimuli in all but one condition (Fig. 5H), and in that condition there doesn't seem to be any age-related deficit in behavioral performance (Fig. 6B). Hence, dissociating the inability to learn the task from the inability to perceive TFS1 cues in those animals becomes challenging.

      [Update: The revised manuscript sufficiently addresses these issues, with the caveat of hearing threshold changes affecting behavioral thresholds mentioned above].

      As we argued above, an audibility threshold increase in the old gerbils is unlikely to explain the raised TFS1 thresholds in the old gerbils.

      Increased representation of periodicity envelope in the AN - the mechanisms for the increased representation of periodicity envelope cues are unclear. The authors point to some potential central mechanisms, but given that these are recordings from the auditory nerve, what central mechanisms these may be is unclear. If the authors are suggesting some form of efferent modulation only at the f0 frequency, no evidence for this is presented. It appears more likely that the enhancement may be due to outer hair cell dysfunction (widened tuning, distorted tonotopy). Given this increased envelope coding, the potential change in sensation level for the behavior (from the comment above), and no change in neural coding of TFS cues across any of the groups, a simpler interpretation may be: TFS coding is not affected in remaining auditory nerve fibers after age-related or ouabain-induced synapse loss, but behavioral performance is affected by outer hair cell dysfunction with age.

      [Update: The revised manuscript has addressed these issues]

      Emerging evidence seems to suggest that cochlear synaptopathy and/or TFS encoding abilities might be reflected in listening effort rather than behavioral performance. Measuring some proxy of listening effort in these gerbils (like reaction time) to see if that has changed with synapse loss, especially in the young animals with induced synaptopathy, would make an interesting addition to explore perceptual deficits of TFS coding with synapse loss.

      [Update: The revised manuscript has addressed these issues]

      Reviewer #3 (Recommendations for the authors):

      Thank you for your revisions. They largely address most of my initial concerns. The issue of threshold shifts potentially affecting behavioral thresholds still remains unresolved in my opinion. The new data about unaltered tuning curves is convincing that the auditory nerve fiber recordings are unaffected by threshold shifts. But am I correct in my understanding that the threshold shift with age was 36 dB relative to the young (L168)? If so, wouldn't the fact that behavior was performed at 68 dB SPL regardless of group affect the behavioral thresholds with age? Is there any additional evidence that suggests that behavioral performance plateaus along that ~30dB range that the authors could include to strengthen this claim?

      In our response above to reviewer #3 and to reviewer #2 we provided additional arguments why we think that an audibility threshold increase in old gerbils cannot explain their compromised TFS1 thresholds.


      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      Summary:  

      The authors investigate the effects of aging on auditory system performance in understanding temporal fine structure (TFS), using both behavioral assessments and physiological recordings from the auditory periphery, specifically at the level of the auditory nerve. This dual approach aims to enhance understanding of the mechanisms underlying observed behavioral outcomes. The results indicate that aged animals exhibit deficits in behavioral tasks for distinguishing between harmonic and inharmonic sounds, which is a standard test for TFS coding. However, neural responses at the auditory nerve level do not show significant differences when compared to those in young, normal-hearing animals. The authors suggest that these behavioral deficits in aged animals are likely attributable to dysfunctions in the central auditory system, potentially as a consequence of aging. To further investigate this hypothesis, the study includes an animal group with selective synaptic loss between inner hair cells and auditory nerve fibers, a condition known as cochlear synaptopathy (CS). CS is a pathology associated with aging and is thought to be an early indicator of hearing impairment. Interestingly, animals with selective CS showed physiological and behavioral TFS coding similar to that of the young normal-hearing group, contrasting with the aged group's deficits. Despite histological evidence of significant synaptic loss in the CS group, the study concludes that CS does not appear to affect TFS coding, either behaviorally or physiologically.

      We agree with the reviewer’s summary.

      Strengths:  

      This study addresses a critical health concern, enhancing our understanding of mechanisms underlying age-related difficulties in speech intelligibility, even when audiometric thresholds are within normal limits. A major strength of this work is the comprehensive approach, integrating behavioral assessments, auditory nerve (AN) physiology, and histology within the same animal subjects. This approach enhances understanding of the mechanisms underlying the behavioral outcomes and provides confidence in the actual occurrence of synapse loss and its effects. The study carefully manages controlled conditions by including five distinct groups: young normal-hearing animals, aged animals, animals with CS induced through low and high doses, and a sham surgery group. This careful setup strengthens the study's reliability and allows for meaningful comparisons across conditions. Overall, the manuscript is well-structured, with clear and accessible writing that facilitates comprehension of complex concepts.

      Weaknesses:

      The stimulus and task employed in this study are very helpful for behavioral research, and using the same stimulus setup for physiology is advantageous for mechanistic comparisons. However, I have some concerns about the limitations in auditory nerve (AN) physiology. Due to practical constraints, it is not feasible to record from a large enough population of fibers that covers a full range of best frequencies (BFs) and spontaneous rates (SRs) within each animal. This raises questions about how representative the physiological data are for understanding the mechanism in behavioral data. I am curious about the authors' interpretation of how this stimulus setup might influence results compared to methods used by Kale and Heinz (2010), who adjusted harmonic frequencies based on the characteristic frequency (CF) of recorded units. In contrast, the harmonic frequencies in this study are fixed across all CFs, meaning that many AN fibers may not be tuned closely to the stimulus frequencies, which adds sampling variability to the results. If units are not responsive to the stimulus, further clarification on detecting mistuning and phase locking to TFS within this setup would be valuable.

      We chose the stimuli for the AN recordings to be identical to the stimuli used in the behavioral evaluation of the perceptual sensitivity. Only with this approach can we directly compare the response of the population of AN fibers with perception measured in behavior.

      The stimuli are complex, i.e., comprise many frequency components, and were presented at 68 dB SPL. Thus, the stimuli excite a given fiber within a large portion of the fiber’s receptive field. Furthermore, during recordings, we assured ourselves by audiovisual control that fibers responded to the stimuli. Otherwise, it would have cost valuable recording time to record from a nonresponsive AN fiber.

      Given the limited number of units per condition (sometimes as few as three for certain conditions), I wonder if CF-dependent variability might impact the results of the AN data in this study; discussing this factor could help with better understanding the results. While the use of the same stimuli for both behavioral and physiological recordings is understandable, a discussion on how this choice affects interpretation would be beneficial. In addition, a 60 dB stimulus could saturate high spontaneous rate (HSR) AN fibers, influencing neural coding and phase-locking to TFS. Separating SR groups could potentially help address these issues and improve interpretive clarity.

      A deeper discussion on the role of fiber spontaneous rate could also enhance the study. How might considering SR groups affect AN results related to TFS coding? While some statistical measures are included in the supplement, a more detailed discussion in the main text could help in interpretation.

      We do not think that it will be necessary to conduct any statistical analysis in addition to that already reported in the supplement.

      We considered moving some supplementary information back into the main manuscript but decided against it. Our single-unit sample was not sufficient, i.e. not all subpopulations of auditory-nerve fibers were sufficiently sampled for all animal treatment groups, to conclusively resolve every aspect that may be interesting to explore. The power of our approach lies in the direct linkage of several levels of investigation – cochlear synaptic morphology, single-unit representation and behavioral performance – and, in the main manuscript, we focus on the core question of synaptopathy and its relation to temporal fine structure perception. This is now spelled out clearly in lines 197 - 203 of the main manuscript.  

      Although Figure S2 indicates no change in median SR, the high-dose treatment group lacks LSR fibers, suggesting a different distribution based on SR for different animal groups, as seen in similar studies on other species. A histogram of these results would be informative, as LSR fiber loss with CS (whether induced by ouabain in gerbils or noise in other animals) is well documented (e.g., Furman et al., 2013).

      Figure S2 was revised to avoid overlap of data points and show the distributions more clearly. Furthermore, the sample sizes for LSR and HSR fibers are now provided separately.

      Although ouabain effects on gerbils have been explored in previous studies, since these data already seem to have been recorded for the animals in this study, a brief description of changes in auditory brainstem response (ABR) thresholds, wave 1 amplitudes, and tuning curves for animals with cochlear synaptopathy (CS) in this study would be beneficial. This would confirm that ouabain selectively affects synapses without impacting outer hair cells (OHCs). For aged animals, since ABR measurements were taken, comparing hearing differences between normal and aged groups could provide insights into the pathologies besides CS in aged animals. Additionally, examining subject variability in treatment effects on hearing and how this correlates with behavior and physiology would yield valuable insights. If space is limited, a brief clarification or inclusion in the supplementary material may be sufficient.

      We thank the reviewer for this constructive suggestion. The requested data were added in a new section of the Results, entitled “Threshold sensitivity and frequency tuning were not affected by the synapse loss.” (lines 150 – 174). Our young-adult, ouabain-treated gerbils showed no significant elevations of CAP thresholds and their neural tuning was normal. Old gerbils showed the typical threshold losses for individuals of comparable age, and normal neural tuning, confirming previous reports. Thus, there was no evidence for relevant OHC impairments in any of our animal groups.   

      Another suggestion is to discuss the potential role of MOC efferent system and effect of anesthesia in reducing efferent effects in AN recordings. This is particularly relevant for aged animals, as CS might affect LSR fibers, potentially disrupting the medial olivocochlear (MOC) efferent pathway. Anesthesia could lessen MOC activity in both young and aged animals, potentially masking efferent effects that might be present in behavioral tasks. Young gerbils with functional efferent systems might perform better behaviorally, while aged gerbils with impaired MOC function due to CS might lack this advantage. A brief discussion on this aspect could potentially enhance mechanistic insights.  

      Thank you for this suggestion. The potential role of olivocochlear efferents is now discussed in lines 597 - 613.

      Lastly, although synapse counts did not differ between the low-dose treatment and NH I sham groups, separating these groups rather than combining them with the sham might reveal differences in behavior or AN results, particularly regarding the significance of differences between aged/treatment groups and the young normal-hearing group.  

      For maximizing statistical power, we combined those groups in the statistical analysis. These two groups did not differ in synapse number, threshold sensitivity or neural tuning bandwidths.

      Reviewer #2 (Public review):

      Summary:  

      Using a gerbil model, the authors tested the hypothesis that loss of synapses between sensory hair cells and auditory nerve fibers (which may occur due to noise exposure or aging) affects behavioral discrimination of the rapid temporal fluctuations of sounds. In contrast to previous suggestions in the literature, their results do not support this hypothesis; young animals treated with a compound that reduces the number of synapses did not show impaired discrimination compared to controls. Additionally, their results from older animals showing impaired discrimination suggest that age-related changes aside from synaptopathy are responsible for the age-related decline in discrimination.

      We agree with the reviewer’s summary.

      Strengths: 

      (1) The rationale and hypothesis are well-motivated and clearly presented. 

      (2) The study was well conducted with strong methodology for the most part, and good experimental control. The combination of physiological and behavioral techniques is powerful and informative. Reducing synapse counts fairly directly using ouabain is a cleaner design than using noise exposure or age (as in other studies), since these latter modifiers have additional effects on auditory function. 

      (3) The study may have a considerable impact on the field. The findings could have important implications for our understanding of cochlear synaptopathy, one of the most highly researched and potentially impactful developments in hearing science in the past fifteen years.  

      Weaknesses: 

      (1) My main concern is that the stimuli may not have been appropriate for assessing neural temporal coding behaviorally. Human studies using the same task employed a filter center frequency that was (at least) 11 times the fundamental frequency (Marmel et al., 2015; Moore and Sek, 2009). Moore and Sek wrote: "the default (recommended) value of the centre frequency is 11F0." Here, the center frequency was only 4 or 8 times the fundamental frequency (4F0 or 8F0). Hence, relative to harmonic frequency, the harmonic spacing was considerably greater in the present study. By my calculations, the masking noise used in the present study was also considerably lower in level relative to the harmonic complex than that used in the human studies. These factors may have allowed the animals to perform the task using cues based on the pattern of activity across the neural array (excitation pattern cues), rather than cues related to temporal neural coding. The authors show that mean neural driven rate did not change with frequency shift, but I don't understand the relevance of this. It is the change in response of individual fibers with characteristic frequencies near the lowest audible harmonic that is important here.  

      The auditory filter bandwidth of the gerbil is about double that of human subjects. Because of this, the masking noise has a larger overall level within the auditory filter than in the human studies, prohibiting the use of distortion products. The larger auditory filter bandwidth also precludes the use of excitation patterns by the gerbils, especially in the condition with a center frequency of 1600 Hz and a fundamental of 200 Hz and in the condition with a center frequency of 3200 Hz and a fundamental of 400 Hz. In the condition with a center frequency of 1600 Hz and a fundamental of 400 Hz, it is possible that excitation patterns are exploited. We have now added modeling of the excitation patterns, and a new figure showing their change at the gerbils’ perception threshold, in the discussion of the revised version (lines 440 - 446 and Fig. 8).

      The case against excitation pattern cues needs to be better made in the Discussion. It could be that gerbil frequency selectivity is broad enough for this not to be an issue, but more detail needs to be provided to make this argument. The authors should consider what is the lowest audible harmonic in each case for their stimuli, given the level of each harmonic and the level of the pink noise. Even for the 8F0 center frequency, the lowest audible harmonic may be as low as the 4th (possibly even the 3rd). In human, harmonics are thought to be resolvable by the cochlea up to at least the 8th.  

      This issue is now covered in the discussion, see response to the previous point.
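
The harmonic ranks at issue in point (1) follow directly from the center frequencies and fundamentals named in the response. The sketch below only restates that arithmetic; the condition list is taken from the response above, the value 11 is the default recommended by Moore and Sek as quoted by the reviewer, and the function and variable names are illustrative.

```python
# (fundamental F0 in Hz, filter center frequency fc in Hz), as named in the response
conditions = [(200, 1600), (400, 3200), (400, 1600)]

RECOMMENDED_RANK = 11  # "the default (recommended) value of the centre frequency is 11F0"


def harmonic_rank(f0_hz: float, fc_hz: float) -> float:
    """Center frequency expressed as a multiple of the fundamental (fc / F0)."""
    return fc_hz / f0_hz


ranks = {(f0, fc): harmonic_rank(f0, fc) for f0, fc in conditions}

for (f0, fc), rank in ranks.items():
    print(f"F0 = {f0} Hz, fc = {fc} Hz -> fc = {rank:g} x F0 "
          f"(recommended for humans: {RECOMMENDED_RANK} x F0)")
```

All three conditions sit below the rank recommended for human listeners. Whether that matters for gerbils depends on their broader auditory filters, which is the argument the authors make and which this calculation does not address.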

      (2) The synapse reductions in the high ouabain and old groups were relatively small (mean of 19 synapses per hair cell compared to 23 in the young untreated group). In contrast, in some mouse models of the effects of noise exposure or age, a 50% reduction in synapses is observed, and in the human temporal bone study of Wu et al. (2021, https://doi.org/10.1523/JNEUROSCI.3238-20.2021) the age-related reduction in auditory nerve fibres was ~50% or greater for the highest age group across cochlear location. It could be simply that the synapse loss in the present study was too small to produce significant behavioral effects. Hence, although the authors provide evidence that in the gerbil model the age-related behavioral effects are not due to synaptopathy, this may not translate to other species (including human). This should be discussed in the manuscript. 

      We agree that our results apply to moderate synaptopathy, which predominantly characterizes early stages of hearing loss or aged individuals without confounding noise-induced cochlear damage. This is now discussed in lines 486 – 498.

      It would be informative to provide synapse counts separately for the animals who were tested behaviorally, to confirm that the pattern of loss across the group was the same as for the larger sample.  

      Yes, the pattern was the same for the subgroup of behaviorally tested animals. We have added this information to the revised version of the manuscript (lines 137 – 141).
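
To put the synapse counts in point (2) on the same scale as the ~50% reductions cited for mouse and human data, the group means can be converted to a fractional loss. This is a minimal sketch using only the two means quoted by the reviewer (23 and 19 synapses per hair cell); the function name is illustrative.

```python
def fractional_loss(control_mean: float, treated_mean: float) -> float:
    """Fraction of synapses lost relative to the control group mean."""
    return (control_mean - treated_mean) / control_mean


YOUNG_UNTREATED = 23.0   # mean synapses per hair cell, young untreated group
HIGH_OUABAIN_OLD = 19.0  # mean synapses per hair cell, high-ouabain and old groups

loss = fractional_loss(YOUNG_UNTREATED, HIGH_OUABAIN_OLD)
print(f"Synapse loss: {loss:.1%} (compared with ~50% in the studies cited by the reviewer)")
```

A loss of roughly one sixth is consistent with the authors' description of their results as applying to moderate synaptopathy.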

      (3) The study was not pre-registered, and there was no a priori power calculation, so there is less confidence in replicability than could have been the case. Only three old animals were used in the behavioral study, which raises concerns about the reliability of comparisons involving this group.  

      The results for the three old subjects differed significantly from those of young subjects and young ouabain-treated subjects. This indicates a sufficient statistical power, since otherwise no significant differences would be observed.

      Reviewer #3 (Public review):

      This study is a part of the ongoing series of rigorous work from this group exploring neural coding deficits in the auditory nerve, and dissociating the effects of cochlear synaptopathy from other age-related deficits. They have previously shown no evidence of phase-locking deficits in the remaining auditory nerve fibers in quiet-aged gerbils. Here, they study the effects of aging on the perception and neural coding of temporal fine structure cues in the same Mongolian gerbil model.

      They measure TFS coding in the auditory nerve using the TFS1 task which uses a combination of harmonic and tone-shifted inharmonic tones which differ primarily in their TFS cues (and not the envelope). They then follow this up with a behavioral paradigm using the TFS1 task in these gerbils. They test young normal hearing gerbils, aged gerbils, and young gerbils with cochlear synaptopathy induced using the neurotoxin ouabain to mimic synapse losses seen with age. 

      In the behavioral paradigm, they find that aging is associated with decreased performance compared to the young gerbils, whereas young gerbils with similar levels of synapse loss do not show these deficits. When looking at the auditory nerve responses, they find no differences in neural coding of TFS cues across any of the groups. However, aged gerbils show an increase in the representation of periodicity envelope cues (around f0) compared to young gerbils or those with induced synapse loss. The authors hence conclude that synapse loss by itself doesn't seem to be important for distinguishing TFS cues, and rather the behavioral deficits with age are likely having to do with the misrepresented envelope cues instead.  

      We agree with the reviewer’s summary.

      The manuscript is well written, and the data presented are robust. Some of the points below will need to be considered while interpreting the results of the study, in its current form. These considerations are addressable if deemed necessary, with some additional analysis in future versions of the manuscript. 

      Spontaneous rates - Figure S2 shows no differences in median spontaneous rates across groups. But taking the median glosses over some of the nuances there. Ouabain (in the Bourien study) famously affects low spont rates first, and to a higher degree than medium or high spont rates. This seems to be the case (qualitatively) in Figure S2 as well, with almost no units in the low spont region in the ouabain group, compared to the other groups. Looking at distributions within each spont rate category and comparing differences across the groups might reveal some of the underlying causes for these changes. Given that overall, the study reports that low-SR fibers had a higher ENV/TFS log-z-ratio, the distribution of these fibers across groups may reveal specific effects of TFS coding by group.

      As the reviewer points out, our sample from the group treated with a high concentration of ouabain showed very few low-spontaneous-rate auditory-nerve fibers, as expected from previous work. However, this was also true, e.g., for our sample from sham-operated animals, and may thus well reflect a sampling bias. We are therefore reluctant to attach much significance to these data distributions. We now point out more clearly the limitations of our auditory-nerve sample for the exploration of  interesting questions beyond our core research aim (see also response to Reviewer 1 above).  

      Threshold shifts - It is unclear from the current version if the older gerbils have changes in hearing thresholds, and whether those changes may be affecting behavioral thresholds. The behavioral stimuli appear to have been presented at a fixed sound level for both young and aged gerbils, similar to the single unit recordings. Hence, age-related differences in behavior may have been due to changes in relative sensation level. Approaches such as using hearing thresholds as covariates in the analysis will help explore if older gerbils still show behavioral deficits.  

      Unfortunately, we did not obtain behavioral thresholds that could be used here. We want to point out that the TFS1 stimuli had an overall level of 68 dB SPL, and the pink noise masker would have increased the threshold more than expected from the moderate, age-related hearing loss in quiet. Thus, the masked thresholds for all gerbil groups are likely similar and should have no effect on the behavioral results.

      Task learning in aged gerbils - It is unclear if the aged gerbils really learn the task well in two of the three TFS1 test conditions. The d' of 1, which is usually used as the criterion for learning, was not reached by the aged gerbils even for the easiest stimuli in all but one condition (Fig. 5H), and in that condition there doesn't seem to be any age-related deficit in behavioral performance (Fig. 6B). Hence, dissociating the inability to learn the task from the inability to perceive TFS1 cues in those animals becomes challenging.

      Even in the group of gerbils with the lowest sensitivity, the animals achieved an average d’ above 1 for the condition 400/1600. Furthermore, stimuli were well above threshold and audible, even when no discrimination could be observed. Finally, as explained in the methods, different stimulus conditions were interleaved in each session, providing stimuli that were easy to discriminate together with those that were difficult to discriminate. This approach ensures that the gerbils were under stimulus control, meaning properly trained to perform the task. Thus, an inability to discriminate does not indicate a lack of proper training.
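
For readers less familiar with the criterion discussed here, d' can be sketched as the difference between the z-transformed hit rate and false-alarm rate. The snippet below is a generic signal-detection illustration, not the authors' analysis code; the hit and false-alarm rates are hypothetical and chosen only to show one case above and one below the criterion of 1.

```python
from statistics import NormalDist

D_PRIME_CRITERION = 1.0  # learning criterion named in the review


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1; inv_cdf raises otherwise.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


def meets_criterion(hit_rate: float, false_alarm_rate: float) -> bool:
    """True if the session reaches the d' criterion."""
    return d_prime(hit_rate, false_alarm_rate) >= D_PRIME_CRITERION


# Hypothetical rates for an easy and a difficult stimulus condition.
easy = d_prime(0.80, 0.20)
hard = d_prime(0.60, 0.40)
print(f"easy: d' = {easy:.2f}, hard: d' = {hard:.2f}")
```

In an interleaved design such as the one described above, a d' above criterion in the easy condition within the same session is what supports the claim of stimulus control when the difficult condition falls below it.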

      Increased representation of periodicity envelope in the AN - the mechanisms for the increased representation of periodicity envelope cues are unclear. The authors point to some potential central mechanisms, but given that these are recordings from the auditory nerve, what central mechanisms these may be is unclear. If the authors are suggesting some form of efferent modulation only at the f0 frequency, no evidence for this is presented. It appears more likely that the enhancement may be due to outer hair cell dysfunction (widened tuning, distorted tonotopy). Given this increased envelope coding, the potential change in sensation level for the behavior (from the comment above), and no change in neural coding of TFS cues across any of the groups, a simpler interpretation may be: TFS coding is not affected in remaining auditory nerve fibers after age-related or ouabain-induced synapse loss, but behavioral performance is affected by outer hair cell dysfunction with age.

      A similar point was made by Reviewer #1. As indicated above, new data on threshold sensitivity and neural tuning were added in a new section of the Results which indirectly suggest that significant OHC pathologies were not a concern, neither in our young-adult, synaptopathic gerbils nor in the old gerbils.  

      Emerging evidence seems to suggest that cochlear synaptopathy and/or TFS encoding abilities might be reflected in listening effort rather than behavioral performance. Measuring some proxy of listening effort in these gerbils (like reaction time) to see if that has changed with synapse loss, especially in the young animals with induced synaptopathy, would make an interesting addition to explore perceptual deficits of TFS coding with synapse loss.  

      This is an interesting suggestion that we now explore in the revision of the manuscript. Reaction times can be used as a proxy for listening effort and were recorded for all responses. The new analysis, now reported in lines 378 - 396, compared young-adult control gerbils with young-adult gerbils that had been treated with the high concentration of ouabain. No difference in response latencies was found, indicating that listening effort did not change with synapse loss.

      Reviewer #1 (Recommendations for the authors): 

      Figure 2: The y-axis labeled as "Frequency" is potentially misleading since there are additional frequency values on the right side of the panels. It would be helpful to clarify more in the caption what these right-side frequency values represent. Additionally, the legend could be positioned more effectively for clarity.

      Thank you for your suggestion. The axis label was rephrased.

      Figure 7: This figure is a bit unclear, as it appears to show two sets of gerbil data at 1500 Hz, yet the difference between them is not explained.  

      We added the following text to the figure legend: "The higher and lower thresholds shown for the gerbil data reflect thresholds at fc of 1600 Hz for fundamentals f0 of 200 Hz and 400 Hz, respectively."

      Maybe a short description of fmax that is used in Figure 4 could help or at least point to supplementary for finding the definition.  

      We thank the reviewer for pointing out this typo/inaccuracy. The correct terminology in line with the remainder of the manuscript is “fmaxpeak”. We corrected the caption of figure 5 (previously figure 4) and added the reference pointing to figure 11 (previously figure 9), which explains the terms.

      I couldn't find information about the possible availability of data. 

      The auditory-nerve recordings reported in this paper are part of a larger study of single-unit auditory-nerve responses in gerbils, formally described and published by Heeringa (2024) Single-unit data for sensory neuroscience: Responses from the auditory nerve of young-adult and aging gerbils. Scientific Data 11:411, https://doi.org/10.1038/s41597-024-03259-3. As soon as the Version of Record is submitted, the raw single-unit data can be accessed directly through the following link: https://doi.org/10.5061/dryad.qv9s4mwn4. The data that are presented in the figures of the present manuscript and were statistically analyzed have been uploaded to the Zenodo repository (https://doi.org/10.5281/zenodo.15546625).

      Reviewer #2 (Recommendations for the authors): 

      L22. The term "hidden hearing loss" is used in many different ways in the literature, from being synonymous with cochlear synaptopathy, to being a description of any listening difficulties that are not accounted for by the audiogram (for which there are many other / older terms). The original usage was much more narrow than your definition here. It is not correct that Schaette and McAlpine defined HHL in the broad sense, as you imply. I suggest you avoid the term to prevent further confusion.  

      We eliminated the term hidden hearing loss.

      L43. SNHL is undefined.

      Thank you for catching that. The term is now spelled out.

      L64. "whether" -> "that"  

      We corrected this issue.

      L102. It would be informative to see the synapse counts (across groups) for the animals tested in the behavioral part of the study. Did these vary between groups in the same way?  

      Yes, the pattern was the same for the subgroup of behaviorally tested animals. We have added this information to the revised version of the manuscript (lines 137 – 141).

      L108. How many tests were considered in the Bonferroni correction? Did this cover all reported tests in the paper?  

      The comparisons of synapse numbers between treatment groups were done with full Bonferroni correction, as in the other tests involving posthoc pair-wise comparisons after an ANOVA.

      Figure 1 and 6 captions. Explain meaning of * and ** (criteria values).  

      The information was added to the figure legends of now Figs. 1 and 7. 

      L139. I don't follow the argument - the mean driven rate is not important. It is the rate at individual CFs and how that changes with frequency shift that provides the cue.

      L142. I don't follow - individual driven rates might have been a cue (some going up, some down, as frequency was shifted).  

      Yes, it is theoretically possible that the spectral pattern of driven rates (i.e., the excitation pattern) can be used for profile analysis and thus serve as a strong cue for discriminating the TFS1 stimuli. To shed some light on this question with regard to the actual stimuli used in this study, we added a comprehensive figure showing simulated excitation patterns (figure 8). The excitation patterns were generated with a gammatone filter bank and auditory filter bandwidths appropriate for gerbils (Kittel et al. 2002). The simulated excitation patterns allow at least semi-quantitative conclusions about the possibility of profile analysis: 1. In the 200/1600 Hz and 400/3200 Hz conditions (i.e., harmonic number of fc is 8), the difference between all inharmonic excitation patterns and the harmonic reference excitation pattern is far below the threshold for intensity discrimination (Sinnott et al. 1992). 2. In the same conditions, the statistics of the pink noise make excitation-pattern differences at or beyond the filter slopes (at both the high- and low-frequency limits) useless for frequency shift discrimination. 3. In the 400/1600 Hz condition (i.e., harmonic number of fc is 4), there is a non-negligible possibility that excitation-pattern differences were a main cue for discrimination. All of these conclusions are compatible with the results of our study.

      L193. Is this p-value Bonferroni corrected across the whole study? If not, the finding could well be spurious given the number of tests reported.  

      Yes, it is Bonferroni corrected.

      L330. TFS is already defined.  

      L346. AN is already defined.  

      L408. "temporal fine structure" -> "TFS"  

      It was a deliberate decision to define these terms again in the Discussion, for readers who prefer to skip most of the detailed Results. 

      L364-366. This argument is somewhat misleading. Cochlear resolvability largely depends on the harmonic spacing (i.e., F0) relative to harmonic frequency (in other words, on harmonic rank). Marmel et al. (2015) and Moore and Sek (2009) used a center frequency (at least) 11 times F0. Here, the center frequency was only 4 or 8 times F0. In human, this would not be sufficient to eliminate excitation pattern cues.  

      We have now included results from modeling the excitation patterns in the discussion with a new figure demonstrating that at a center frequency of 8 times F0, excitation patterns provide no useful cue while this is a possibility at  a center frequency of 4 times F0 (Fig. 8, lines 440 - 446).

      L541. Was that a spectrum level of 20 dB SPL (level per 1-Hz wide band) at 1 kHz? Need to clarify.  

      The power spectral density of the pink noise at 1 kHz (i.e., the level in a 1 Hz wide band centered at 1 kHz) was 13.3 dB SPL. The total level of the pink noise (including edge filters at 100 Hz and 11 kHz) was 50 dB SPL.
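      The two levels quoted above can be cross-checked with a short calculation. The sketch below assumes an ideal pink spectrum (power density proportional to 1/f) with brick-wall band edges at 100 Hz and 11 kHz; the real edge filters are not described here, so this is only an approximation, and the function name is illustrative.

```python
import math

def pink_noise_total_level(psd_db_at_ref, f_ref_hz, f_lo_hz, f_hi_hz):
    """Total level (dB SPL) of ideal pink noise between f_lo and f_hi.

    Assumes S(f) = S_ref * f_ref / f, so the integrated power is
    S_ref * f_ref * ln(f_hi / f_lo). Brick-wall band edges are an
    assumption of this sketch, not a description of the actual filters.
    """
    bandwidth_factor = f_ref_hz * math.log(f_hi_hz / f_lo_hz)
    return psd_db_at_ref + 10.0 * math.log10(bandwidth_factor)

# Spectrum level of 13.3 dB SPL at 1 kHz, band from 100 Hz to 11 kHz.
total_level = pink_noise_total_level(13.3, 1000.0, 100.0, 11000.0)
print(f"total level: {total_level:.1f} dB SPL")
```

      Under these assumptions the integrated level comes out very close to the stated 50 dB SPL, so the two numbers are mutually consistent.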

      L919. So was the correction applied across only the tests within each ANOVA? Don't you need to control the study-wise error rate (across all primary tests) to avoid spurious findings?  

      We added information about the family-wise error rate (line 1077 - 1078). Since the ANOVAs tested different specific research questions, we do not think that we need to control the study-wise error rate.
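      The distinction between family-wise and study-wise error control raised in this exchange can be made concrete with a small calculation. The sketch below assumes independent tests and uses a hypothetical count of 10 tests; neither the independence nor the number of tests is taken from the manuscript.

```python
def familywise_error_rate(alpha_per_test, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests

def bonferroni_threshold(alpha_family, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha_family / n_tests

n_tests = 10  # hypothetical number of tests in one family
uncorrected = familywise_error_rate(0.05, n_tests)
corrected = familywise_error_rate(bonferroni_threshold(0.05, n_tests), n_tests)
print(f"uncorrected FWER: {uncorrected:.3f}, Bonferroni FWER: {corrected:.3f}")
```

      The same functions apply whether the family is a single ANOVA or every primary test in a study; only the value of `n_tests` changes, which is the point of disagreement here.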

      Reviewer #3 (Recommendations for the authors): 

      There was no difference in TFS sensitivity in the AN fiber activity across all the groups. Potential deficits with age were only found in the behavioral paradigm. Given that, it might make it clearer to specify that the deficits or lack thereof are in behavior, in multiple instances in the manuscript where it says synaptopathy showed no decline in TFS sensitivity (For example Line 342-344).

      We carefully went through the entire text and clarified a couple more instances.

      L353 - this statement is a bit too strong. It implies causality when there is only a co-occurrence of increased f0 representation and age-related behavioral deficits in TFS1 task.  

      The statement was rephrased as “Thus, cue representation may be associated with the perceptual deficits, but not reduced synapse numbers, as originally proposed.”

      L465-467 - while this may be true, I think it is hard to say this with the current dataset where only AN fibers are being recorded from. I don't think we can say anything about afferent central mechanisms with this data set.  

      We agree. However, we refer here to published data on central inhibition to provide a possible explanation. 

      Hearing thresholds with ABRs are mentioned in the methods, but that data is not presented anywhere. Would be nice to see hearing thresholds across the various groups to account or discount outer hair cell dysfunction. 

      This important point was made repeatedly and we thank the Reviewers for it. As indicated above, new data on threshold sensitivity and neural tuning were added in a new section of the Results which indirectly suggest that significant OHC pathologies were not a concern, neither in our young-adult, synaptopathic gerbils nor in the old gerbils.

    1. Author response:

      The following is the authors’ response to the original reviews

      eLife Assessment

      This valuable study presents a theoretical model of how punctuated mutations influence multistep adaptation, supported by empirical evidence from some TCGA cancer cohorts. This solid model is noteworthy for cancer researchers as it points to the case for possible punctuated evolution rather than gradual genomic change. However, the parametrization and systematic evaluation of the theoretical framework in the context of tumor evolution remain incomplete, and alternative explanations for the empirical observations are still plausible.

      We thank the editor and the reviewers for their thorough engagement with our work. The reviewers’ comments have drawn our attention to several important points that we have addressed in the updated version. We believe that these modifications have substantially improved our paper.

      There were two major themes in the reviewers’ suggestions for improvement. The first was that we should demonstrate more concretely how the results in the theoretical/stylized modelling parts of our paper quantitatively relate to dynamics in cancer.

      To this end, we have now included a comprehensive quantification of the effect sizes of our results across large and biologically-relevant parameter ranges. Specifically, following reviewer 1’s suggestion to give more prominence to the branching process, we have added two figures (Fig S3-S4) quantifying the likelihood of multi-step adaptation in a branching process for a large range of mutation rates and birth-death ratios. Formulating our results in terms of birth-death ratios also allowed us to provide better intuition regarding how our results manifest in models with constant population size vs models of growing populations. In particular, the added figure (Fig S3) highlights that the effect size of temporal clustering on the probability of successful 2-step adaptation is very sensitive to the probability that the lineage of the first mutant would go extinct if it did not acquire a second mutation. As a result, the phenomenon we describe is biologically likely to be most effective in those phases during tumor evolution in which tumor growth is constrained. This important pattern had not been described sufficiently clearly in the initial version of our manuscript, and we thank both reviewers for their suggestions to make these improvements.

      The second major theme in the reviewers’ suggestions was focused on how we relate our theoretical findings to readouts in genomic data, with both reviewers pointing to potential alternative explanations for the empirical patterns we describe.

      We have now extended our empirical analyses following some of the reviewers’ suggestions. Specifically, we have included analyses investigating how the contribution of reactive oxygen species (ROS)-related mutation signatures correlates with our proxies for multi-step adaptation; and we have included robustness checks in which we use Spearman instead of Pearson correlations. Moreover, we have included more discussion on potential confounds and the assumptions going into our empirical analyses as well as the challenges in empirically identifying the phenomena we describe.
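      To illustrate why a Spearman-based robustness check can differ from a Pearson-based one, the sketch below computes both coefficients in plain Python. The data are synthetic and purely illustrative (a monotonic but nonlinear relationship); they are not the TCGA data, and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Synthetic example: y increases monotonically but nonlinearly with x.
x = list(range(1, 11))
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

      Because Spearman depends only on ranks, it is less sensitive to outliers and to nonlinearity, which is the usual motivation for reporting it alongside Pearson.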

      Below, we respond in detail to the individual comments made by each reviewer.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Grasper et al. present a combined analysis of the role of temporal mutagenesis in cancer, which includes both theoretical investigation and empirical analysis of point mutations in TCGA cancer patient cohorts. They find that temporally elevated mutation rates contribute to cancer fitness by allowing fast adaptation when the fitness drops (due to previous deleterious mutations). This may be relevant in the case of tumor suppressor genes (TSG), which follow the 2-hit hypothesis (i.e., biallelic 2 mutations are necessary to deactivate TS), and in cases where temporal mutagenesis occurs (e.g., high APOBEC, ROS). They provide evidence that this scenario is likely to occur in patients with some cancer types. This is an interesting and potentially important result that merits the attention of the target audience. Nonetheless, I have some questions (detailed below) regarding the design of the study, the tools and parametrization of the theoretical analysis, and the empirical analysis, which I think, if addressed, would make the paper more solid and the conclusion more substantiated.

      Strengths:

      Combined theoretical investigation with empirical analysis of cancer patients.

      Weaknesses:

      Parametrization and systematic investigation of theoretical tools and their relevance to tumor evolution.

      We sincerely thank Reviewer 1 for their comments. As communicated in more detail in the point-by-point replies to the “Recommendations for the authors”, we have revised the paper to address these comments in various ways. To summarize, Reviewer 1 asked for (1) more comprehensive analyses of the parameter space, especially in ranges of small fitness effects and low mutation rates; (2) additional clarifications on details of mechanisms described in the manuscript; and (3) suggested further robustness checks to our empirical analyses. We have addressed these points as follows: we have added detailed analyses of dynamics and effect sizes for branching processes (see Sections SI2 and SI3 in the Supplementary Information, as well as Figures S3 and S4). As suggested, these additions provide characterizations of effect sizes in biologically relevant parameter ranges (low mutation rates and smaller fitness effect sizes), and extend our descriptions to processes with dynamically changing population sizes. Moreover, we have added further clarifications at suggested points in the manuscript, e.g. to elaborate on the non-monotonicities in Fig 3. Lastly, we have undertaken robustness checks using Spearman rather than Pearson correlation coefficients to quantify relations between TSG deactivation and APOBEC signature contribution, and have performed analyses investigating dynamics of reactive oxygen species-associated mutagenesis instead of APOBEC.

      Reviewer #2 (Public review):

      This work presents theoretical results concerning the effect of punctuated mutation on multistep adaptation and empirical evidence for that effect in cancer. The empirical results seem to agree with the theoretical predictions. However, it is not clear how strong the effect should be on theoretical grounds, and there are other plausible explanations for the empirical observations.

      Thank you very much for these comments. We have now substantially expanded our investigations of the parameter space as outlined in the response to the “eLife Assessment” above and in the detailed comments below (A(1)-A(3)) to convey more quantitative intuition for the magnitude of the effects we describe for different phases of tumor evolution. We agree that there could be potential additional confounders to our empirical investigations besides the challenges regarding quantification that we already described in our initial version of the manuscript. We have thus included further discussion of these in our manuscript (see replies to B(1)-B(3)), and we have expanded our empirical analyses as outlined in the response to the “eLife Assessment”.

      For various reasons, the effect of punctuated mutation may be weaker than suggested by the theoretical and empirical analyses:

      (A1) The effect of punctuated mutation is much stronger when the first mutation of a two-step adaptation is deleterious (Figure 2). For double inactivation of a TSG, the first mutation--inactivation of one copy--would be expected to be neutral or slightly advantageous. The simulations depicted in Figure 4, which are supposed to demonstrate the expected effect for TSGs, assume that the first mutation is quite deleterious. This assumption seems inappropriate for TSGs, and perhaps the other synergistic pairs considered, and exaggerates the expected effects.

      Thank you for highlighting this discrepancy between Figure 2 and Figure 4. For computational efficiency and for illustration purposes, we had opted for high mutation rates and large fitness effects in Figure 2; however, our results are valid even in the setting of lower mutation rates and fitness effects. To improve the connection to Figure 4, and to address other related comments regarding parameter dependencies, we have now added more detailed quantification of the effects we describe (Figures SF3 and SF4) to the revised manuscript. These additions show that the effects illustrated in Figure 2 retain large effect sizes when going to much lower mutation rates and much smaller fitness effects. Indeed, while under high mutation rates we only see the large relative effects if the first mutation is highly deleterious, these large effects become more universal when going to low mutation rates.

      In general, it is correct that the selective disadvantage (or advantage) conveyed by the first mutation affects the likelihood of successful 2-step adaptations. It is also correct that the magnitude of the ‘relative effect’ of temporal clustering on valley-crossing is highest if the lineage with only the first of the two mutations is vanishingly unlikely to produce a second mutant before going extinct. If the first mutation is strongly deleterious, the lineage of such a first mutant is likely to quickly go extinct – and therefore also more likely to do so before producing a second mutant.

      However, this likelihood of producing the second mutant is also low if the mutation rate is low. As our added figure (Figure SF3) illustrates, at low mutation rates appropriate for cancer cells, the relative effect is insensitive to the magnitude of the fitness disadvantage for large parts of the parameter space. Especially in populations of constant size (approximated by a birth/death ratio of 1), the relative effects for first mutations that reduce the birth rate by 0.5 or by 0.05 are indistinguishable (Figure SF3f).
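      The qualitative argument above can be explored with a toy calculation. The sketch below is not the model from the manuscript: it assumes a simple branching process in which each cell either divides or dies, each daughter independently acquires the second mutation with probability u per division, and a burst simply multiplies u by k for the whole lifetime of the first-mutant lineage. All function names and parameter values are illustrative assumptions.

```python
import math

def second_mutant_probability(birth, death, u):
    """P(a lineage founded by one first-mutant cell ever yields a second mutant).

    Toy model (an assumption of this sketch): the next event for a cell is a
    division with probability birth / (birth + death), otherwise a death, and
    each daughter mutates with probability u.
    """
    if u <= 0.0:
        return 0.0
    p_div = birth / (birth + death)
    p_die = 1.0 - p_div
    # q = P(lineage goes extinct without a second mutant) is the smallest
    # root of q = p_die + p_div * ((1 - u) * q) ** 2.
    a = p_div * (1.0 - u) ** 2
    if a == 0.0:
        return 1.0 - p_die
    disc = max(0.0, 1.0 - 4.0 * a * p_die)
    q = (1.0 - math.sqrt(disc)) / (2.0 * a)
    return 1.0 - q

def toy_relative_effect(birth, death, mu, k):
    """Ratio of success probabilities at mutation rate k * mu versus mu."""
    return (second_mutant_probability(birth, death, k * mu)
            / second_mutant_probability(birth, death, mu))

k = 100  # hypothetical burst intensity
# First mutation reduces the birth rate by 0.5 or by 0.05 (death rate 1).
low_strong = toy_relative_effect(0.5, 1.0, 1e-6, k)
low_weak = toy_relative_effect(0.95, 1.0, 1e-6, k)
high_strong = toy_relative_effect(0.5, 1.0, 1e-3, k)
high_weak = toy_relative_effect(0.95, 1.0, 1e-3, k)
print(low_strong, low_weak, high_strong, high_weak)
```

      In this toy setting the ratio stays close to k for both fitness costs at the low mutation rate, while at the high mutation rate it is much larger for the strongly deleterious first mutation, in line with the qualitative pattern described in the response.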

      Moreover, the absolute effect, as we discuss in the paper (Figures SF2 and SF3), is largest not in regions of the parameter space in which the first mutant is infinitesimally unlikely to produce a second mutant (where $f_k$ and $f_1$ would be infinitesimally small), but rather in parameter regions in which this first mutant has a non-negligible chance to produce a second mutant. The absolute effect therefore peaks around fitness-neutral first mutations. While the next comment (below) says that our empirical investigations more closely resemble comparisons of relative effects and not absolute effects, we would expect that the observations in our data come preferentially from multi-step adaptations with large absolute effect, since the absolute effect is maximal when both $f_k$ and $f_1$ are relatively high.

      In summary, we believe that Figure 2, although it uses exaggerated parameters for defensible reasons, is not a misleading illustration of the general phenomenon or of its applicability in biological settings, as effect sizes remain large when moving to biologically realistic parameter ranges. To clarify this issue, we have largely rewritten the relevant paragraphs in the results section and have added two additional figures (Figures SF3 and SF4) as well as a section in the SI with detailed discussion (SI2).

      (A2) More generally, parameter values affect the magnitude of the effect. The authors note, for example, that the relative effect decreases with mutation rate. They suggest that the absolute effect, which increases, is more important, but the relative effect seems more relevant and is what is assessed empirically.

      Thank you for this comment. As noted in the replies to the above comments, we have now included extensive investigations of how sensitive effect sizes are to different parameter choices. We also apologize for insufficiently clearly communicating how the quantities in Figure 4 relate to the findings of our theoretical models.

      The challenge in relating our results to single-timepoint sequencing data is that we only observe the mutations that a tumor has acquired, but we do not directly observe the mutation rate histories that brought about these mutations. As an alternative readout, we therefore consider (through rough proxies: TSGs and APOBEC signatures) the amount of 2-step adaptations per acquired/retained mutation. While we unfortunately cannot control for the average mutation rate in a sample, we motivate using this “TSG-deactivation score” by the hypothesis that for any given mutation rate, we expect a positive relationship between the amount of temporal clustering and the amount of 2-step adaptations per acquired/retained mutation. This hypothesis follows directly from our theoretical model, where it formally translates to the statement that, for a fixed average mutation rate μ, the amount of 2-step adaptations per acquired mutation is increasing in the degree of temporal clustering 𝑘.

      However, while both quantities from our theoretical model, the relative effect $f_k/f_1$ and the absolute effect, relate to this hypothesis (both are increasing in 𝑘), neither of them maps directly onto the formulation of our empirical hypothesis.

      We have now rewritten the relevant passages of the manuscript to more clearly convey our motivation for constructing our TSG deactivation score in this form (P. 4-6).

      (A3) Routes to inactivation of both copies of a TSG that are not accelerated by punctuation will dilute any effects of punctuation. An example is a single somatic mutation followed by loss of heterozygosity. Such mechanisms are not included in the theoretical analysis nor assessed empirically. If, for example, 90% of double inactivations were the result of such mechanisms with a constant mutation rate, a factor of two effect of punctuated mutagenesis would increase the overall rate by only 10%. Consideration of the rate of apparent inactivation of just one TSG copy and of deletion of both copies would shed some light on the importance of this consideration.

      This is a very good point, thank you. In our empirical analyses, the main motivation was to investigate whether we would observe patterns that are qualitatively consistent with our theoretical predictions, i.e. whether we would find positive associations between valley-crossing and temporal clustering. Our aim in the empirical analyses was not to provide a quantitative estimate of how strongly temporally clustered mutation processes affect mutation accumulation in human cancers. We hence restricted attention to only one mutation process which is well characterized to be temporally clustered (APOBEC mutagenesis) and to only one category of (epi)genomic changes (SNPs, in which APOBEC signatures are well characterized). Of course, such an analysis ignores that other mutation processes (e.g. LOH, copy number changes, methylation in promoter regions, etc.) may interact with the mechanisms that we consider in deactivating Tumor suppressor genes.

      We have now updated the text to include further discussion of this limitation and further elaboration to convey that our empirical analyses are not intended as a complete quantification of the effect of temporal clustering on mutagenesis in-vivo (P. 10,11).
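      The dilution arithmetic in comment (A3) can be written out explicitly. The sketch below uses the reviewer's illustrative numbers (90% of double inactivations via routes unaffected by punctuation, and a twofold effect on the remainder); the function name is hypothetical and the numbers are not estimates from data.

```python
def overall_fold_change(fraction_unaffected, fold_effect):
    """Overall change in the rate of double inactivation when only part of
    the inactivation routes is accelerated by punctuated mutagenesis."""
    fraction_affected = 1.0 - fraction_unaffected
    return fraction_unaffected + fraction_affected * fold_effect

# Reviewer's example: 90% unaffected routes, twofold effect on the rest.
overall = overall_fold_change(0.9, 2.0)
print(f"overall rate changes by a factor of {overall:.2f}")
```

      With these inputs the overall rate rises by roughly 10%, matching the figure quoted in the comment; the larger the share of unaffected routes such as loss of heterozygosity, the closer the factor stays to 1.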

      Several factors besides the effects of punctuated mutation might explain or contribute to the empirical observations:

      (B1) High APOBEC3 activity can select for inactivation of TSGs (references in Butler and Banday 2023, PMID 36978147). This selective force is another plausible explanation for the empirical observations.

      Thank you for making this point. We agree that increased APOBEC3 activity, or any other similar perturbation, can change the fitness effect that any further changes/perturbations to the cell would bring about. Our empirical analyses therefore rely on the assumption that there are no major confounding structural differences in selection pressures between tumors with different levels of APOBEC signature contributions. We have expanded our discussion section to elaborate on this potential limitation (P. 10-11).

      While the hypothesis that APOBEC3 activity selects for inactivation of TSGs has been suggested, there remain other explanations. Either way, the ways in which selective pressures have been suggested to change would not interfere relevantly with the effects we describe. The paper cited in the comment argues that “high APOBEC3 activity may generate a selective pressure favoring” TSG mutations as “APOBEC creates a high [mutation] burden, so cells with impaired DNA damage response (DDR) due to tumor suppressor mutations are more likely to avert apoptosis and continue proliferating”. To motivate this reasoning, in the same passage, the authors cite a high prevalence of TP53 mutations across several cancer types with “high burden of APOBEC3-induced mutations”, but also note that “this trend could arise from higher APOBEC3 expression in p53-mutated tumors since p53 may suppress APOBEC3B transcription via p21 and DREAM proteins”.

      Translated to our theoretical framework, this reasoning builds on the idea that APOBEC3 activity increases the selective advantage of mutants with inactivation of both copies of a TSG. In contrast, the mechanism we describe acts by altering the chances of mutants with only one TSG allele inactivated to inactivate the second allele before going extinct. If homozygous inactivation of TSGs generally conveys relatively strong fitness advantages, lineages with homozygous inactivation would already be unlikely to go extinct. Further increasing the fitness advantage of such lineages would thus manifest mostly in a quicker spread of these lineages, rather than in changes in the chance that these lineages survive. In turn, such a change would have limited effect on the “rate” at which such 2-step adaptations occur, but would mostly affect the speed at which they fixate. It would be interesting to investigate these effects empirically by quantifying the speed of proliferation and chance of going extinct for lineages that newly acquired inactivating mutations in TSGs.

      Beyond this explicit mention of selection pressures, the cited paper also discusses high occurrences of mutations in TSGs in relation to APOBEC. These enrichments, however, are not uniquely explained by an APOBEC-driven change in selection pressures. Indeed, our analyses would also predict such enrichments.

      (B2) Without punctuation, the rate of multistep adaptation is expected to rise more than linearly with mutation rate. Thus, if APOBEC signatures are correlated with a high mutation rate due to the action of APOBEC, this alone could explain the correlation with TSG inactivation.

      Thank you for making this point. Indeed, an identifying assumption that we make is that average mutation rates are balanced between samples with a higher vs lower APOBEC signature contribution. We cannot cleanly test this assumption, as we only observe aggregate mutation counts but not mutation rates. However, the fact that we observe an enrichment for APOBEC-associated mutations among the set of TSG-inactivating mutations (see Figure 4F) would be consistent with APOBEC-mutations driving the correlations in Fig 4D, rather than just average mutation rates. We have now added a paragraph to our manuscript to discuss these points (P. 10-11).

      (B3) The nature of mutations caused by APOBEC might explain the results. Notably, one of the two APOBEC mutation signatures, SBS13, is particularly likely to produce nonsense mutations. The authors count both nonsense and missense mutations, but nonsense mutations are more likely to inactivate the gene, and hence to be selected.

      Thank you for making this point.  We have included it in our discussion of potential confounders/limitations in the revised manuscript (P. 10-11).  

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      Specific questions/comments/suggestions:

      (1) For the theoretical investigation, the authors use the Wright-Fisher model with specific parameters for the decrease/increase in the fitness (0.5,1.5). This model is not so relevant to cancer, because it assumes a constant population size, while in cancer, the population is dynamic (increasing, if the tumor grows). Although I see they mention relevance to the branching process (in SI), I think the branching process should be bold in the main text and the Wright-Fisher in SI (or even dropped).

      Thank you for this comment. We agree that too little attention had been given to the branching process in the original version of our manuscript. While the Wright-Fisher process is computationally efficient to simulate and thus lends itself to clean simulations for illustrative examples, it did lead us to put undue emphasis on populations of constant size.

      The added Figures SF2 and SF3 now focus on branching processes, and we have substantially expanded our discussion of how dynamics differ as a function of the population-size trajectory (constant vs growing; SI2, P. 4,9,10). Generally, we do believe that it is appropriate to consider both regimes. If tumors evolve from being confined within their site of origin to progressively invading adjacent tissues and organ compartments, they traverse different regions of the birth-death ratio parameter space. Moreover, the timing of transitions between phases of more or less constrained growth is likely closely tied to adaptation dynamics, since breaching barriers to expansion requires adapting to novel environments and selection pressures.

      We hope that the revised version of the manuscript conveys these points more clearly, and thank you for alerting us to this imbalance in the original version of our manuscript.

      (2) The parameters 0.5 (decrease in fitness) and 1.5 (increase in fitness) seem exaggerated; the typical values for the selective advantage are usually much lower (by an order of magnitude). The same goes for the mutation rate. The authors chose values of the order 0.001, while in cancer (and generally) it is much lower than that (10^-5 to 10^-6). I think that generally, the authors should present a more systematic analysis of the sensitivity of the results to these parameters.

      Thank you very much for this very important comment. We have made this a major focus in our revisions (see our reply to the editor’s comments). As suggested, we have now added further analyses to explore more biologically relevant parameter regimes. Reviewer 2 has made a similar remark, and to avoid redundancies, we refer to our response to that comment (A1) for more detail.

      (3) In Figure 3, the authors explore the sensitivity to mu (mutation rate) and k (temporal clustering) and find a non-monotonic behavior (Figure 3C). However, this behavior is not well explained. I think some more explanations are required here.

      Thank you for pointing this out. We had initially relegated the more detailed explanations to the SI2 (which in the revised manuscript became SI4), but are happy to provide more elaboration in the main text, and have done so now (P. 5).

      For μ, the non-monotonicity reflects the exploration-exploitation tradeoff that this section is dedicated to: very small μ values (little exploration) prevent the population from finding fitness peaks. In contrast, once a fitness peak is reached, excessively large μ values (little exploitation) scatter the population away from this peak to points of lower fitness.

      For 𝑘, the most relevant dynamic is that at high 𝑘, the population becomes unable to find close-by fitness improvements (1-step adaptations) if it is not in a burst. As 𝑘 increases, this delay in adaptation (until a burst occurs) eventually comes to outweigh the benefits of high 𝑘 (better ability to undergo multi-step adaptations). Additionally, if 𝑘 ∙ μ becomes very large, clonal interference eventually leads to diminishing exploration-returns when 𝑘 is increased further (Fig 5C), as the per-cell likelihood of finding a specific fitness peak eventually saturates and increasing 𝑘 only causes multiple cells to find the same peak, rather than one cell finding this peak and its lineage fixating in the population.

      (4) In Figure 5, where the authors show the accumulation of the first (red; deleterious mutation) and second (blue; advantageous mutation), it seems that the fraction of deleterious mutations is much lower than that of advantageous mutations. This is opposite to the case of cancer, where most of the mutations are 'passengers', (slightly) deleterious or neutral mutations. Can the author explain this discrepancy and generally the relation of their parametrization to deleterious vs. advantageous mutations?

      Thank you for this comment. In general, we have focused attention in our paper on sequences of mutations that bring about a fitness increase. We call those sequences ‘adaptations’ and categorize these as one-step or multi-step, depending on whether or not they contain intermediate states with a fitness disadvantage.

      In our modelling, we do not consider mutations that are simply deleterious and are not a necessary part of a multi-step adaptation sequence. The motivation for this abstraction is, firstly, to focus on adaptation dynamics, and secondly, that in certain limits (small mu and large constant population sizes), lineages with only deleterious mutations have a probability close to one of going extinct, so that any emerging deleterious mutant would likely be ‘washed out’ of the population before a new mutation emerges.

      However, whether the dynamics of how neutral or deleterious passenger mutations are acquired also vary relevantly with the extent of temporal clustering is a valid and interesting question that would warrant its own study. The types of theoretical arguments for such an investigation would be very similar to the ones we use in our paper.

      (5) The theoretical investigation assumes a multi/2-step adaptation scenario where the first mutation is deleterious and the second is advantageous. I think this should be generalized and further explored. For example, what happens when there are multiple mutations that are slightly deleterious (as probably is the case in cancer) and only much later mutations confer a selective advantage? How stable is the "valley crossing" if more deleterious mutations occur after the 2 steps?

      This is also an important point and relates in part to the previous comment (4).  For discussion of interactions with deleterious mutations, please see the reply to comment (4).  

      Regarding generalizations of this valley-crossing scenario, note that any sequence of mutations that increases fitness can be decomposed into sequences of either one-step or multi-step adaptations, as defined  in the paper. Therefore, if all intermediate states before the final selectively advantageous state have a selective disadvantage making the lineages of such cells likely to go extinct, then our derivations in S1 apply, and the relative effect of temporal clustering becomes where n is the number of intermediate states. If, conversely, any of the intermediate states already had a selective advantage, then our model would consider the subsequence until this first mutation with a selective advantage as its individual (one-step or multi-step) “adaptation”.

      The second question, “How stable is the "valley crossing" if more deleterious mutations occur after the 2 steps?”, touches on a different property of the population dynamics, namely on how the fate of a mutant lineage depends on how this lineage emerged. In our paper, we compare different levels of temporal clustering for a fixed average mutation rate. This choice implies that, if we assume that the mutant that emerges from a valley-crossing does not go extinct, then the number of deleterious mutations expected to occur in this lineage, once emerged, will not depend on the extent of temporal clustering. However, if in-burst mutation rates increased the expected burden of early acquired deleterious mutations sufficiently much to affect the probability that the lineage with a multi-step adaptation goes extinct before the burst ends, then there may indeed be an interaction between effects of deleterious passengers and temporal clustering. We would, however, expect effects on this probability of early extinction to be relatively minor, since such a lineage with a selective advantage would quickly grow to large cell-numbers implying that it would require a large number of co-occurring and sufficiently deleterious mutations across these cells for the lineage to go extinct.
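      To illustrate what comparing different levels of temporal clustering at a fixed average mutation rate can mean in practice, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a stylized schedule in which bursts run at rate 𝑘 ∙ μ, the out-of-burst rate is exactly zero, and one burst occurs per period of 𝑘 times the burst length. The function name and all parameter values are hypothetical.

```python
def burst_rate_trajectory(mu, k, burst_length, n_generations):
    """Per-generation mutation rates for a stylized bursty schedule.

    Assumptions (not taken from the manuscript): the in-burst rate is
    k * mu, the out-of-burst rate is zero, and one burst of
    `burst_length` generations occurs in every period of
    k * burst_length generations. k is a positive integer.
    """
    period = k * burst_length  # one burst per period keeps the mean at mu
    return [
        k * mu if (t % period) < burst_length else 0.0
        for t in range(n_generations)
    ]


# Hypothetical values: average rate 1e-3, clustering k = 5, bursts of 2 generations.
trajectory = burst_rate_trajectory(mu=1e-3, k=5, burst_length=2, n_generations=100)
average_rate = sum(trajectory) / len(trajectory)
peak_rate = max(trajectory)
```

      Over a whole number of periods, `average_rate` equals `mu` for any `k`, while `peak_rate` scales with `k`. This is the sense in which the total number of mutations expected in an established lineage does not depend on the extent of clustering in this toy schedule.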

      (6) For the empirical analysis of TCGA cohorts, the authors focus on the contribution of APOBEC mutations (via signature analysis) to temporal mutagenesis. They find only a few cancer types (Figure 4D) that follow their prediction (in Figure 4C) of a correlation between TSG deactivation and temporal mutations in bursts. I think two main points should be addressed:

      Thank you for this comment. We will respond in detail to the corresponding points below, but would like to note here that while we find this correlation “in only a few cancer types”, we also show that only few cancer types have relevant proportions of mutations caused by APOBEC, and it is precisely in these cancer types that we find a correlation.  We have clarified this aspect in the revised version of the manuscript (P.7).

      (i) APOBEC is not the only cause for temporal mutagenesis. For example, elevated ROS and hypoxia are also potential contributors - it might therefore be important to extend the signature analysis (to include more possible sources for temporal mutagenesis). Potentially, such an extension may show that more cancer types follow the author's prediction.

      Thank you for this interesting suggestion. We have now included analogous analyses for contributions of signature SBS18 which is associated with ROS mutagenesis, and for the joint contribution of signatures SBS17a, SBS17b, SBS18 and SBS36, which all have been shown (some in a more context-dependent manner) to be associated with ROS mutagenesis. When doing so, we do not find a clear trend. However, we also do not find these signatures to account for substantial proportions of the acquired mutations, meaning that ROS mutagenesis likely also does not account for much of the variation in how temporally clustered the mutation rate trajectories of different tumors are. We have incorporated these results and their discussion in the manuscript (SI5 and Fig S8).

      (ii) The TSG deactivation score used by the authors only counts the number of mutations and does not consider if the 2 mutations are biallelic, which is highly important in this case. There are ways to investigate the specific allele of mutations in TCGA data (for example, see Ciani et al. Cell Sys 2022 PMID: 34731645). Given the focus on TSG of this study, I think it is important to account for this in the analysis.

      Thank you for making this point. We did initially consider inferring allele-specific mutation status, but decided against it as this would have shrunk our dataset substantially, thus potentially introducing unwanted biases. Determining whether two mutations lie on the same or on different alleles requires either (1) observing sequencing reads that cover the loci of both mutations, or (2) tracing whether (sets of) other SNPs on the same gene co-occur exclusively with one of the two considered mutations. These requirements lead to a substantial filtering of the observed mutations. Moreover, this filtering would be especially strong for tumors with a small overall mutation burden, as these would have fewer co-occurring SNPs to leverage in this inference. We would have hence preferentially filtered out TSG-deactivating mutations in tumors with low mutation burden. We have modified the text to address this point (P.14).

      (7) To continue point 4. I wonder why some known cancer types with high APOBEC signatures (e.g., lung, mentioned in the introduction) do not appear in the results of Figure 4. Can the author explain why it is missed?

      We do provide complete results for all categories in Supplementary Figure 3. To not overwhelm the figure in the main text, we only show the four categories with the highest average APOBEC signature contribution; beyond those four, average APOBEC signature contributions quickly drop. Lung-related categories do not feature in these top four (Lung squamous cell carcinoma is fifth and Lung adenocarcinoma is eighth in this ordering).

      Minors:

      (1) It is worth mentioning the relevance to resistance to treatment (see https://www.nature.com/articles/s41588-025-02187-1).

      Thank you for this suggestion. We have included a mention of the relation to this paper in the discussion section (P. 11).

      (2) Some of the figures' resolution should be improved - specifically, Figures 4, S1, and S5, which are not clear/readable.

      Thank you for pointing this out. This was the result of conversion to a Word document. We will provide TIFF files in the revisions to have better resolution.

      (3) Regarding Figure 3e,f. How come that moving from K=1 to K=I doesn't show any changes in fitness - it looks as if in both cases the value fluctuates around comparable mean fitness? Is that the case?

      While fitness differences between simulations with different k manifest robustly over long time-horizons (see Fig 3C with results over  generations), there are various sources of substantial stochasticity that make the fitness values in these short-term plots (Fig3D-F) imperfect illustrations of how long-term average fitness behaves. For instance, fitness landscapes are drawn randomly which introduces variability in how high and how close-by different fitness peaks are. Similarly, there is substantial randomness since both the type (direction on the 2-D fitness landscape) and the timing of mutation are stochastic.

      The short-term plots in Fig3D-F are intended to showcase representative dynamics of transitions between points on the genotype space with different fitness values following a redrawing of the landscape – but not necessarily to provide a comparison between the height of the attained (local) fitness-maxima.  

      (4) Figures 4c,d - correlation should be Spearman, not Pearson (it's not a linear relationship).

      Thank you for this comment. As a robustness check, we have generated the same figures using Spearman and not Pearson correlations and find results that are qualitatively consistent with the initially shown results. Indeed, using Spearman correlations, all four cancer types from Fig 4D have significant correlations.
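      The distinction between the two correlation measures can be made concrete with a small, self-contained sketch. The data below are hypothetical and are not TCGA values; the helper names are illustrative. Spearman correlation is computed here as the Pearson correlation of average ranks, which is its standard definition.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical monotone but non-linear relationship (y = x^3).
x = [0.05, 0.10, 0.20, 0.40, 0.80]
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

      For a strictly increasing but non-linear relationship such as this one, the rank-based coefficient is 1 while the Pearson coefficient is smaller, which is why the reviewer's suggestion is a natural robustness check.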

      (5) Typo for E) "...in samples of the cancer types in (C) were caused by APOBEC" - it should be D (not C) I guess.

      Thank you for catching this. We fixed the typo.

      (6) Figure 5 - the mutation rate is too high (0.001), sensitivity to that? Also the fitness change is exaggerated (0.5, 1.5), and the division of mutations to 100 and 100 (200 in total) loci is not clear.

      Thank you for making this point. In this simulation setting it is unfortunately computationally prohibitively expensive to perform simulations at biologically realistic mutation rates. Therefore, we have scaled up the mutation rate while scaling down the population size. Moreover, the choice of model here is not meant to resemble a biologically realistic dynamic, but rather to create a stylized setting to be able to consider the interplay between clonal interference and facilitated valley-crossing in isolation. The key result from this figure is the separation of time scales at which low or high temporal clustering maximizes adaptability.

      However, known parameter dependencies in these models allow us to reason about how tuning individual parameters of this stylized model would affect the relative importance of effects of clonal interference. This relative importance is largest when mutants are likely to co-occur on different competing clones in a population. The likelihood of such co-occurrences decreases substantially when the mutation rate is decreased to biologically realistic values. However, this likelihood also depends sensitively on the time that it takes a clone with a one-step adaptation to spread through the population. Smaller fitness advantages, as well as larger population sizes, slow down this process of taking over the population, which increases the likelihood of clonal interference. We now discuss these points in our revised manuscript (P. 8).

      (7) In the results text (last section) "Performing simulations for 2-step adaptations, we found that fixation rates are non-monotone in k. While at low k increasing k leads to a steep increase in the fixation rate, this trend eventually levels off and becomes negative, with further increases in k leading to a decrease in the fixation rate". Where are the results of this? It should be bold and apparent.

      Thank you for alerting us that this is unclear. The relevant figure reference is indeed Fig 5C as in the preceding passage in the manuscript. However, we noticed that due to the presence of the steadily decreasing black line for 1-step adaptations, it is not easy to see that also the blue line is downward sloping. We have added a further reference to Fig 5C, and have adapted the grid spacing in the background of that figure-panel to make this trend more easily visible.

      (8) Although not inconceivable, conclusions regarding resistance in the discussion are overstated. If you want to make this statement, you need to show that in resistant tumors, the temporal mutagenesis is responsible for progression vs. non-resistant/sensitive cases (is that the case), otherwise this should be toned down.

      Thank you for pointing this out. We have tempered these conclusions in the revised version of the manuscript (P. 11).

      Reviewer #2 (Recommendations for the authors):

      (1) It might be useful to look specifically at X-linked TSGs. On the authors' interpretation, their relative inactivation rates should not be correlated with APOBEC signatures in males (but should be in females), though the size of the dataset may preclude any definite conclusions.

      Thank you for this suggestion. Indeed, the size of the dataset unfortunately makes such analyses infeasible. Moreover, it is not clear whether X-linked TSGs might have structurally different fitness dynamics than TSGs on other chromosomes. However, this is an interesting suggestion worth following up on as more synergistic pairs confined to the X-chromosome are getting identified.

      (2) Might there be value in distinguishing tumors that carry mutations expected to increase APOBEC expression from those that do not? Among several reasons, an APOBEC signature due to such a mutation and an APOBEC signature due to abortive viral infection may differ with respect to the degree of punctuation.

      This is also an interesting suggestion for future investigations, but we unfortunately do not have sufficient information to build a meaningful analysis. In particular, it is unclear to what extent the degree and manifestation of episodicity/punctuation varies between these different mechanisms. Burst duration and intensity, as well as out-of-burst baseline rates of APOBEC mutagenesis, likely differ in ways that are as yet insufficiently characterized, which would make any result of analyses like those in Fig 4 hard to interpret.

      (3) Also, in that paragraph, is "proportional to" used loosely to mean "an increasing function of"?

      Thank you for this comment. We are not quite sure which paragraph is meant, but we use the term “proportional” in a literal sense at every point it is mentioned in the paper.

      For the occurrences of the term on pages 3, 10 and 11, the word is used in reference to probabilities of reproduction (division in the branching process, or ‘being drawn to populate a spot in the next generation’ in the WF process) being “proportional” to fitness. These probabilities are constructed by dividing each individual cell’s fitness by the total fitness summed across all cells in the population. As the population acquires fitness-enhancing mutations, the resulting proportionality constant (1/total_fitness) changes, so that the mapping from ‘fitness’ to probability of reproduction in the next reproduction event changes over time. Nevertheless, this mapping always remains fitness-proportional.
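      The construction described above can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: the function names and the fitness values are hypothetical, and the Wright-Fisher step assumes a constant population size.

```python
import random


def reproduction_probabilities(fitnesses):
    """Each cell's fitness divided by the total fitness in the population."""
    total_fitness = sum(fitnesses)
    return [f / total_fitness for f in fitnesses]


def wright_fisher_step(fitnesses, rng):
    """Fill every spot in the next generation by fitness-proportional draws."""
    return rng.choices(fitnesses, weights=fitnesses, k=len(fitnesses))


# Hypothetical population: two wild-type cells (fitness 1.0), one adapted cell (1.5).
early = reproduction_probabilities([1.0, 1.0, 1.5])
# Later, once every cell carries the adaptation, the constant 1/total_fitness differs.
late = reproduction_probabilities([1.5, 1.5, 1.5])

next_generation = wright_fisher_step([1.0, 1.0, 1.5], random.Random(0))
```

      In `early`, the adapted cell reproduces with probability 1.5/3.5, whereas in `late` the same fitness value maps to 1/3. The ratio of any two probabilities within a generation still equals the ratio of the corresponding fitness values, which is the literal sense of "proportional" used here.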

      On page 4, the term is used as follows: “the absolute rates 𝑓<sub>𝑘</sub> and 𝑓<sub>1</sub> are proportional to µ<sup>n+1</sup>”. Here, proportionality in the literal sense follows from the equations on page 20, when setting , so that the second factor becomes µ<sup>n+1</sup>. We have included a clarifying sentence to address this in the derivations (SI1).

      (4) It could be mentioned in the main text that the time between bursts (d) must not be too short in order for the effect to be substantial. I would think that the relevant timescale depends on how deleterious the initial mutation is.

      Thank you for making this interesting and very relevant point. We have included a section (SI3) and Figure (Fig S4) in the supplement to investigate the dependence on d. In short, we find that effects are weaker for small inter-burst intervals. The sensitivity to the burst size is highest for inter-burst intervals that are sufficiently small so that the lineage of the first mutant has relevant probability of surviving long enough to experience multiple burst phases.

      (5) Why not report that relative rate for Figure 2E as for 2D, as the former would seem to be more relevant to TSGs? And why was it assumed that the first inactivation is deleterious in the simulations in Figure 4 if the goal is to model TSGs?

      Thank you for noting this. For how we revised the paper to better connect Figures 2 and 4, please see our comment (A1) above. In general, neither 2E nor 2D should serve as quantitative predictions for what effect size we should expect in real-world data; rather, they are curated illustrations of the general phenomenon that we describe: we chose high mutation rates and exaggerated fitness effects so that dynamics become visually tractable in small simulation examples.

      For figure 4, assuming that the first inactivation is deleterious achieves that the branching process for the mutant lineage becomes subcritical, which keeps the simulation example simple and illustrative. For more comprehensive motivation of the approach in 4D, and especially the discussion of how fitness effects of different magnitudes may or may not be subject to the effects we describe depending on whether the population is in a phase of constant or growing population size, we refer the reader to our added section SI2, and the added discussion on pages 6 and 10.
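      Why a subcritical mutant lineage keeps the example simple can be seen from a minimal Galton-Watson sketch. The model below is an assumption made for illustration, not the simulation used in the manuscript: at each event a cell divides into two with probability `p_divide` and dies otherwise. The extinction probability of a lineage started by one cell is then the smallest non-negative solution of q = (1 - p) + p q², obtained here by fixed-point iteration from q = 0.

```python
def extinction_probability(p_divide, iterations=500):
    """Extinction probability of a lineage founded by a single cell.

    Assumed offspring law (hypothetical, for illustration): two daughter
    cells with probability p_divide, zero with probability 1 - p_divide.
    Iterating the offspring generating function from q = 0 converges to
    the smallest non-negative fixed point.
    """
    q = 0.0
    for _ in range(iterations):
        q = (1.0 - p_divide) + p_divide * q * q
    return q


# Deleterious first mutation: mean offspring number 2 * 0.4 = 0.8 < 1 (subcritical).
q_subcritical = extinction_probability(0.4)
# Advantageous mutant: mean offspring number 2 * 0.6 = 1.2 > 1 (supercritical).
q_supercritical = extinction_probability(0.6)
```

      In the subcritical case the lineage goes extinct with probability one, so an intermediate mutant can only matter if the next mutation arrives before its lineage is lost. In the supercritical case the extinction probability is (1 - p)/p, which is below one.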

      (6) Figure 2, D and E. I'm not sure why heatmaps with height one were provided rather than simple plots over time. It is difficult, for example, to determine from a heatmap whether the increase is linear or the relative rates with and without punctuation.

      Thank you for this comment. These are not heatmaps with height one, but rather for every column of pixels, different segments of that column correspond to different clones within that population. This approach is intended to convey the difference in dynamics between the results in Fig 2 and the analogous results for a branching process in Fig S1. In Fig 2, valley-crossings happen sequentially, with subsequent fixations of adapted mutants. In Fig S1, with a growing population size, multiple clones with different numbers of adaptations coexist. We have now adapted the caption of Fig 2 to clarify this point.

      (7) Page 3: "High mutation rates are known to limit the rate of 1-step adaptations due to clonal interference." This is a bit misleading, as it makes it sound like increasing the mutation rate decreases the rate of one-step adaptations.

      Thank you for alerting us to this poor phrasing. We have changed it in the revised version of the manuscript (P. 3).

      (8) Page 4: "proportional to \mu^{n+1}" Is "proportional" being used loosely for "an increasing function of"?

      It is meant in the literal mathematical sense (see response to comment (3)).

      (9) Page 5, near bottom: "at least two mutations across the population". In the same genome?

      We counted mutations irrespective of whether they emerged in the same genome, to remain analogous to the TCGA analyses for which we also do not have single cell-resolved information.

      (10) Page 6: "missense or nonsense mutation". What about indels? If these are not affected by APOBEC, omitting them will exaggerate the effect of punctuation.

      Thank you for pointing out that this focus on single nucleotide substitutions conveys an exaggerated image of the importance of this effect of APOBEC-driven mutagenesis. There are of course several other classes of (epi)genomic alterations (e.g. chromatin modifications, methylation changes, copy number changes) that we do not consider in this part of our analysis. APOBEC mutagenesis serves as an example of a temporally clustered mutation process, which we investigate in its domain of action.

      We have added further discussion (P. 10-11) to convey that our empirical results merely constitute an investigation of whether empirical patterns are consistent with our hypothesis, but that the narrow focus on only SNVs, only TSGs, and only APOBEC mutagenesis does not allow for a general quantitative statement about the in-vivo relevance of the phenomena we describe.

      (11) Page 6: "normalized by the total number of single nucleotide substitutions." It is difficult to know how to normalize correctly, but I might think that the square of the number of substitutions would be more appropriate. Perhaps the total numbers are close enough that it matters little.

      Thank you for noting this. In the revised manuscript we have now expanded this passage in the text to more clearly convey our motivations for why we normalize by the total number of single nucleotide substitutions. While the likelihood for crossing a fitness valley with 2 mutations is indeed proportional to the square of the mutation rate, we do not directly observe mutation rates from our data.  Rather, we observe the number of acquired single nucleotide substitutions for every tumor sample, but since tumors in our data differ in the time since initiation and therefore differ in the numbers of divisions their cells have undergone before being sequenced, we cannot directly infer mutation rates. One way to phrase our main result about valley-crossing is that temporally clustered mutation processes have an increased rate of successful valley-crossings per attempted valley crossing. Our TSG deactivation score is constructed to reflect this idea. The number of TSGs serves as a proxy for successful valley-crossings and the total mutation burden serves as a proxy for attempted valley-crossings.

      To convey these points more clearly, we have rewritten the first paragraph in the Section “Proxies for valley crossing and for temporal clustering found in patient data” (P.6)
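      The logic of a score that relates successful to attempted valley-crossings can be sketched as follows. This is a hedged illustration only: the exact definition used in the manuscript is not reproduced here. The sketch assumes that a TSG counts as deactivated once it carries at least two single nucleotide substitutions in a sample, and that the score is the number of such genes divided by the sample's total number of substitutions. The function name, the `min_hits` parameter, the example gene list and the placeholder names `GENE_A` to `GENE_C` are all hypothetical.

```python
from collections import Counter


def tsg_deactivation_score(mutated_genes, tsg_names, min_hits=2):
    """Deactivated TSGs per single nucleotide substitution in one sample.

    mutated_genes: one gene name per observed substitution in the sample.
    tsg_names: set of genes treated as tumor suppressor genes.
    min_hits: substitutions required to call a TSG deactivated (assumed).
    """
    if not mutated_genes:
        raise ValueError("sample has no substitutions to normalize by")
    hits_per_tsg = Counter(g for g in mutated_genes if g in tsg_names)
    deactivated = sum(1 for hits in hits_per_tsg.values() if hits >= min_hits)
    return deactivated / len(mutated_genes)


# Hypothetical sample with eight substitutions; two TSGs are hit twice.
sample = ["TP53", "TP53", "RB1", "GENE_A", "GENE_B", "PTEN", "PTEN", "GENE_C"]
score = tsg_deactivation_score(sample, tsg_names={"TP53", "RB1", "PTEN"})
```

      Here the numerator plays the role of the proxy for successful valley-crossings and the denominator that of the proxy for attempted ones, so two samples with different mutation burdens can be compared without knowing their per-division mutation rates.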

      (12) Perhaps embed links to the COSMIC web pages for SBS2 and SBS13 in the text.

      Thank you for this suggestion. We have embedded the links at the first mention of SBS2 and SBS13 in the text.

    1. Author response:

      The following is the authors’ response to the original reviews.

      Reviewer #1 (Public review):

      In the Late Triassic and Early Jurassic (around 230 to 180 Ma ago), southern Wales and adjacent parts of England were a karst landscape. The caves and crevices accumulated remains of small vertebrates. These fossil-rich fissure fills are being exposed in limestone quarrying. In 2022 (reference 13 of the article), a partial articulated skeleton and numerous isolated bones from one fissure fill of end-Triassic age (just over 200 Ma) were named Cryptovaranoides microlanius and described as the oldest known squamate - the oldest known animal, by some 20 to 30 Ma, that is more closely related to snakes and some extant lizards than to other extant lizards. This would have considerable consequences for our understanding of the evolution of squamates and their closest relatives, especially for their speed and absolute timing, and was supported in the same paper by phylogenetic analyses based on different datasets.

      In 2023, the present authors published a rebuttal (reference 18) to the 2022 paper, challenging anatomical interpretations and the irreproducible referral of some of the isolated bones to Cryptovaranoides. Modifying the datasets accordingly, they found Cryptovaranoides outside Squamata and presented evidence that it is far outside. In 2024 (reference 19), the original authors defended most of their original interpretation and presented some new data, some of it from newly referred isolated bones. The present article discusses anatomical features and the referral of isolated bones in more detail, documents some clear misinterpretations, argues against the widespread but not justifiable practice of referring isolated bones to the same species as long as there is merely no known evidence to the contrary, further argues against comparing newly recognized fossils to lists of diagnostic characters from the literature as opposed to performing phylogenetic analyses and interpreting the results, and finds Cryptovaranoides outside Squamata again.

      Although a few of the character discussions and the discussion of at least one of the isolated bones can probably still be improved (and two characters are addressed twice), I see no sign that the discussion is going in circles or otherwise becoming unproductive. I can even imagine that the present contribution will end it.

      We appreciate the positive response from reviewer 1!

      Reviewer #2 (Public review):

      Congratulations on this thorough manuscript on the phylogenetic affinities of Cryptovaranoides.

      Thank you.

      Recent interpretations of this taxon, and perhaps some others, have greatly changed the field's understanding of reptile origins- for better and (likely) for worse.

      We agree, and note that while it is possible for challenges to be worse than the original interpretations, both the original and subsequent challenges are essential aspects of what makes science, science.

      This manuscript offers a careful review of the features used to place Cryptovaranoides within Squamata and adequately demonstrates that this interpretation is misguided, and therefore reconciles morphological and molecular data, which is an important contribution to the field of paleontology. The presence of any crown squamate in the Permian or Triassic should be met with skepticism, the same sort of skepticism provided in this manuscript.

      We agree and add that every testable hypothesis requires skepticism and testing.

      I have outlined some comments addressing some weaknesses that I believe will further elevate the scientific quality of the work. A brief, fresh read‑through to refine a few phrases, particularly where the discussion references Whiteside et al. could also give the paper an even more collegial tone.

      We have followed Reviewer 2’s recommendations closely (see below) and have justified in our responses if we do not fully follow a particular recommendation.

      This manuscript can be largely improved by additional discussion and figures, where applicable. When I first read this manuscript, I was a bit surprised at how little discussion there was concerning both non-lepidosauromorph lepidosaurs as well as stem-reptiles more broadly. This paper makes it extremely clear that Cryptovaranoides is not a squamate, but would greatly benefit in explaining why many of the characters either suggested by former studies to be squamate in nature or were optimized as such in phylogenetic analyses are rather widespread plesiomorphies present in crownward sauropsids such as millerettids, younginids, or tangasaurids. I suggest citing this work where applicable and building some of the discussion for a greatly improved manuscript. In sum:

      (1) The discussion of stem-reptiles should be improved. Nearly all of the supposed squamate features in Cryptovaranoides are present in various stem-reptile groups. I've noted a few, but this would be a fairly quick addition to this work. If this manuscript incorporates this advice, I believe arguments regarding the affinities of Cryptovaranoides (at least within Squamata) will be finished, and this manuscript will be better off for it.

      (2) I was also surprised at how little discussion there was here of putative stem-squamates or lepidosauromorphs more broadly. A few targeted comparisons could really benefit the manuscript. It is currently unclear as to why Cryptovaranoides could not be a stem-lepidosaur, although I know that the lepidosaur total-group in these manuscripts lacks character sampling due to their scarcity.

      We are responding to (1) and (2) together. We agree with the Reviewer that a thorough comparison of Cryptovaranoides to non-lepidosaurian reptiles is critical. This is precisely what we did in our previous study, Brownstein et al. (2023); see the main text and supplementary information therein. As addressed there, there is substantial convergence between early lepidosaurs and some groups of archosauromorphs (our inferred position for Cryptovaranoides). Many of those points are not addressed in detail here in order to avoid redundancy and are simply referenced back to Brownstein et al. (2023). Secondly, stem reptiles (i.e., non-lepidosauromorphs and non-archosauromorphs), such as those suggested above (millerettids, younginids, or tangasaurids), are substantially more distantly related to Cryptovaranoides (following any of the published hypotheses). As such, they share fewer traits (either symplesiomorphies or homoplasies), and so, in our opinion, we would risk losing the squamate focus of our study.

      We thus respectfully decline to engage the full scope of the problem in this contribution, but do note that this level of detailed work would make for an excellent student dissertation research program.

      (3) This manuscript can be improved by additional figures, such as the slice data of the humerus. The poor quality of the scan data for Cryptovaranoides is stated during this paper several times, yet the scan data is often used as evidence for the presence or absence of often minute features without discussion, leaving doubts as to what condition is true. Otherwise, several sections can be rephrased to acknowledge uncertainty, and probably change some character scorings to '?' in other studies.

      We strongly agree with the reviewer. Unfortunately, the original publication (Whiteside et al., 2021) did not make available the raw CT scan data to make this possible. As noted below in the Responses to Recommendations Section, we only have access to the mesh files for each segmented element. While one of us has observed the specimens personally, we have not had the opportunity to CT scan the specimens ourselves.

      Reviewer #3 (Public review):

      Summary:

      The study provides an interesting contribution to our understanding of Cryptovaranoides relationships, which is a matter of intensive debate among researchers. My main concerns are in regard to the wording of some statements, but generally, the discussion and data are well prepared. I would recommend moderate revisions.

      Strengths:

      (1) Detailed analysis of the discussed characters.

      (2) Illustrations of some comparative materials.

      Thank you for noting the strengths inherent to our study.

      Weaknesses:

      Some parts of the manuscript require clarification and rewording.

      One of the main points of criticism of Whiteside et al. is using characters for phylogenetic considerations that are not included in the phylogenetic analyses therein. The authors call it a "non-trivial substantive methodological flaw" (page 19, line 531). I would step down from such a statement for the reasons listed below:

      (1) Comparative anatomy is not about making phylogenetic analyses. Comparative anatomy is about comparing different taxa in search of characters that are unique and characters that are shared between taxa. This creates an opportunity to assess the level of similarity between the taxa and create preliminary hypotheses about homology. Therefore, comparative anatomy can provide some phylogenetic inferences.

      That does not mean that tests of congruence are not needed. Such comparisons are the first step that allows creating phylogenetic matrices for analysis, which is the next step of phylogenetic inference. That does not mean that all the papers with new morphological comparisons should end with a new or expanded phylogenetic matrix. Instead, such papers serve as a rationale for future papers that focus on building phylogenetic matrices.

      We agree completely. We would also add that not every study presenting comparative anatomical work need be concluded with a phylogenetic analysis.

      Our criticism of Whiteside et al. (2022) and (2024) is that these studies provided many unsubstantiated claims of having recovered synapomorphies between Cryptovaranoides and crown squamates without actually having done so through the standard empirical means (i.e., phylogenetic analysis and ancestral state reconstruction). Both Whiteside et al. (2022) and (2024) indicate characters presented as ‘shared with squamates’ along with 10 characters presented as synapomorphies. However, their actual phylogenetically recovered synapomorphies were few in number (only 3) and these were not discussed.

      Furthermore, the comparative anatomy in Whiteside et al. (2022) and (2024) was restricted to comparing †Cryptovaranoides to crown squamates, based on the assumption that †Cryptovaranoides was a crown squamate and thus only needed to be compared to crown squamates.

      In conclusion, we respectfully maintain that such efforts are “non-trivial substantive methodological flaw(s)”.

      (2) Phylogenetic matrices are never complete, both in terms of morphological disparity and taxonomic diversity. I don't know if it is even possible to have a complete one, but at least we can say that we are far from that. Criticising a work that did not include all the possibly relevant characters in the phylogenetic analysis is simply unfair. The authors should know that creating/expanding a phylogenetic matrix is a never-ending work, beyond the scope of any paper presenting a new fossil.

      Respectfully, we did not criticize previous studies for including an incomplete phylogeny. Instead, we criticized the methodology behind the homology statements made in Whiteside et al. (2022) and Whiteside et al. (2024).

      (3) Each additional taxon has the possibility of inducing a rethinking of characters. That includes new characters, new character states, character state reordering, etc. As I said above, it is usually beyond the scope of a paper with a new fossil to accommodate that into the phylogenetic matrix, as it requires not only scoring the newly described taxon but also many that are already scored. Since the digitalization of fossils is still rare, it requires a lot of collection visits that are costly in terms of time.

      We agree on all points, but we are unsure of what the Reviewer is asking us to do relative to this study.

      (4) If I were to search for a true flaw in the Whiteside et al. paper, I would check if there is a confirmation bias. The mentioned paper should not only search for characters that support Cryptovaranoides affinities with Anguimorpha but also characters that deny that. I am not sure if Whiteside et al. did such an exercise. Anyway, the test of congruence would not solve this issue because by adding only characters that support one hypothesis, we are biasing the results of such a test.

      We would refer the Reviewer to their section (1) on comparative anatomy. As we and the Reviewer have pointed out, Whiteside et al. did not make comparative anatomical statements outside of crown Squamata in their original study. More specifically, Whiteside et al. (2022, Fig. 8) presented a phylogeny where Cryptovaranoides formed a clade with Xenosaurus within the crown of Anguimorpha, or what they termed “Anguiformes”, and made comparisons to the anatomies of the legless anguids, Pseudopus and Ophisaurus. Whiteside et al. (2024) abandoned “Anguiformes”, maintained comparisons to Pseudopus, and emphasized affinities with Anguimorpha (but in almost all of their phylogenies as published, they do not recover a monophyletic Anguimorpha unless amphisbaenians and snakes are considered to be anguimorphans). Thus, we agree that confirmation bias was inherent in their studies.

      To sum up, there is nothing wrong with proposing some hypotheses about character homology between different taxa that can be tested in future papers that will include a test of congruence. Lack of such a test makes the whole argumentation weaker in Whiteside et al., but not unacceptable, as the manuscript might suggest. My advice is to step down from such strong statements like "methodological flaw" and "empirical problems" and replace them with "limitations", which I think better describes the situation.

      We agree with the first sentence in this paragraph – there is nothing wrong with proposing character homologies between different taxa based on comparative anatomical studies. However, that is not what Whiteside et al. (2022) and (2024) did. Instead, they claimed that an ad hoc comparison of Cryptovaranoides to crown Squamata confirmed that Cryptovaranoides is in fact a crown squamate and likely a member of Anguimorpha. Their study did not recognize limitations, but rather, concluded that their new taxon pushed the age of crown Squamata into the Triassic.

      As noted by Reviewer 2, such a claim, and the ‘data’ upon which it is based, should be treated with skepticism. We have elected to apply strong skepticism and stringent tests of falsification to our critique.

      Reviewer #1 (Recommendations for the authors):

      (1) Lines 596-598 promise the following: "we provide a long[-]form review of these and other features in Cryptovaranoides that compare favorably with non-squamate reptiles in Supplementary Material." You have kindly informed me that all this material has been moved into the main text; please amend this passage.

      This has been deleted.

      (2) Comments on science

      41: I would rather say "an additional role".

      This has been edited accordingly.

      43: Reconstructing the tree entirely from extant organisms and adding fossils later is how Hennig imagined it, because he was an entomologist, and fossil insects are, on average, extremely rare and usually very incomplete (showing a body outline and/or wing venation and little or nothing else). He was wrong, indeed wrong-headed. As a historical matter, phylogenetic hypotheses were routinely built on fossils by the mid-1860s, pretty much as soon as the paleontologists had finished reading On the Origin of Species, and this practice has never declined, let alone been interrupted. As a theoretical matter, including as many extinct taxa as possible in a phylogenetic analysis is desirable because it breaks up long branches (as most recently and dramatically shown by Mongiardino Koch & Parry 2020), and while some methods and some kinds of data are less susceptible to long-branch attraction and long-branch repulsion than others, none are immune; and while missing data (on average more common in fossils) can actively mislead parametric methods, this is not the case with parsimony, and even in Bayesian inference the problem is characters with missing data, not taxa with missing data. Some of you have, moreover, published tip-dated phylogenetic analyses. As a practical matter, molecular data are almost never available from fossils, so it is, of course, true that analyses which only use molecular data can almost never include fossils; but in the very rare exceptions, there is no reason to treat fossil evidence as an afterthought.

      We agree and have changed “have become” to “is.”

      49-50, 59: The ages of individual fissure fills can be determined by biostratigraphy; as far as I understand, all specimens ever referred to Cryptovaranoides [13, 19] come from a single fill that is "Rhaetian, probably late Rhaetian (equivalent of Cotham Member, Lilstock Formation)" [13: pp. 2, 15].

      We appreciate this comment; the recent literature, however, suggests that variable ages are implied by the biostratigraphy at the English Fissure Fills, so we have chosen to keep this as is. Also note that several isolated bones were not recovered with the holotype but were discussed by Whiteside et al. (2024). The provenance of these bones was not clearly discussed in that paper.

      59-60: Why "putative"? Just to express your disagreement? I would do that in a less misleading way, for example: "and found this taxon as a crown-group squamate (squamate hereafter) in their phylogenetic analyses." - plural because [19] presented four different analyses of two matrices just in the main paper.

      We have removed this word.

      121-124: The entepicondylar foramen is homologous all the way down the tree to Eusthenopteron and beyond. It has been lost a quite small number of times. The ectepicondylar foramen - i.e., the "supinator" (brachioradialis) process growing distally to meet the ectepicondyle, fusing with it and thereby enclosing the foramen - goes a bit beyond Neodiapsida and also occurs in a few other amniote clades (...as well as, funnily enough, Eusthenopteron in later ontogeny, but that's independent).

      We agree. However, the important point here is that the features on the humerus of Cryptovaranoides are not comparable (they differ in location and morphology) to the ent- and ectepicondylar foramina in other reptiles, as we discuss at length. As such, we have kept this sentence as is.

      153: Yes, but you [18] mistakenly wrote "strong anterior emargination of the maxillary nasal process, which is [...] a hallmark feature of archosauromorphs" in the main text (p. 14) - and you make the same mistake again here in lines 200-206! Also, the fact [19: Figure 2a-c] remains that Cryptovaranoides did not have an antorbital fenestra, let alone an antorbital fossa surrounding it (a fossa without a fenestra only occurs in some cases of secondary loss of the fenestra, e.g., in certain ornithischian dinosaurs). Unsurprisingly, therefore, Cryptovaranoides also does not have an orbital-as-opposed-to-nasal process on its maxilla [19: Figure 2a-c].

      Lines 243-249 (in the original manuscript) deal with the emargination of the maxillary nasal process (but this does not imply a full antorbital fenestra). We explicitly state that this feature alone "has limited utility" for supporting archosauromorph affinity.

      158-173: The problem here is not that the capitellum is not preserved; from amniotes and "microsaurs" to lissamphibians and temnospondyls, capitella ossify late, and larger capitella attach to proportionately larger concave surfaces, so there is nothing wrong with "the cavity in which it sat clearly indicates a substantial condyle in life". Instead, the problem is a lack of quantification (...as has also been the case in the use of the exact same character in the debate on the origin of lissamphibians); your following sentence (lines 173-175) stands. The rest of the paragraph should be drastically shortened.

      We appreciate this comment. We note that the ontogenetic variation of this feature is in part the issue with the interpretation provided by Whiteside et al. (2024). The issue is the lack of consistency on the morphology of the capitellum in that study. We are unclear on what the reviewer means by ‘quantification,’ as the character in question is binary. 

      250-252: It's not going to matter here, but in any different phylogenetic context, "sphenoid" would be confusing given the sphenethmoid, orbitosphenoid, pleurosphenoid, and laterosphenoid. I actually recommend "parabasisphenoid" as used in the literature on early amniotes (fusion of the dermal parasphenoid and the endochondral basisphenoid is standard for amniotes).

      We have added "(=parabasisphenoid)" on first use but retain use of sphenoid because in the squamate and archosauromorph literature, sphenoid (or basisphenoid) is used more frequently.

      314-315: Vomerine teeth are, of course, standard for sarcopterygians. Practically all extant amphibians have a vomerine toothrow, for example. A shagreen of denticles on the vomer is not as widespread but still reaches into the Devonian (Tulerpeton).

      We agree, but vomerine teeth are rare in lepidosaurs and archosaurs and occur only in very recent clades, e.g., anguids and one stem scincoid. Their presence in amphibians is not directly relevant to the phylogenetic placement of Cryptovaranoides among reptiles.

      372: Fusion was not scored as present in [13], but as unknown (as "partial" uncertainty between states 0 and 1 [19:8]), and seemingly all three options were explored in [19].

      We politely disagree with the reviewer; state 1 is scored in Whiteside et al. (2024).

      377-383: Together with the partially fused NHMUK PV R37378 [13: Figure 4B, C; 19: 8], this is actually an argument that Cryptovaranoides is outside but close to Unidentata. The components of the astragalus fuse so early in extant amniotes that there is just a single ossification center in the already fused cartilage, but there are Carboniferous and Permian examples of astragali with sutures in the expected places; all of the animals in question (Diadectes, Hylonomus, captorhinids) seem to be close to but outside Amniota. (And yet, the astragalus has come undone in chamaeleons, indicating the components have not been lost.) - Also, if NHMUK PV R37378 doesn't belong to a squamate close to Unidentata, what does it belong to? Except in toothless beaks, premaxillary fusion is really rare; only molgin newts come to mind (and age, tooth size, and tooth number of NHMUK PV R37378 are wholly incompatible with a salamandrid).

      The relevance of the astragalus to the current discussion is unclear, as we do not mention this element in our manuscript. We discuss the fusion of the premaxillae in our response to the previous comment.

      471-474: That thing is concave. (The photo is good enough that you can enlarge it to 800% before it becomes too pixelated.) It could be a foramen filled with matrix; it does not look like a grain sticking to the outside of the bone. Also, spell out that you're talking about "suc.fo" in Figure 3j.

      We are also a bit confused about this comment, as we state:

      “Finally, we note here that Whiteside et al. [19] appear to have labeled a small piece of matrix attached to a coracoid that they refer to †C. microlanius as the supracoroacoid [sic] foramen in their figure 3, although this labeling is inferred because only “suc, supracoroacoid [sic]” is present in their figure 3 caption.” (L. 519-522, P. 17). We cannot verify that this structure is concave, and so we keep this text as is.

      476-489: [19] conceded in their section 4.1 (pp. 11-12) that the atlas pleurocentrum, though fused to the dorsal surface of the axis intercentrum as usual for amniotes and diadectomorphs, was not fused to the axis pleurocentrum.

      This is correct, as we note in the MS. The issue is whether these elements are clearly identifiable.

      506-510: [19:12] did identify what they considered a possible ulnar patella, illustrated it (Figure 4d), scored it as unknown, and devoted the entire section 4.4 to it.

      512-523: What I find most striking is that Whiteside et al., having just discovered a new taxon, feel so certain that this is the last one and any further material from that fissure must be referable to one of the species now known from there.

      We agree with these points and believe we have devoted adequate text to addressing them. Note that the reviewer does not recommend any revisions to these sections.

      553: Not that it matters, but I'm surprised you didn't use TNT 1.6; it came out in 2023 and is free like all earlier versions.

      We have kept this as is, following the reviewer's comment, and because we were interested in replicating the analyses in the previous publications that have contributed to the debate about the identity of this taxon. For the present simple analyses, both versions should perform identically, as the search algorithms for discrete characters are identical across these versions.

      562: Is "01" a typo, or do you mean "0 or 1"? In that case, rather write "0/1" or "{01}".

      This has been corrected to {01}.

      (3) Comments on nomenclature and terminology

      55, 56: Delete both "...".

      This has been corrected.

      100: "ent- and ectepicondylar"

      For clarity, we have kept the full words.

      107-108: I understand that "high" is proximal and "low" is distal, but what is "the distal surface" if it is not the articular surface in the elbow joint?

      This has been corrected.

      120: "stem pan-lepidosaurs, and stem pan-squamates"; Lepidosauria and Squamata are crown groups that don't contain their stems

      This has been corrected.

      122, 123: Italics for Claudiosaurus and Delorhynchus.

      This has been corrected.

      130: Insert a space before "Tianyusaurus" (it's there in the original), and I recommend de-italicizing the two genus names to keep the contrast (as you did in line 162).

      This has been corrected.

      130, 131: Replace both "..." by "[...]", though you can just delete the second one.

      This has been corrected.

      174: Not a capitulum, but a grammatically even smaller (double diminutive) capitellum.

      This has been corrected.

      209, 224, Table 1: Both teams have consistently been doing this wrong. It's "recessus scalae tympani". The scala tympani ("ladder/staircase of the [ear]drum") isn't the recess, it's what the recess is for; therefore, the recess is named "recess of the scala tympani", and because there was no word for "of" in Classical Latin ("de" meant "off" and "about"), the genitive case was the only option. (For the same reason, the term contains "tympani", the genitive of "tympanum".)

      This has been corrected.

      415-425: This is a terminological nightmare. Ribs can have (and I'm not sure this is exhaustive): a) two separate processes (capitulum, tuberculum) that each bear an articulating facet, and a notch in between; b) the same, but with a non-articulating web of bone connecting the processes; c) a single uninterrupted elongate (even angled) articulating facet that articulates with the sutured or fused dia- and parapophysis; d) a single round articulating facet. Certainly, a) is bicapitate and d) is unicapitate, but for b) and c) all bets are off as to how any particular researcher is going to call them. This is a known source of chaos in phylogenetic analyses. I recommend writing a sentence or three on how the terms "unicapitate" & "bicapitate" lack fixed meanings and have caused confusion throughout tetrapod phylogenetics, and that the condition seen in Cryptovaranoides is nonetheless identical to that in archosauromorphs.

      This has been added: “This confusion in part stems from the lack of a fixed meaning for uni- and bicapitate rib heads; in any case, †C. microlanius possesses a condition identical to archosauromorphs as we have shown.”  (L.475-477, P.16).

      439-440: Other than in archosaurs, some squamates and Mesosaurus, in which sauropsids are dorsal intercentra absent?

      We are unclear about the relevance of the question to this section. The issue at hand is that some squamate lineages possess dorsal intercentra, so the absence of dorsal intercentra cannot be considered a squamate synapomorphy without the optimization of this feature along a phylogeny (which was not accomplished by Whiteside et al.).

      458: prezygapophyses.

      This has been corrected.

      516: "[...]".

      This has been corrected.

      566: synapomorphies.

      This has been corrected.

      587: Macrocnemus.

      This has been corrected.

      585: I strongly recommend either taking off and nuking the name Reptilia from orbit (like Pisces) or using it the way it is defined in Phylonyms, namely as the crown group (a subset of Neodiapsida). Either would mean replacing "neodiapsid reptiles" with "neodiapsids".

      This has been corrected to “neodiapsids.”

      625: Replace "inclusive clades" by "included clades", "component clades", "subclades", or "parts," for example.

      This has been kept as is because “inclusive clades” is common terminology and is used extensively in, for example, the PhyloCode. 

      659: Please update.

      References are updated.

      Fig. 8: Typo in Puercosuchus.

      This has been corrected.

      (4) Comments on style and spelling

      You inconsistently use the past and the present tense to describe [13, 19], sometimes both in the same sentence (e.g., lines 323 vs. 325). I recommend speaking of published papers in the past tense to avoid ascribing past views and acts to people in their present state.

      This has been corrected to be more consistent throughout the manuscript.

      48: Remove the second comma.

      This has been corrected.

      91: Replace "[13] and WEA24" by "[13, 19]".

      This has been corrected.

      100: Commas on both sides of "in fact" or on neither

      This has been corrected.

      117: I recommend "the interpretation in [19]". I have nothing against the abbreviation "WEA24", but you haven't defined it, and it seems like a remnant of incomplete editing. - That said, eLife does not impose a format on such things. If you prefer, you can just bring citation by author & year back; in that case, this kind of abbreviation would make perfect sense (though it should still be explicitly defined).

      129, 145: Likewise.

      We have modified this to [13] and [19] where necessary.

      192-198: Surely this should be made part of the paragraph in lines 158-175, which has the exact same headline?

      This has been corrected.

      200-206: Surely this should be made part of the paragraph in lines 148-156, which has the exact same headline?

      These sections deal with different issues pertaining to the analyses of Whiteside et al. (2024), and so we have kept the organization as is.

      214: Delete "that".

      This has been deleted.

      312: "Vomer" isn't an adjective; I'd write "main vomer body" or "vomer's main body" or "main body of the vomer".

      This has been corrected.

      350: "figured"

      This has been corrected.

      400: Rather, "rearticulated" or "worked to rearticulate"? - And why "several"? Just write "two". "Several" implies larger numbers.

      These issues have been corrected.

      448, 500: As which? As what kind of feature? I'm aware that "as such" is fairly widely used for "therefore", but it still confuses me every time, and I have to suspect I'm not the only one. I recommend "therefore" or "for this reason" if that is what you mean.

      “As such” has been deleted.

      452: Adobe Reader doesn't let me check, but I think you have two spaces after "of".

      This has been corrected.

      514, 539, 546, 552, 588, Fig. 3, 5, 6, Table 1: "WEA24" strikes again.

      This has been corrected.

      515: Remove the parentheses.

      This has been corrected.

      531: Insert a space after the period.

      This has been corrected.

      532: Remove both commas and the second "that".

      This has been corrected.

      538: Remove the comma.

      This has been kept as is because changing it would render the sentence grammatically incorrect.

      545: "[...]" or, better, nothing.

      This has been corrected.

      547: Spaces on both sides of the dash or on neither (as in line 553).

      This has been corrected.

      552: Rather, "conducted a parsimony analysis".

      This has been corrected.

      556: Space after "[19]".

      This has been corrected.

      560: Comma after "narrow".

      This has been corrected.

      600: Comma after "above" to match the one in the preceding line - there's an insertion in the sentence that must be flanked by commas on both sides.

      This has been corrected.

      603: Compound adjectives like "alpha-taxonomic" need a hyphen to avoid tripping readers up.

      This has been corrected.

      612: Similarly, "ancestral-state reconstruction" needs one to make immediately clear it isn't a state reconstruction that is ancestral but a reconstruction of ancestral states.

      This has been corrected.

      613: If you want to keep this comma, you need to match it with another after "Cryptovaranoides" in line 611.

      We have kept this as is, because removing this comma would render the sentence grammatically incorrect.

      615: Likewise, you need a comma after "and" because "except for a few features" is an insertion. The other comma is actually optional; it depends on how much emphasis you want to place on what comes after it.

      This has been added.

      622: Comma after "[48, 49]".

      This has been added.

      672: Missing italics and two missing spaces.

      This has been corrected.

      678, 680-681, 693, 700-701, 734, 742, 747, 788, 797, 799, 803, 808, 810-811, 814, 817, 820, 823, 828, 841, 843: Missing italics.

      This has been corrected.

      683, 689: These are book chapters. Cite them accordingly.

      This has been corrected.

      737: Missing DOI.

      No DOI is available.

      793: Missing Bolosaurus major; and I'd rather cite it as "2024" than "in press", and "online early" instead of "n/a".

      This has been corrected.

      835: Hoffstetter, RJ?

      This has been corrected.

      836: Is there something missing?

      This has been corrected.

      839: This is the same reference as number 20 (lines 683-684), and it is miscited in a different way...!

      This has been corrected.

      Reviewer #2 (Recommendations for the authors):

      (1) There is a brief mention of a phylogenetic analysis being re-run, but it is unclear if any modifications (changes in scoring) based on these observations were made. Please state this explicitly.

      This is explained in lines 600-622, P.20-21, in the section “Apomorphic characters not empirically obtained.” "In order to check the characters listed by Whiteside et al. [19] (p.19) as “two diagnostic characters” and “eight synapomorphies” in support of a squamate identity for †Cryptovaranoides, we conducted a parsimony analysis of the revised version of the dataset [32] provided by Whiteside et al. [19] in TNT v 1.5 [91]. We used Whiteside et al.’s [19] own data version"

      (2) Line 20: There is almost no discussion of non‑lepidosaur lepidosauromorphs. I suggest including this, as the archosauromorph‑like features reported in Cryptovaranoides appear rather plastic. Furthermore, diagnostic features of Archosauromorpha in other datasets (e.g., Ezcurra 2016 or the works of Spiekman) are notably absent (and unsampled) in Cryptovaranoides. Expanding this comparison would greatly strengthen the manuscript.

      The brevity of our discussion of non-lepidosaur lepidosauromorphs (which is not absent) is largely a function of the poor fossil record of this grade. Where necessary, we do discuss these taxa. Also see our previous study (Brownstein et al. 2023) for an extensive discussion of characters relevant to archosauromorphs.

      (3) Line 38: I suggest removing "Archosauromorpha" from the keywords. The authors make a compelling case that Cryptovaranoides is not a squamate, yet they do not fully test its placement within Archosauromorpha (as they acknowledge). Perhaps use "Reptilia" instead?

      We have removed this keyword.

      (4) Line 99: The authors' points here are well made and largely valid. The presence of the ent‑ and ectepicondylar foramina is indeed an amniote plesiomorphy and cannot confirm a squamate identity. Their absence, however, can be informative - although it is unclear whether the CT scans of the humerus are of sufficient resolution, and Figure 4 of Brownstein et al. looks hastily reconstructed (perhaps owing to limited resolution). Moreover, the foramina illustrated by Whiteside do resemble those of other reptiles, albeit possibly over‑prepared and exaggerated.

      The issue with the noted figure is indeed due to poor resolution from the scans. Although we agree with the reviewer, we hesitate to talk about absence in this taxon being phylogenetically informative given the confounding influence of ontogeny.

      (5) I encourage the authors to provide slice data to support the claim that the foramina are absent (which could certainly be correct!); otherwise, the assertion remains unsubstantiated.

      We only have access to the mesh files of segmented bones, not the raw (reconstructed slice) data.

      (6) PLEASE NOTE - because the specimen is juvenile, the apparent absence of the ectepicondylar foramen is equivocal: the supinator process develops through ontogeny and encloses this foramen (see Buffa et al. 2025 on Thadeosaurus, for example).

      See above.

      (7) Line 122: Italicize 'Delorhynchus'

      This has been corrected.

      (8) Lines 131‑132: I'd suggest deleting the final sentence; it feels a little condescending, and your argument is already persuasive.

      This has been corrected.

      (9) Line 129: Please note that owenettid "parareptiles" also lack this process, as do several other stem‑saurians. Its absence is therefore not diagnostic of Squamata.

      Also: Such plasticity is common outside the crown. Milleropsis and Younginidae develop this process during ontogeny, even though a lower temporal bar never fully forms.

      We appreciate this point. See discussion later in the manuscript.

      (11) Line 172: Consider adding ontogeny alongside taphonomy and preservation. A juvenile would likely have a poorly developed radial condyle, if any. Acknowledging this possibility will add some needed nuance.

      This sentence has been modified, but we have not added in discussion of ontogeny here because it is not immediately relevant to refuting the argument about inference of the presence of this feature when it is not preserved.

      (12) Line 177: The "septomaxilla" in Whiteside et al. (2024, Figure 1C) resembles the contralateral premaxilla in dorsal view, with the maxillary process on the left and the palatal (or vomerine) process on the right (the dorsal process appears eroded). The foramen looks like a prepalatal foramen, common to many stem and crown reptiles. Consequently, scoring the septomaxilla as absent may be premature; this bone often ossifies late. In my experience with stem‑reptile aggregations, only one of several articulated individuals may ossify this element.

      We agree that the presence of a late-ossifying septomaxilla cannot be ruled out, but our point remains (in agreement with the Referee) that scoring the septomaxilla as present based on the amorphous fragments is premature.

      (13) Line 200: Tomography data should be shown before citing it. The posterior margin of the maxilla appears rather straight, and the maxilla itself is tall for an archosauromorph. It would be more convincing to score this feature as present only after illustrating the relevant slices - and, as you note, the trait is widespread among non‑archosauromorphs.

      See above and Brownstein et al. (2023).

      (14) Line 208: Well argued: how could Whiteside et al. confidently assign a disarticulated element? Their "vagus" foramen actually resembles a standard hypoglossal foramen - identical to that seen in many stem reptiles, which often have one large and one small opening.

      Thank you!

      (15) Line 248: Again, please illustrate this region. One cannot argue for absence without showing the slice data. Note that millerettids and procolophonians - contemporaneous with Cryptovaranoides - possess an enclosed vidian canal, so the feature is broadly distributed.

      See above.

      (16) Line 258: The choanal fossa is intriguing: originally created for squamate matrices, yet present (to varying degrees) in nearly every reptile I have examined. It is strongly developed in millerettids (see Jenkins et al. 2025 on Milleropsis and Milleretta) and younginids, much like in squamates - Tiago appropriately scores it as present. Thus, it may be more of a "Neodiapsida + millerettids" character. In any case, the feature likely forms an ordered cline rather than a simple binary state.

      We agree and look forward to future study of this feature.

      (17) Line 283: Bolosaurids are not diapsids and, per Simões, myself, and others, "Diapsida" is probably invalid, at least how it is used here. Better to say "neodiapsids" for choristoderes and "stem‑reptiles" or "sauropsids" for bolosaurids. Jenkins et al.'s placement is largely a function of misidentifying the bolosaurid stapes as the opisthotic.

      We are not entirely clear on this point since bolosaurids are not mentioned in this section.

      (18) Line 298: Here, you note that the CT scans are rather coarse, which makes some earlier statements about absence/presence less certain (e.g., humeral foramina). It may strengthen the paper to make fewer definitive claims where resolution limits interpretation.

      We appreciate this point. However, in the case of the humeral foramina, the coarseness of the scans is one reason why we question Whiteside et al.'s scoring of these features as present.

      (19) Line 314: Multiple rows of vomerine teeth are standard for amniotes; lepidosauromorphs such as Paliguana and Megachirella also exhibit them (though they may not have been segmented in the latter's description). Only a few groups (e.g., varanopids, some millerettids) have a single medial row.

      We appreciate this point and have added those citations in the following new sentence: “Multiple rows of vomerine teeth are common in reptiles outside of Squamata [76]; the presence of only one row is restricted to a handful of clades, including millerettids [77,78], †Tanystropheus [49], and some [79], but not all [71,80] choristoderes.” (L. 360-363, P. 12).

      (20) Line 317: This is likely a reptile plesiomorphy - present in all millerettids (e.g., Milleropsis and Milleretta per Jenkins et al.). Citing these examples would clarify that it is not uniquely squamate. Could it be secondarily lost in archosauromorphs?

      We appreciate this point and have cited Jenkins et al. here. It is out of the scope of this discussion to discuss the polarity of this feature relative to Archosauromorpha.

      (21) Line 336: Unfortunately, a distinct quadratojugal facet is usually absent in Neodiapsids and millerettids; where present, the quadratojugal is reduced and simply overlaps the quadrate.

      We appreciate this point but feel that reviewing the distribution of this feature across all reptiles is not relevant to the text noted.

      (22) Line 357: Pterygoid‑quadrate overlap is likely a tetrapod plesiomorphy. Whiteside et al. do not define its functional or phylogenetic significance, and the overlap length is highly variable even among sister taxa.

      We agree, but in any case this feature is impossible to assess in Cryptovaranoides.

      (23) Line 365: Another well‑written section - clear and persuasive.

      Thank you!

      (24) Line 385: The cephalic condyle is widespread among neodiapsids, so it is not uniquely squamate.

      We agree.

      (25) Character 391: Note that the frontal underlapping the parietal is widespread, appearing in both millerettids and neodiapsids such as Youngina.

      We appreciate this point, but the point here deals with the fact that this feature is not observable in the holotype of Cryptovaranoides.

      (26) Line 415: The "anterior process" is actually common among crown reptiles, including sauropterygians, so it cannot by itself place Cryptovaranoides within Archosauromorpha.

      We agree but also note that we do not claim this feature unambiguously unites Cryptovaranoides with Archosauromorpha.

      (28) Line 460: Yes - Whiteside et al. appear to have relabeled the standard amniote coracoid foramen. Excellent discussion.

      Thank you!

      (29) Line 496: While mirroring Whiteside's structure, discussing this mandibular character earlier, before the postcrania, might aid readability.

      We have chosen to keep this structure as is.

      (30) Lines 486-588: This section oversimplifies the quadrate articulation.

      We are unclear how this is an oversimplification.

      (31) Both Prolacerta and Macrocnemus possess a cephalic condyle and some mobility (though less than many squamates). In Prolacerta (Miedema et al. 2020, Figure 4), the squamosal posteroventral process loosely overlaps the quadrate head.

      We assume this comment refers to the section "Peg-in-notch articulation of quadrate head"; we appreciate the clarification that this feature occurs to a variable extent outside squamates, but this does not affect our statement that the material of Cryptovaranoides is too poorly preserved to confirm its presence.

      (32) Where is this process in Cryptovaranoides? It is not evident in Whiteside's segmentation of the slender squamosal - please illustrate.

      We are unclear as to which section this comment refers.

      (33) Additionally, the quadrate "conch" of Cryptovaranoides is well developed, bearing lateral and medial tympanic crests; the lateral crest is absent in the cited archosauromorphs.

      We note that no vertebrate has a medial tympanic crest (it is always laterally placed for the tympanic membrane, when present). If this is what the reviewer refers to, this is a feature commonly found across all tetrapods bearing a tympanum attached to the quadrate (e.g., most reptiles), and so it is not very relevant phylogenetically. Regarding its presence in Cryptovaranoides, the lateral margin of the quadrate is broken (Brownstein et al., 2023), so it cannot be determined. This incomplete preservation also makes an interpretation of a quadrate conch very hard to determine. But as currently preserved, there is no evidence whatsoever for this feature.

      (34) Line 591: The cervical vertebrae of Cryptovaranoides are not archosauromorph‑like. Archosauromorph cervicals are elongate, parallelogram‑shaped, and carry long cervical ribs - none of which apply here. As the manuscript lacks a phylogenetic analysis, including these features seems unnecessary. Should they be added to other datasets, I suspect Cryptovaranoides would align along the lepidosaur stem (though that remains to be tested).

      We politely disagree. The reviewer here mentions that the cervical vertebrae of archosauromorphs are generally shaped differently from those in Cryptovaranoides. The description provided (“elongate, parallelogram‑shaped, and carry long cervical ribs”) is basically limited to protorosaurians (e.g., tanystropheids, Macrocnemus) and early archosauriforms. We note that archosauromorph cervicals are notoriously variable in shape, especially in the crown, but also among early archosauromorphs. Further, the cervical ribs are notoriously similar among early archosauromorphs (including protorosaurians) and Cryptovaranoides, as discussed and illustrated in Brownstein et al., 2023 (Figs. 2 and 3), especially concerning the presence of the anterior process.

      Further, we do include a phylogenetic analysis of the matrix provided in Whiteside et al. (2024) as noted in our results section. In any case, we direct the reviewer to our previous study (Brownstein et al., 2023), in which we conduct phylogenetic analyses that included characters relevant to this note.

      Reviewer #3 (Recommendations for the authors):

      (1) The authors should use specimen numbers throughout the text because we are talking about multiple individuals, and the authors contest the previous affinity of some of them. For example, on page 16, line 447, they mention an isolated vertebra but without any number. The specimen can be identified in the referenced article, but it would be much easier for the reader if the number were also provided here.

      Agreed and added.

      (2) Abstract: "Our team questioned this identification and instead suggested Cryptovaranoides had unclear affinities to living reptiles."

      That is very imprecise. The team suggested that it could be an archosauromorph or an indeterminate neodiapsid. Please change accordingly.

      We politely disagree. We stated in our 2023 study that whereas our phylogenetic analyses place this taxon in Archosauromorpha, it remains unclear where it would belong within the latter. This is compatible with “unclear affinities to living reptiles”.

      (3) Page 7, line 172: "Taphonomy and poor preservation cannot be used to infer the presence of an anatomical feature that is absent." Unfortunate wording. Taphonomy always has to be used to infer the presence or absence of anatomical features. Sometimes the feature is not preserved, but it leaves imprints/chemical traces or other taphonomic indicators that it was present in the organism. Please remove or rewrite the sentence.

      We agree and have modified the sentence to read: “Taphonomy and poor preservation cannot be used alone to justify the inference that an anatomical feature was present when it is not preserved and there is no evidence of postmortem damage. In a situation when the absence of a feature is potentially ascribable to preservation, its presence should be considered ambiguous.” (L. 141-145, P.5).

      (4) Page 4, line 91, please explain "WEA24" here, though it is unclear why this abbreviation is used instead of citation in the manuscript.

      This has been corrected to Whiteside et al. [19].

      (5) Page 6, line 144: "Together, these observations suggest that the presence of a jugal posterior process was incorrectly scored in the datasets used by WEA24 (type (ii) error)." That sentence is unclear. Why did the authors use "suggest"? Does it mean that they did not have access to the original data matrix to check it? If so, it should be clearly stated at the beginning of the manuscript.

      See earlier; this has been modified and “suggest” has been removed.

      (6) Page 7, line 174: "Finally, even in the case of the isolated humerus with a preserved capitulum, the condyle illustrated by Whiteside et al. [19] is fairly small compared to even the earliest known pan-squamates, such as Megachirella wachtleri (Figure 4)." Figure 4 does not show any humeri. Please correct.

      The reference to figure 4 has been removed.

      (7) Page 8, line 195-198: "This is not the condition specified in either of the morphological character sets that they cite [18,38], the presence of a distinct condyle that is expanded and is by their own description not homologous to the condition in other squamates." This is a bit unclear. Could the authors explain it a little bit further? How is the condition that is specified in the referred papers different compared to the Whiteside et al. description?

      We appreciate this comment and have broken this sentence up into three sentences to clarify what we mean:

      “The projection of the radial condyle above the adjacent region of the distal anterior extremity is not the condition specified in either of the morphological character sets that Whiteside et al. [19] cite [18,32]. The condition specified in those studies is the presence of a distinct condyle that is expanded. The feature described in Whiteside et al. [19] does not correspond to the character scored in the phylogenetic datasets.” (L.220-225, P.8).

      (8) Page 16, line 446: "they observed in isolated vertebrae that they again refer to C. microlanius without justification". That is not true. The referred paper explains the attribution of these vertebrae to Cryptovaranoides (see section 5.3 therein). The authors do not have to agree with that justification, but they cannot claim that no justification was made. Please correct it here and throughout the text.

      We have modified this sentence but note that the justification in Whiteside et al. (2024) lacked rigor. Whiteside et al. (2024) state: “Brownstein et al. [5] contested the affinities of three vertebrae, cervical vertebra NHMUK PV R37276, dorsal vertebra NHMUK PV R37277 and sacral vertebra NHMUK PV R37275. While all three are amphicoelous and not notochordal, the first two can be directly compared to the holotype. Cervical vertebra NHMUK PV R37276 is of the same form as the holotype CV3 with matching neural spine, ventral keel (=crest) and the posterior lateral ridges or lamina (figure 3c,d) shown by Brownstein et al. [5, fig. 1a]. The difference is that NHMUK PV R37276 has a fused neural arch to the pleurocentrum and a synapophysis rather than separate diapophysis and parapophysis of the juvenile holotype (figure 3c). Neurocentral fusion of the neural arch and centrum can occur late in modern squamates, ‘up to 82% of the species maximum size’ [28].

      The dorsal surface of dorsal vertebra NHMUK PV R37277 (figure 3e) can be matched to the mid-dorsal vertebra in the †Cryptovaranoides holotype (figure 4d, dor.ve) and has the same morphology of wide, dorsally and outwardly directed, prezygapophyses, downwardly directed postzygapophyses and similar neural spine. It is also of similar proportions to the holotype when viewed dorsally (figures 3e and 4d), both being about 1.2 times longer anteroposteriorly than they are wide, measured across the posterior margin. The image in figure 4d demonstrates that the posterior vertebrae are part of the same spinal column as the truncated proximal region but the spinal column between the two parts is missing, probably lost in quarrying or fossil collection.”

      This justification is based on pointing out the presence of supposed shared features between these isolated vertebrae and those in the holotype of Cryptovaranoides, even though none of these features are diagnostic for that taxon. We have changed the sentence in our manuscript to read:

      “Whiteside et al. [19] concur with Brownstein et al. [18] that the diapophyses and parapophyses are unfused in the anterior dorsals of the holotype of †Cryptovaranoides microlanius, and restate that fusion of these structures is based on the condition they observed in isolated vertebrae that they refer to †C. microlanius based on general morphological similarity and without reference to diagnostic characters of †C. microlanius” (L. 502-507, P. 17).

      (9) Figure 2. The figure caption lacks some explanations. Please provide information about affinity (e.g., squamate/gekkotan), age, and locality of the taxa presented. Are these left or right palatines? The second one seems to be incomplete, and maybe it is worth replacing it with something else?

      The figure caption has been modified:

      “Figure 2. Comparison of palatine morphologies. Blue shading indicates choanal fossa. Top image of †Cryptovaranoides referred left palatine is from Whiteside et al. [19]. Middle is the left palatine of †Helioscopos dickersonae (Squamata: Pan-Gekkota) from the Late Jurassic Morrison Formation [62]. Bottom is the right palatine of †Eoscincus ornatus (Squamata: Pan-Scincoidea) from the Late Jurassic Morrison Formation [31].”

      (10) Figure 8. The abbreviations are not explained in the figure caption.

      These have been added.

    1. Author response:

      The following is the authors’ response to the previous reviews

      Reviewer #1 (Public review):

      Introduction & Theory

      (1) It is difficult to appreciate why the first trial of extinction in a standard protocol does NOT produce the retrieval-extinction effect. This applies to the present study as well as others that have purported to show a retrieval-extinction effect. The importance of this point comes through at several places in the paper. E.g., the two groups in Study 1 experienced a different interval between the first and second CS extinction trials; and the results varied with this interval: a longer interval (10 min) ultimately resulted in less reinstatement of fear than a shorter interval. Even if the different pattern of results in these two groups was shown/known to imply two different processes, there is nothing in the present study that addresses what those processes might be. That is, while the authors talk about mechanisms of memory updating, there is little in the present study that permits any clear statement about mechanisms of memory. The references to a "short-term memory update" process do not help the reader to understand what is happening in the protocol.

      We agree with the reviewer that whether and how the retrieval-extinction paradigm works is still under debate. Our results provide another line of evidence that such a paradigm is effective in producing long-term fear amnesia. The focus of the current manuscript is to demonstrate that the retrieval-extinction paradigm can also facilitate a short-term fear memory deficit measured by SCR. Our TMS study provided some preliminary evidence on the brain mechanisms involved, namely the causal relationship between dorsolateral prefrontal cortex (dlPFC) activity and the short-term fear amnesia. It showed that both the retrieval interval and intact dlPFC activity were necessary for the short-term fear memory deficit, and these were accordingly referred to as the “mechanism” for memory update. We acknowledge that the term “mechanism” might have different connotations for different researchers. We now more explicitly clarify what we mean by “mechanisms” in the manuscript (line 99) as follows:

      “In theory, different cognitive mechanisms underlying specific fear memory deficits, therefore, can be inferred based on the difference between memory deficits.”

      In reply to this point, the authors cite evidence to suggest that "an isolated presentation of the CS+ seems to be important in preventing the return of fear expression." They then note the following: "It has also been suggested that only when the old memory and new experience (through extinction) can be inferred to have been generated from the same underlying latent cause, the old memory can be successfully modified (Gershman et al., 2017). On the other hand, if the new experiences are believed to be generated by a different latent cause, then the old memory is less likely to be subject to modification. Therefore, the way the 1st and 2nd CS are temporally organized (retrieval-extinction or standard extinction) might affect how the latent cause is inferred and lead to different levels of fear expression from a theoretical perspective." This merely begs the question: why might an isolated presentation of the CS+ result in the subsequent extinction experiences being allocated to the same memory state as the initial conditioning experiences? This is not yet addressed in any way.

      As in our previous response, this manuscript is not about investigating the cognitive mechanism by which an isolated presentation of the CS+ would suppress fear expression in the long term. As the reviewer is aware, and as we have addressed in our previous response letters, both positive and negative evidence abound as to whether the retrieval-extinction paradigm can successfully suppress long-term fear expression. Previous research depicted mechanisms instigated by the single CS+ retrieval at the molecular, cellular, and systems levels, as well as through cognitive processes in humans. In the current manuscript, we simply set out to test whether, in addition to the long-term fear amnesia, the retrieval-extinction paradigm can also affect subjects’ short-term fear memory.

      (2) The discussion of memory suppression is potentially interesting but, in its present form, raises more questions than it answers. That is, memory suppression is invoked to explain a particular pattern of results but I, as the reader, have no sense of why a fear memory would be better suppressed shortly after the retrieval-extinction protocol compared to the standard extinction protocol; and why this suppression is NOT specific to the cue that had been subjected to the retrieval-extinction protocol.

      Memory suppression is the hypothesis we proposed to explain the results we obtained in the experiments. We discussed the possibility of memory suppression and listed the reasons why such a mechanism might be at work. As we mentioned in the manuscript, our findings are consistent with the memory suppression mechanism in at least two respects: 1) cue-independence and 2) thought-control ability dependence. We agree that the questions raised by the reviewer are interesting, but answering them would require a series of further experiments to disentangle the various variables and conceptual questions about the purpose of a phenomenon, which we are afraid is out of the scope of the current manuscript. We refer the reviewer to the discussion section, where we propose memory suppression as a potential mechanism for the short-term amnesia we observed (lines 562-569), as follows:

      “Previous studies indicate that a suppression mechanism can be characterized by three distinct features: first, the memory suppression effect tends to emerge early, usually 10-30 mins after memory suppression practice and can be transient (MacLeod and Macrae, 2001; Saunders and MacLeod, 2002); second, the memory suppression practice seems to directly act upon the unwanted memory itself (Levy and Anderson, 2002), such that the presentation of other cues originally associated with the unwanted memory also fails in memory recall (cue-independence); third, the magnitude of memory suppression effects is associated with individual difference in control abilities over intrusive thoughts (Küpper et al., 2014).”

      (3) Relatedly, how does the retrieval-induced forgetting (which is referred to at various points throughout the paper) relate to the retrieval-extinction effect? The appeal to retrieval-induced forgetting as an apparent justification for aspects of the present study reinforces points 2 and 3 above. It is not uninteresting but lacks clarification/elaboration and, therefore, its relevance appears superficial at best.

      We brought up the topic of retrieval-induced forgetting (RIF) to stress the point that memory suppression can be unconscious. In a standard RIF paradigm, unlike the think/no-think paradigm, subjects are not explicitly told to suppress the non-target memories. However, to successfully retrieve the target memory, the cognitive system actively inhibits the non-target memories, effectively implementing a memory suppression mechanism (though unconsciously). Therefore, it is possible our results might be explained by the memory suppression framework. We elaborated on this point in the discussion section (lines 578-584):

      “In our experiments, subjects were not explicitly instructed to suppress their fear expression, yet the retrieval-extinction training significantly decreased short-term fear expression. These results are consistent with the short-term amnesia induced with the more explicit suppression intervention (Anderson et al., 1994; Kindt and Soeter, 2018; Speer et al., 2021; Wang et al., 2021; Wells and Davies, 1994). It is worth noting that although consciously repelling unwanted memory is a standard approach in memory suppression paradigm, it is possible that the engagement of the suppression mechanism can be unconscious.”

      (4) I am glad that the authors have acknowledged the papers by Chalkia, van Oudenhove & Beckers (2020) and Chalkia et al (2020), which failed to replicate the effects of retrieval-extinction reported by Schiller et al in Reference 6. The authors have inserted the following text in the revised manuscript: "It should be noted that while our long-term amnesia results were consistent with the fear memory reconsolidation literature, there were also studies that failed to observe fear prevention (Chalkia, Schroyens, et al., 2020; Chalkia, Van Oudenhove, et al., 2020; Schroyens et al., 2023). Although the memory reconsolidation framework provides a viable explanation for the long-term amnesia, more evidence is required to validate the presence of reconsolidation, especially at the neurobiological level (Elsey et al., 2018). While it is beyond the scope of the current study to discuss the discrepancies between these studies, one possibility to reconcile these results concerns the procedure for the retrieval-extinction training. It has been shown that the eligibility for old memory to be updated is contingent on whether the old memory and new observations can be inferred to have been generated by the same latent cause (Gershman et al., 2017; Gershman and Niv, 2012). For example, prevention of the return of fear memory can be achieved through gradual extinction paradigm, which is thought to reduce the size of prediction errors to inhibit the formation of new latent causes (Gershman, Jones, et al., 2013). Therefore, the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause." Firstly, if it is beyond the scope of the present study to discuss the discrepancies between the present and past results, it is surely beyond the scope of the study to make any sort of reference to clinical implications!!!

      As we have clearly stated in our manuscript, this paper is not about discussing why some studies were or were not able to replicate the retrieval-extinction results originally reported by Schiller et al. 2010. Instead, we aimed to report a novel short-term fear amnesia through the retrieval-extinction paradigm, above and beyond the long-term amnesia reported before. Speculating about clinical implications of these findings is unrelated to the long-term amnesia debate in the reconsolidation world. We now refer the reader to several perspectives and reviews that have proposed ways to resolve these discrepancies as follows (lines 642-673).

      Secondly, it is perfectly fine to state that "the effectiveness of the retrieval-extinction paradigm might depend on the reliability of such paradigm in inferring the same underlying latent cause..." This is not uninteresting, but it also isn't saying much. Minimally, I would expect some statement about factors that are likely to determine whether one is or isn't likely to see a retrieval-extinction effect, grounded in terms of this theory.

      Again, as we have responded many times, we simply do not know why some studies were able to suppress the fear expression using the retrieval-extinction paradigm and other studies weren’t. This is still an unresolved issue that the field is actively engaging with, and we now refer the reader to several papers dealing with this issue. However, this is NOT the focus of our manuscript. Having a healthy debate does not mean that every study using the retrieval-extinction paradigm must address the long-standing question of why the retrieval-extinction paradigm is effective (at least in some studies).

      Clarifications, Elaborations, Edits

      (5) Some parts of the paper are not easy to follow. Here are a few examples (though there are others):

      (a) In the abstract, the authors ask "whether memory retrieval facilitates update mechanisms other than memory reconsolidation"... but it is never made clear how memory retrieval could or should "facilitate" a memory update mechanism.

      We meant to state that the retrieval-extinction paradigm might have effects on fear memory, above and beyond the purported memory reconsolidation effect. Sentence modified (lines 25-26) as follows:

      “Memory reactivation renders consolidated memory fragile and thereby opens the window for memory updates, such as memory reconsolidation.”

      (b) The authors state the following: "Furthermore, memory reactivation also triggers fear memory reconsolidation and produces cue specific amnesia at a longer and separable timescale (Study 2, N = 79 adults)." Importantly, in study 2, the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction. This result is interesting but cannot be easily inferred from the statement that begins "Furthermore..." That is, the results should be described in terms of the combined effects of retrieval and extinction, not in terms of memory reactivation alone; and the statement about memory reconsolidation is unnecessary. One can simply state that the retrieval-extinction protocol produced a cue-specific disruption in responding when testing occurred 24 hours after the end of extinction.

      The sentence the reviewer referred to was in our original manuscript submission but has since been modified based on the reviewer’s comments from the last round of revision. Please see the abstract (lines 30-35) of our revised manuscript from the last round of revision:

      “Furthermore, across different timescales, the memory retrieval-extinction paradigm triggers distinct types of fear amnesia in terms of cue-specificity and cognitive control dependence, suggesting that the short-term fear amnesia might be caused by different mechanisms from the cue-specific amnesia at a longer and separable timescale (Study 2, N = 79 adults).”

      (c) The authors also state that: "The temporal scale and cue-specificity results of the short-term fear amnesia are clearly dissociable from the amnesia related to memory reconsolidation, and suggest that memory retrieval and extinction training trigger distinct underlying memory update mechanisms." ***The pattern of results when testing occurred just minutes after the retrieval-extinction protocol was different to that obtained when testing occurred 24 hours after the protocol. Describing this in terms of temporal scale is unnecessary; and suggesting that memory retrieval and extinction trigger different memory update mechanisms is not obviously warranted. The results of interest are due to the combined effects of retrieval+extinction and there is no sense in which different memory update mechanisms should be identified with the different pattern of results obtained when testing occurred either 30 min or 24 hours after the retrieval-extinction protocol (at least, not the specific pattern of results obtained here).

      Again, we are afraid that the reviewer referred to the abstract in the original manuscript submission, instead of the revised abstract we submitted in the last round. Please see lines 37-39 of the revised abstract where the sentence was already modified (or the abstract from the last round of revision).

      The facts that the 30min, 6hr and 24hr test results are different in terms of their cue-specificity and thought-control ability dependence are, to us, an important discovery in terms of delineating different cognitive processes at work following the retrieval-extinction paradigm. We want to emphasize that the fear memories after going through the retrieval-extinction paradigm showed interesting temporal dynamics in terms of their magnitudes, cue-specificity and thought-control ability dependence.

      (d) The authors state that: "We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory update mechanisms following extinction training, and these mechanisms can be further disentangled through the lens of temporal dynamics and cue-specificities." *** The first part of the sentence is confusing around usage of the term "facilitate"; and the second part of the sentence that references a "lens of temporal dynamics and cue-specificities" is mysterious. Indeed, as all rats received the same retrieval-extinction exposures in Study 2, it is not clear how or why any differences between the groups are attributed to "different memory update mechanisms following extinction"

      The term “facilitate” was used to highlight the fact that the short-term fear amnesia effect is also memory retrieval dependent, as study 1 demonstrated. The novelty of the short-term fear memory deficit can be distinguished from the long-term memory effect via cue-specificity and thought-control ability dependence. Sentence has been modified (lines 97-101) as follows:

      “We hypothesize that the labile state triggered by the memory retrieval may facilitate different memory deficits following extinction training, and these deficits can be further disentangled through the lens of temporal dynamics and cue-specificities. In theory, different cognitive mechanisms underlying specific fear memory deficits, therefore, can be inferred based on the difference between memory deficits.”

      Data

      (6A) The eight participants who were discontinued after Day 1 in Study 1 were all from the no reminder group. The authors should clarify how participants were allocated to the two groups in this experiment so that the reader can better understand why the distribution of non-responders was non-random (as it appears to be).

      (6B) Similarly, in study 2, of the 37 participants that were discontinued after Day 2, 19 were from Group 30 min and 5 were from Group 6 hours. The authors should comment on how likely these numbers are to have been by chance alone. I presume that they reflect something about the way that participants were allocated to groups: e.g., the different groups of participants in studies 1 and 2 could have been run at quite different times (as opposed to concurrently). If this was done, why was it done? I can't see why the study should have been conducted in this fashion - this is for myriad reasons, including the authors' concerns re SCRs and their seasonal variations.

      As we responded in the previous response letters (as well as in the revised manuscript), subjects were excluded because their SCR did not reach the threshold of 0.02 S when electric shock was applied. Subjects were assigned to different treatments daily (e.g., Day 1 for the reminder group and Day 2 for the no-reminder group) to avoid potential confusion in switching protocols between subjects within the same day. We suspect that the non-responders might be related to the body thermal conditions caused by the lack of central heating on specific dates. Please note that the discontinued subjects (non-responders) were let go immediately after the failure to detect their SCR (< 0.02 S) on Day 1 and never invited back on Day 2, so it’s possible that the discontinued subjects were all from certain dates on which the body thermal conditions were not ideal for SCR collection. Despite the number of excluded subjects, we verified the short-term fear amnesia effect in three separate studies, which to us should serve as strong evidence in terms of the validity of the effect.

      (6C) In study 2, why is responding to the CS- so high on the first test trial in Group 30 min? Is the change in responding to the CS- from the last extinction trial to the first test trial different across the three groups in this study? Inspection of the figure suggests that it is higher in Group 30 min relative to Groups 6 hours and 24 hours. If this is confirmed by the analysis, it has implications for the fear recovery index which is partly based on responses to the CS-. If not for differences in the CS- responses, Groups 30 min and 6 hours are otherwise identical. That is, the claim of differential recovery to the CS1 and CS2 across time may simply be an artefact of the way that the recovery index was calculated. This is unfortunate but also an important feature of the data given the way in which the fear recovery index was calculated.

      We provided a detailed analysis of this question in our previous response letter, and we reproduce that response here:

      Following the reviewer’s comments, we went back and calculated the mean SCR difference of CS- between the first test trial and the last extinction trial for all three studies (see Author response image 1 below). In study 1, there was no difference in the mean CS- SCR (between the first test trial and last extinction trial) between the reminder and no-reminder groups (Kruskal-Wallis test), though both groups showed significant fear recovery even in the CS- condition (Wilcoxon signed rank test, reminder: P = 0.0043, no-reminder: P = 0.0037). Next, we examined the mean SCR for CS- for the 30min, 6h and 24h groups in study 2 and found that there was indeed a group difference (one-way ANOVA, F<sub>2,76</sub> = 5.3462, P = 0.0067, panel b), suggesting that the CS- related SCR was influenced by the test time (30min, 6h or 24h). We also tested the CS- related SCR for the 4 groups in study 3 (where the test was conducted 1 hour after the retrieval-extinction training) and found that, across TMS stimulation types (PFC vs. VER) and reminder types (reminder vs. no-reminder), the ANOVA yielded neither a main effect of TMS stimulation type (F<sub>1,71</sub> = 0.322, P = 0.572) nor a main effect of reminder type (F<sub>1,71</sub> = 0.0499, P = 0.824, panel c). We added the R-VER group results in study 3 (see panel c) to panel b, plotted the CS- SCR difference across 4 different test time points, and found that CS- SCR decreased as the test-extinction delay increased (Jonckheere-Terpstra test, P = 0.00028). These results suggest a natural “forgetting” tendency for CS- related SCR and highlight the importance of having the CS- as a control condition against which the CS+ related SCR is compared.

      Author response image 1.

      (6D) The 6 hour group was clearly tested at a different time of day compared to the 30 min and 24 hour groups. This could have influenced the SCRs in this group and, thereby, contributed to the pattern of results obtained.

      Again, we answered this question in our previous response. Please see the following for our previous response:

      For the 30min and 24h groups, the test phase can be arranged in the morning, in the afternoon or at night. However, for the 6h group, the test phase was inevitably in the afternoon or at night since we wanted to exclude the potential influence of night sleep on the expression of fear memory (see Author response table 1 below). If we restricted the test time in the afternoon or at night for all three groups, then the timing of their extinction training was not matched.

      Author response table 1.

      Nevertheless, we also went back and examined the data for the subjects tested only in the afternoon or at night in the 30min and 24h groups, to match the 6h group, where all the subjects were tested either in the afternoon or at night. According to the table above, we have 17 subjects for the 30min group (9 + 8), 18 subjects for the 24h group (9 + 9) and 26 subjects for the 6h group (12 + 14). As Author response image 2 shows, the SCR patterns in the fear acquisition, extinction and test phases were similar to the results presented in the original figure.

      Author response image 2.

      (6E) The authors find different patterns of responses to CS1 and CS2 when they were tested 30 min after extinction versus 24 h after extinction. On this basis, they infer distinct memory update mechanisms. However, I still can't quite see why the different patterns of responses at these two time points after extinction need to be taken to infer different memory update mechanisms. That is, the different patterns of responses at the two time points could be indicative of the same "memory update mechanism" in the sense that the retrieval-extinction procedure induces a short-term memory suppression that serves as the basis for the longer-term memory suppression (i.e., the reconsolidation effect). My pushback on this point is based on the notion of what constitutes a memory update mechanism; and is motivated by what I take to be a rather loose use of language/terminology in the reconsolidation literature and this paper specifically (for examples, see the title of the paper and line 2 of the abstract).

      As we mentioned previously, the term “mechanism” might have different connotations for different researchers. We aim to report a novel memory deficit following the retrieval-extinction paradigm, which differed significantly from the purported reconsolidation-related long-term fear amnesia in terms of its timescale, cue-specificity and thought-control ability dependence. A further TMS study confirmed that intact dlPFC function is necessary for the short-term memory deficit. It is on the basis of these results that we proposed that the short-term fear amnesia might be related to a different cognitive “mechanism”. As mentioned above, we now clarify what we mean by “mechanism” in the abstract and introduction (lines 31-34, 97-101).

      Reviewer #2 (Public review):

      The fear acquisition data is converted to a differential fear SCR and this is what is analysed (early vs late). However, the figure shows the raw SCR values for CS+ and CS- and therefore it is unclear whether acquisition was successful (despite there being an "early" vs "late" effect - no descriptives are provided).

      (1) There are still no descriptive statistics to substantiate learning in Experiment 1.

      We answered this question in our previous response letter. We are sorry that the definition of “early” and “late” trials was scattered across the manuscript. For example, we wrote “the late phase of acquisition (last 5 trials)” (Line 375-376) in the results section. Since there were 10 trials in total for the acquisition stage, we define the first 5 trials and the last 5 trials as the “early” and “late” phases of the acquisition stage, and we have added this definition explicitly where the terms “early” and “late” first appear (lines 316-318).

      In the results section, we did test whether the acquisition was successful in our previous manuscript (Line 316-325):

      “To assess fear acquisition across groups (Figure 1B and C), we conducted a mixed two-way ANOVA of group (reminder vs. no-reminder) x time (early vs. late part of the acquisition; first 5 and last 5 trials, correspondingly) on the differential fear SCR. Our results showed a significant main effect of time (early vs. late; F<sub>1,55</sub> = 6.545, P = 0.013, η<sup>2</sup> = 0.106), suggesting successful fear acquisition in both groups. There was no main effect of group (reminder vs. no-reminder) or the group x time interaction (group: F<sub>1,55</sub> = 0.057, P = 0.813, η<sup>2</sup> = 0.001; interaction: F<sub>1,55</sub> = 0.066, P = 0.798, η<sup>2</sup> = 0.001), indicating similar levels of fear acquisition between two groups. Post-hoc t-tests confirmed that the fear responses to the CS+ were significantly higher than that of CS- during the late part of acquisition phase in both groups (reminder group: t<sub>29</sub> = 6.642, P < 0.001; no-reminder group: t<sub>26</sub> = 8.522, P < 0.001; Figure 1C). Importantly, the levels of acquisition were equivalent in both groups (early acquisition: t<sub>55</sub> = -0.063, P = 0.950; late acquisition: t<sub>55</sub> = -0.318, P = 0.751; Figure 1C).”

      In Experiment 1 (Test results) it is unclear whether the main conclusion stems from a comparison of the test data relative to the last extinction trial ("we defined the fear recovery index as the SCR difference between the first test trial and the last extinction trial for a specific CS") or the difference relative to the CS- ("differential fear recovery index between CS+ and CS-"). It would help the reader assess the data if Fig 1e presents all the indexes (both CS+ and CS-). In addition, there is one sentence which I could not understand "there is no statistical difference between the differential fear recovery indexes between CS+ in the reminder and no reminder groups (P=0.048)". The p value suggests that there is a difference, yet it is not clear what is being compared here. Critically, any index taken as a difference relative to the CS- can indicate recovery of fear to the CS+ or absence of discrimination relative to the CS-, so ideally the authors would want to directly compare responses to the CS+ in the reminder and no-reminder groups. In the absence of such comparison, little can be concluded, in particular if SCR CS- data is different between groups. The latter issue is particularly relevant in Experiment 2, in which the CS- seems to vary between groups during the test and this can obscure the interpretation of the result.

      (2) In the revised analyses, the authors now show that CS- changes in different groups (for example, Experiment 2) so this means that there is little to conclude from the differential scores because these depend on CS-. It is unclear whether the effects arise from CS+ performance or the differential which is subject to CS- variations.

      There was a typo in the “P = 0.048” sentence and we have corrected it in our last response letter. Also in the previous response letter, we specifically addressed how the fear recovery index was defined (also in the revised manuscript).

      In most fear conditioning studies, CS- trials were included as the baseline control. In turn, most of the analyses conducted also involved comparisons between different groups. Directly comparing CS+ trials across groups (or conditions) is rare. In our study 2, we showed that the CS- response decreased as a function of testing delays (30min, 1hr, 6hr and 24hr). Ideally, it would be nice to show that the CS- across groups/conditions did not change. However, even in those circumstances, comparisons are still based on the differential CS response (CS+ minus CS-), that is, the difference of difference. It is also important to note that the difference score matters because the CS+ response alone, within or across conditions, is difficult to interpret, especially in humans, due to noise, signal fluctuations, and irrelevant stimulus features; a trial-wise reference is therefore essential to assess the CS+ in the context of a reference stimulus in each trial (after all, the baselines are different). We list a few influential papers in the field in which the CS- responses were not equivalent across groups/conditions and argue that this is a routine procedure (Kindt & Soeter 2018 Figs. 2-3; Sevenster et al., 2013 Fig. 3; Liu et al., 2014 Fig. 1; Raio et al., 2017 Fig. 2).
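
      The two indices at issue can be made concrete with a small sketch. It uses the definition quoted above (fear recovery index = SCR on the first test trial minus SCR on the last extinction trial, computed per CS) and the differential index (CS+ index minus CS- index). The function names and all SCR values are hypothetical and chosen only for illustration; they are not data from the study.

```python
import math

def recovery_index(last_extinction_scr, first_test_scr):
    """Fear recovery index for one CS: first test trial minus last extinction trial."""
    return first_test_scr - last_extinction_scr

def differential_recovery_index(cs_plus, cs_minus):
    """Difference of differences: CS+ recovery index minus CS- recovery index.

    Each argument is a (last_extinction_scr, first_test_scr) pair.
    """
    return recovery_index(*cs_plus) - recovery_index(*cs_minus)

# Hypothetical case A: the CS+ response recovers while the CS- response is stable.
a = differential_recovery_index(cs_plus=(0.20, 0.40), cs_minus=(0.20, 0.20))

# Hypothetical case B: the CS+ response is stable while the CS- response drops.
b = differential_recovery_index(cs_plus=(0.20, 0.20), cs_minus=(0.20, 0.00))

# The two cases yield the same differential index.
same = math.isclose(a, b)
```

      In this toy case the differential index alone cannot separate a change in the CS+ response from a change in the CS- response, which is the ambiguity the reviewer raises. The trial-wise CS- reference is, in turn, what the response above argues is needed to absorb baseline differences.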

      In experiment 1, the findings suggest that there is a benefit of retrieval followed by extinction in a short-term reinstatement test. In Experiment 2, the same effect is observed to a cue which did not undergo retrieval before extinction (CS2+), a result that is interpreted as resulting from cue-independence, rather than a failure to replicate in a within-subjects design the observations of Experiment 1 (between-subjects). Although retrieval-induced forgetting is cue-independent (the effect on items that are suppressed [Rp-] can be observed with an independent probe), it is not clear that the current findings are similar, and thus that the strong parallels made are not warranted. Here, both cues have been extinguished and therefore been equally exposed during the critical stage.

      (3) The notion that suppression is automatic is speculative at best

      We responded to the same question in our previous revision. Please note that study 1 (the comparison between reminder and no-reminder groups), with only one CS+, was not set up to test the cue-independence hypothesis for the short-term amnesia. Results from both study 2 (30min condition) and study 3 confirmed the cue-independence hypothesis, and we therefore do not believe that the results from study 2 should be interpreted as “a failure to replicate in a within-subject design of the observations of Experiment 1”.

      We agree that the proposal of automatic or unconscious memory suppression is speculative, and that is why we mentioned it in the discussion. The timescale, cue-specificity and thought-control ability dependence of the short-term fear amnesia identified in our studies were reminiscent of the memory suppression effects reported in the previous literature. Memory suppression studies typically adopt a conscious “suppression” treatment (such as the think/no-think paradigm), which was absent in the current study. However, retrieval-induced forgetting (RIF), which is also considered a memory suppression paradigm via inhibitory control, does not require conscious effort to suppress any particular thought. Based on these results and extant literature, we raised the possibility of memory suppression as a potential mechanism. We make clear in the discussion that the suppression hypothesis and connections with RIF will require further evidence (lines 615-616):

      “future research will be needed to investigate whether the short-term effect we observed is specifically related to associative memory or the spontaneous nature of suppression as in RIF (Figure 6C).”

      (4) I still struggle with the parallels between these findings and the "limbo" literature. Here you manipulated the retention interval, whereas in the cited studies the number of extinction (exposure) trials was varied. These are two completely different phenomena.

      We borrowed the “limbo” term to stress the transitioning from short-term to long-term memory deficits (the 6hr test group). Merlo et al. (2014) found that memory reconsolidation and extinction were dissociable processes depending on the extent of memory retrieval. They argued that there was a “limbo” transitional state, where neither the reconsolidation nor the extinction process was engaged. Our results suggest that at the test delay of 6hr, neither the short-term nor the long-term effect was present, signaling a “transitional” state after which the short-term memory deficit wanes and the long-term deficit starts to take over. We make this idea more explicit as follows (lines 622-626):

      “These works identified important “boundary conditions” of memory retrieval in affecting the retention of the maladaptive emotional memories. In our study, however, we showed that even within a boundary condition previously thought to elicit memory reconsolidation, mnemonic processes other than reconsolidation could also be at work, and these processes jointly shape the persistence of fear memory.”

      (5) My point about the data problematic for the reconsolidation (and consolidation) frameworks is that they observed memory in the absence of the brain substrates that are needed for memory to be observed. The answer did not address this. I do not understand how the latent cause model can explain this, if the only difference is the first ITI. Wouldn't participants fail to integrate extinction with acquisition with a longer ITI?

      We take the sentence “they observed memory in the absence of the brain substrates that are needed for memory to be observed” as referring to the long-term memory deficit in our study. As we responded before, the aim of this manuscript was not about investigating the brain substrates involved in memory reconsolidation (or consolidation). Using a memory retrieval-extinction paradigm, we discovered a novel short-term memory effect, which differed from the purported reconsolidation effect in terms of timescale, cue-specificity and thought-control ability dependence. We further showed that both memory retrieval and intact dlPFC functions were necessary to observe the short-term memory deficit effect. Therefore, we conclude that the brain mechanism involved in such an effect should be different from the one related to the purported reconsolidation effect. We make this idea more explicit as follows (lines 546-547):

      “Therefore, findings of the short-term fear amnesia suggest that the reconsolidation framework falls short to accommodate this more immediate effect (Figure 6A and B).”

      Whilst I could access the data in the OSF site, I could not make sense of the Matlab files as there is no signposting indicating what data is being shown in the files. Thus, as it stands, there is no way of independently replicating the analyses reported.

      (6) The materials in the OSF site are the same as before, they haven't been updated.

      Last time we thought the main issue was that the OSF site was not publicly accessible, and we therefore made it open to all visitors. We have now added a descriptive file that explains the variables, to help visitors replicate the analyses we performed.

      (7) Concerning supplementary materials, the robustness tests are intended to prove that you 1) can get the same results by varying the statistical models or 2) you can get the same results when you include all participants. Here authors have done both so this does not help. Also, in the rebuttal letter, they stated "Please note we did not include non-learners in these analyses " which contradicts what is stated in the figure captions "(learners + non learners)"

      In the supplementary materials, we carried out the two robustness checks (varying the statistical model, and including both learners and non-learners) separately rather than in combination. In supplementary Figs. 1 & 2, we included all the participants (learners + non-learners), performed the same analysis as in the main text and found similar results. In the text of the supplementary material, we applied a different statistical analysis method to learners only (the subjects reported in the main text) and obtained similar results. We believe this is exactly what the reviewer suggested we do. There also seems to be a misunderstanding of the "Please note we did not include non-learners in these analyses" sentence in the rebuttal letter. As the reviewer can see, the full sentence read “Please note we did not include non-learners in these analyses (the texts of the supplementary materials)”. We meant to express that the figures and text in the supplementary material reflect two approaches: 1) figures depicting re-analysis with all the included subjects (learners + non-learners); 2) text describing a different analysis with learners only. We added clarifications to emphasize these approaches in the supplementary materials.

      (8) Finally, the literature suggesting that reconsolidation interference "eliminates" a memory is not substantiated by data nor in line with current theorising, so I invite a revision of these strong claims.

      We agree and have toned down the strong claims.

      Overall, I conclude that the revised manuscript did not address my main concerns.

      In both rounds of responses, we tried our best to address the reviewer’s concerns. We hope that the clarifications in this letter and revisions in the text address the remaining concerns. Thank you for your feedback.

      References:

      Kindt, M. and Soeter, M. 2018. Pharmacologically induced amnesia for learned fear is time and sleep dependent. Nat Commun, 9, 1316.

      Liu, J., Zhao, L., Xue, Y., Shi, J., Suo, L., Luo, Y., Chai, B., Yang, C., Fang, Q., Zhang, Y., Bao, Y., Pickens, C. L. and Lu, L. 2014. An unconditioned stimulus retrieval extinction procedure to prevent the return of fear memory. Biol Psychiatry, 76, 895-901.

      Raio, C. M., Hartley, C. A., Orederu, T. A., Li, J. and Phelps, E. A. 2017. Stress attenuates the flexible updating of aversive value. Proc Natl Acad Sci U S A, 114, 11241-11246.

      Sevenster, D., Beckers, T. and Kindt, M. 2013. Prediction error governs pharmacologically induced amnesia for learned fear. Science, 339, 830-833.

    1. Reviewer #3 (Public review):

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly informed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).
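
      For reference, the normative judgement described above can be sketched as a two-state Bayesian filter. The sketch assumes, beyond what is stated here, that each block starts in the mainly-red regime, that a shift can occur with a fixed probability `q` before each draw, that a shift is irreversible within the block, and that both urns share the same majority proportion `d`. The function name and the example values are illustrative only and are not taken from the manuscript.

```python
def posterior_shift(draws, q, d):
    """Posterior probability, after each draw, that the regime has shifted.

    draws: iterable of 'r' (red) or 'b' (blue) balls, in order
    q:     assumed per-draw probability of a shift (red urn -> blue urn)
    d:     assumed majority proportion, P(red | red urn) = P(blue | blue urn)
    """
    p_shift = 0.0  # assumed: no shift has happened before the first draw
    history = []
    for ball in draws:
        # Prediction step: a shift may occur before this draw and is absorbing.
        p_shift = p_shift + (1.0 - p_shift) * q
        # Likelihood of the observed ball under each regime.
        like_shifted = d if ball == "b" else 1.0 - d
        like_not_shifted = 1.0 - d if ball == "b" else d
        # Bayes update.
        numerator = p_shift * like_shifted
        p_shift = numerator / (numerator + (1.0 - p_shift) * like_not_shifted)
        history.append(p_shift)
    return history

# Example: five red balls followed by five blue balls.
beliefs = posterior_shift("rrrrrbbbbb", q=0.05, d=0.9)
```

      Under these assumptions the posterior depends jointly on `q` and `d`, so an observer who neglects differences in either parameter will report beliefs that vary less across environments than this filter does.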

      The key behavioural finding is that participants over-estimate the prior probability of regime change when it is low, and under-estimate it when it is high; and participants over-estimate the strength of evidence when it is low and under-estimate it when it is high. In other words, participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual-difference effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      Strengths

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies.

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies.

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Weaknesses

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).
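
      The 'attraction to the mean' account can be illustrated with a short sketch. It assumes a purely linear shrinkage rule, subjective = w × instructed + (1 − w) × mean, where the mean is taken over the three instructed values; this rule, the function names and the least-squares estimate of `w` are illustrative assumptions and are not the model fitted in the manuscript. Only the instructed and subjective values come from the text above.

```python
instructed = [0.01, 0.05, 0.10]     # ground-truth prior probabilities of a regime shift
subjective = [0.037, 0.052, 0.069]  # subjective values quoted above

def shrinkage_weight(truth, reported):
    """Least-squares weight w in: reported = w * truth + (1 - w) * mean(truth)."""
    mean = sum(truth) / len(truth)
    num = sum((t - mean) * (r - mean) for t, r in zip(truth, reported))
    den = sum((t - mean) ** 2 for t in truth)
    return num / den

def shrink(value, mean, w):
    """Pull a value towards the mean; w = 1 means no shrinkage, w = 0 full shrinkage."""
    return w * value + (1.0 - w) * mean

mean_prior = sum(instructed) / len(instructed)
w = shrinkage_weight(instructed, subjective)
predicted = [shrink(t, mean_prior, w) for t in instructed]
```

      With these numbers the fitted weight comes out near 0.35, and the shrunken values lie within about 0.001 of the quoted subjective values. A single shrinkage parameter can therefore reproduce the pattern, which is why this account is hard to separate from 'system neglect' on these data alone.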

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers, resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020).

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, doesn't Pt always increase with sample number (as by the time of later samples, there have been more opportunities for a regime shift)? Unless this is completely linear, the effect won't be controlled by including trial number as a co-regressor (which was done).
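
      The nonlinearity concern can be illustrated under one simple assumption that is not stated in the text: if a shift can occur with a fixed probability `q` before each sample and is irreversible within the block, then the probability that a shift has occurred by sample `t`, before any evidence is considered, is 1 − (1 − q)^t. The function names below are hypothetical.

```python
def prior_shift_by(t, q):
    """Probability that a shift has occurred by sample t, ignoring the evidence.

    Assumes a constant per-sample shift probability q and an irreversible shift.
    """
    return 1.0 - (1.0 - q) ** t

def per_sample_increments(q, n_samples=10):
    """Increase in the evidence-free shift probability from one sample to the next."""
    probs = [prior_shift_by(t, q) for t in range(1, n_samples + 1)]
    return [b - a for a, b in zip([0.0] + probs[:-1], probs)]

# Increments for the three instructed transition probabilities.
increments = {q: per_sample_increments(q) for q in (0.01, 0.05, 0.10)}
```

      Under this assumption the increment at sample `t` is q(1 − q)^(t − 1), which shrinks with `t`, so the curve is concave rather than linear. The departure from linearity is small for q = 0.01 and more pronounced for q = 0.10, which is where a linear trial-number regressor would leave the most residual variance.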

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiments are useful controls. For example, in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

      Editors’ note: Reviewer #2 was unavailable to re-review the manuscript. Reviewer #3 was added for this round of review to ensure two reviewers and because of their expertise in the computational and modelling aspects of the work.

    2. Author response:

      The following is the authors’ response to the current reviews.

      eLife Assessment

      This study offers valuable insights into how humans detect and adapt to regime shifts, highlighting distinct contributions of the frontoparietal network and ventromedial prefrontal cortex to sensitivity to signal diagnosticity and transition probabilities. The combination of an innovative task design, behavioral modeling, and model-based fMRI analyses provides a solid foundation for the conclusions; however, the neuroimaging results have several limitations, particularly a potential confound between the posterior probability of a switch and the passage of time that may not be fully controlled by including trial number as a regressor. The control experiments intended to address this issue also appear conceptually inconsistent and, at the behavioral level, while informing participants of conditional probabilities rather than requiring learning is theoretically elegant, such information is difficult to apply accurately, as shown by well-documented challenges with conditional reasoning and base-rate neglect. Expressing these probabilities as natural frequencies rather than percentages may have improved comprehension. Overall, the study advances understanding of belief updating under uncertainty but would benefit from more intuitive probabilistic framing and stronger control of temporal confounds in future work.

      We thank the editors for the assessment. The editor added several limitations based on the new reviewer 3 in this round, which we address below.

      With regard to temporal confounds, we clarified in the main text and in our response to Reviewer 3 that we had already addressed the potential confound between the posterior probability of a switch and the passage of time in GLM-2 by including the intertemporal prior. After adding the intertemporal prior to the GLM, we still observed the same fMRI results on probability estimates. In addition, we performed two other robustness checks, which we mentioned in the manuscript.

      With regard to response mode (probability estimation rather than choice or indicating natural frequencies), we wish to point out that Massey and Wu (2005), on which the current study was based, addressed the concern that system-neglect tendencies might arise from the response mode, namely indicating beliefs by reporting probability estimates rather than through choice or another response mode. Massey and Wu (2005, Study 3) found the same biases when participants performed a choice task that did not require them to indicate probability estimates.

      With regard to the control experiments, they were in fact not intended to address the confound between posterior probability and the passage of time. Rather, they aimed to address whether the neural findings were unique to change detection (Experiment 2) and to address visual and motor confounds (Experiment 3). These aims and the results of the control experiments are described on pages 18-19.

      Finally, we wish to highlight that we had performed detailed model comparisons following reviewer 2’s suggestions. Although reviewer 2 was unable to re-review the manuscript, we believe this provides insight into the literature on change detection. See “Incorporating signal dependency into system-neglect model led to better models for regime-shift detection” (p.27-30). The model comparison showed that system-neglect models that incorporate signal dependency describe participants’ probability estimates better than the original system-neglect model. This suggests that people respond to change-consistent and change-inconsistent signals differently when judging whether the regime had changed. This was not reported in previous behavioral studies and was largely inspired by the neural finding on signal dependency in the frontoparietal cortex. It indicates that neural findings can provide novel insights into computational modeling of behavior.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      - The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experiences. The authors discuss these differences comprehensively.

      - The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well. The model is comprehensively validated.

      - The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      We thank the reviewer for the comments.

      Weaknesses:

      The authors have adequately addressed most of my prior concerns.

      We thank the reviewer for recognizing our effort in addressing these concerns.

      My only remaining comment concerns the z-test of the correlations. I agree with the non-parametric test based on bootstrapping at the subject level, providing evidence for significant differences in correlations within the left IFG and IPS.

      However, the parametric test seems inadequate to me. The equation presented is described as the Fisher z-test, but the numerator uses the raw correlation coefficients (r) rather than the Fisher-transformed values (z). To my understanding, the subtraction should involve the Fisher z-scores, not the raw correlations.

      More importantly, the Fisher z-test in its standard form assumes that the correlations come from independent samples, as reflected in the denominator (which uses the n of each independent sample). However, in my opinion, the two correlations are not independent but computed within-subject. In such cases, parametric tests should take into account the dependency. I believe one appropriate method for the current case (correlated correlation coefficients sharing a variable [behavioral slope]) is explained here:

      Meng, X.-l., Rosenthal, R., & Rubin, D. B. (1992). Comparing correlated correlation coefficients. Psychological Bulletin, 111(1), 172-175. https://doi.org/10.1037/0033-2909.111.1.172

      It should be implemented here:

      Diedenhofen B, Musch J (2015) cocor: A Comprehensive Solution for the Statistical Comparison of Correlations. PLoS ONE 10(4): e0121945. https://doi.org/10.1371/journal.pone.0121945

      My recommendation is to verify whether my assumptions hold, and if so, perform a test that takes correlated correlations into account. Or, to focus exclusively on the non-parametric test.

      In any case, I recommend a short discussion of these findings and how the authors interpret that some of the differences in correlations are not significant.

      Thank you for the careful check. Yes, this was indeed a mistake on our part. We also agree that the two correlations are not independent. Therefore, we modified the test to account for dependent correlations, following Meng et al. (1992) as suggested by the reviewer.

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. To statistically compare these two correlations, we adopted the approach of Meng et al. (1992), which specifically tests differences between dependent correlations according to the following equation

      $$z = \left(z_{r_1} - z_{r_2}\right)\sqrt{\frac{N - 3}{2\,(1 - r_x)\,h}}$$

      where 𝑁 is the number of subjects, 𝑧<sub>𝑟𝑖</sub> is the Fisher z-transformed value of 𝑟<sub>𝑖</sub>, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. 𝑟<sub>𝑥</sub> is the correlation between the neural sensitivity at change-consistent signals and change-inconsistent signals. The term 𝑕 is given by

      $$h = \frac{1 - f\,\overline{r^2}}{1 - \overline{r^2}}, \qquad f = \frac{1 - r_x}{2\,\left(1 - \overline{r^2}\right)}$$

      where $\overline{r^2}$ is the mean of the 𝑟<sub>𝑖</sub><sup>2</sup>, and 𝑓 should be set to 1 if 𝑓 > 1.

      We found that, among the five ROIs in the frontoparietal network, the difference in correlation was significant in two, namely the left IFG and left IPS (one-tailed z test; left IFG: 𝑧 = 1.8908, 𝑝 = 0.0293; left IPS: 𝑧 = 2.2584, 𝑝 = 0.0049). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.9522, 𝑝 = 0.1705; right IFG: 𝑧 = 0.9860, 𝑝 = 0.1621; right IPS: 𝑧 = 1.4833, 𝑝 = 0.0690). We chose a one-tailed test because we already knew that the correlation under the blue signals was significantly greater than 0. These updated results are consistent with the nonparametric tests we had already performed, and we will update them in the revised manuscript.
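To make the dependent-correlation test concrete, the following is a minimal Python sketch of the Meng, Rosenthal and Rubin (1992) z test in its standard published form. The function name `meng_z_test` and its argument names are hypothetical, and the one-tailed p-value uses a normal approximation. It is not the authors' code and is not expected to reproduce the values reported above without their data.

```python
import math

def meng_z_test(r1, r2, rx, n):
    """Compare two dependent correlations that share one variable.

    r1, r2 : correlations of the shared variable with each of the two others
    rx     : correlation between the two non-shared variables
    n      : number of subjects
    Returns the z statistic and a one-tailed p-value for H1: r1 > r2.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)         # Fisher z-transform
    r2_bar = (r1 ** 2 + r2 ** 2) / 2.0              # mean of squared correlations
    f = min((1.0 - rx) / (2.0 * (1.0 - r2_bar)), 1.0)  # f is capped at 1
    h = (1.0 - f * r2_bar) / (1.0 - r2_bar)
    z = (z1 - z2) * math.sqrt((n - 3) / (2.0 * (1.0 - rx) * h))
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal prob.
    return z, p_one_tailed
```

Calling `meng_z_test(r_blue, r_red, r_x, n_subjects)` would give the statistic for one ROI. The test is undefined when `rx` equals 1, which the sketch does not guard against.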

      Reviewer #3 (Public review):

      This study concerns how observers (human participants) detect changes in the statistics of their environment, termed regime shifts. To make this concrete, a series of 10 balls are drawn from an urn that contains mainly red or mainly blue balls. If there is a regime shift, the urn is changed over (from mainly red to mainly blue) at some point in the 10 trials. Participants report their belief that there has been a regime shift as a % probability. Their judgement should (mathematically) depend on the prior probability of a regime shift (which is set at one of three levels) and the strength of evidence (also one of three levels, operationalized as the proportion of red balls in the mostly-blue urn and vice versa). Participants are directly instructed of the prior probability of regime shift and proportion of red balls, which are presented on-screen as numerical probabilities. The task therefore differs from most previous work on this question in that probabilities are instructed rather than learned by observation, and beliefs are reported as numerical probabilities rather than being inferred from participants' choice behaviour (as in many bandit tasks, such as Behrens 2007 Nature Neurosci).

      The key behavioural finding is that participants over-estimate the prior probability of regime change when it is low, and under estimate it when it is high; and participants over-estimate the strength of evidence when it is low and under-estimate it when it is high. In other words participants make much less distinction between the different generative environments than an optimal observer would. This is termed 'system neglect'. A neuroeconomic-style mathematical model is presented and fit to data.

      Functional MRI results show that strength of evidence for a regime shift (roughly, the surprise associated with a blue ball from an apparently red urn) is associated with activity in the frontal-parietal orienting network. Meanwhile, at time-points where the probability of a regime shift is high, there is activity in another network including vmPFC. Both networks show individual differences effects, such that people who were more sensitive to strength of evidence and prior probability show more activity in the frontal-parietal and vmPFC-linked networks respectively.

      We thank the reviewer for the overall descriptions of the manuscript.

      Strengths:

      (1) The study provides a different task for looking at change-detection and how this depends on estimates of environmental volatility and sensory evidence strength, in which participants are directly and precisely informed of the environmental volatility and sensory evidence strength rather than inferring them through observation as in most previous studies.

      (2) Participants directly provide belief estimates as probabilities rather than experimenters inferring them from choice behaviour as in most previous studies.

      (3) The results are consistent with well-established findings that surprising sensory events activate the frontal-parietal orienting network whilst updating of beliefs about the world ('regime shift') activates vmPFC.

      Thank you for these assessments.

      Weaknesses:

      (1) The use of numerical probabilities (both to describe the environments to participants, and for participants to report their beliefs) may be problematic because people are notoriously bad at interpreting probabilities presented in this way, and show poor ability to reason with this information (see Kahneman's classic work on probabilistic reasoning, and how it can be improved by using natural frequencies). Therefore the fact that, in the present study, people do not fully use this information, or use it inaccurately, may reflect the mode of information delivery.

      We appreciate the reviewer’s concern on this issue. The concern was addressed in Massey and Wu (2005), in which participants performed a choice task that did not ask them to provide probability estimates (Study 3 in Massey and Wu, 2005). Instead, participants in Study 3 were asked to predict the color of the ball before seeing a signal. This was a more intuitive way for participants to indicate their belief about a regime shift. The results from the choice task were identical to those found in the probability estimation task (Study 1 in Massey and Wu, 2005). We take this as evidence that the system-neglect behavior the participants showed was less likely to be due to the mode of information delivery.

      (2) Although a very precise model of 'system neglect' is presented, many other models could fit the data.

      For example, you would get similar effects due to attraction of parameter estimates towards a global mean - essentially application of a hyper-prior in which the parameters applied by each participant in each block are attracted towards the experiment-wise mean values of these parameters. For example, the prior probability of regime shift ground-truth values [0.01, 0.05, 0.10] are mapped to subjective values of [0.037, 0.052, 0.069]; this would occur if observers apply a hyper-prior that the probability of regime shift is about 0.05 (the average value over all blocks). This 'attraction to the mean' is a well-established phenomenon and cannot be ruled out with the current data (I suppose you could rule it out by comparing to another dataset in which the mean ground-truth value was different).

      We thank the reviewer for this comment. It is true that the system-neglect model is not entirely inconsistent with regression to the mean, regardless of whether the implementation has a hyper-prior or not. In fact, our behavioral measure of sensitivity to transition probability and signal diagnosticity, which we termed the behavioral slope, is based on linear regression analysis. In general, the modeling approach in this paper is to start from a generative model that defines ideal performance and to consider modifying the generative model when systematic deviations of actual performance from the ideal are observed. In this approach, a generative model with a hyper-prior would be more complex to begin with, and a regression-to-the-mean idea by itself does not generate a priori predictions.

      More generally, any model in which participants don't fully use the numerical information they were given would produce apparent 'system neglect'. Four qualitatively different example reasons are: 1. Some individual participants completely ignored the probability values given. 2. Participants did not ignore the probability values given, but combined them with a hyperprior as above. 3. Participants had a reporting bias where their reported beliefs that a regime-change had occurred tend to be shifted towards 50% (rather than reporting 'confident' values such as 5% or 95%). 4. Participants underweighted probability outliers, resulting in underweighting of evidence in the 'high signal diagnosticity' environment (10.1016/j.neuron.2014.01.020).

      In summary I agree that any model that fits the data would have to capture the idea that participants don't differentiate between the different environments as much as they should, but I think there are a number of qualitatively different reasons why they might do this - of which the above are only examples - hence I find it problematic that the authors present the behaviour as evidence for one extremely specific model.

      Thank you for raising this point. The modeling principle we adopt is the following. We start from the normative model—the Bayesian model—that defined what normative behavior should look like. We compared participants’ behavior with the Bayesian model and found systematic deviations from it. To explain those systematic deviations, we considered modeling options within the confines of the same modeling framework. In other words, we considered a parameterized version of the Bayesian model, which is the system-neglect model and examined through model comparison the best modeling choice. This modeling approach is not uncommon, and many would agree this is the standard approach in economics and psychology. For example, Kahneman and Tversky adopted this approach when proposing prospect theory, a modification of expected utility theory where expected utility theory can be seen as one specific model for how utility of an option should be computed.

      (3) Despite efforts to control confounds in the fMRI study, including two control experiments, I think some confounds remain.

      For example, a network of regions is presented as correlating with the cumulative probability that there has been a regime shift in this block of 10 samples (Pt). However, regardless of the exact samples shown, doesn't Pt always increase with sample number (as by the time of later samples, there have been more opportunities for a regime shift)? Unless this is completely linear, the effect won't be controlled by including trial number as a co-regressor (which was done).

      Thank you for raising this concern. Yes, Pt always increases with sample number regardless of evidence (seeing change-consistent or change-inconsistent signals). This is captured by the ‘intertemporal prior’ in the Bayesian model, which we included as a regressor in our GLM analysis (GLM-2), in addition to Pt. In short, GLM-1 had Pt and sample number. GLM-2 had Pt, intertemporal prior, and sample number, among other regressors. And we found that, in both GLM-1 and GLM-2, both vmPFC and ventral striatum correlated with Pt.
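The reviewer's point about nonlinearity can be illustrated with a short Python sketch. It assumes that, with no evidence, the probability that a shift has occurred by sample t is 1 − (1 − q)^t for a per-sample transition probability q. That functional form is an assumption made for illustration, not the manuscript's definition of the intertemporal prior, and the function names are hypothetical. The sketch measures how far this curve departs from the best straight line in sample number, which is the part a linear sample-number regressor cannot absorb.

```python
import numpy as np

def prior_shift_by_sample(q, n_samples=10):
    """Probability that at least one shift occurred by each sample (no evidence)."""
    t = np.arange(1, n_samples + 1)
    return 1.0 - (1.0 - q) ** t

def max_linear_residual(q, n_samples=10):
    """Largest absolute deviation from the least-squares line in sample number."""
    t = np.arange(1, n_samples + 1, dtype=float)
    p = prior_shift_by_sample(q, n_samples)
    design = np.column_stack([np.ones_like(t), t])   # intercept + sample number
    coef, *_ = np.linalg.lstsq(design, p, rcond=None)
    return float(np.max(np.abs(p - design @ coef)))

# The three transition probabilities used in the task.
for q in (0.01, 0.05, 0.10):
    print(q, max_linear_residual(q))
```

Under this assumption the departure from linearity grows with q, so a separate regressor for the time-dependent prior removes variance that sample number alone would leave behind.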

      To make this clearer, we updated the main text on p.18:

      On the other hand, two additional fMRI experiments are done as control experiments and the effect of Pt in the main study is compared to Pt in these control experiments. Whilst I admire the effort in carrying out control studies, I can't understand how these particular experiments are useful controls. For example, in experiment 3 participants simply type in numbers presented on the screen - how can we even have an estimate of Pt from this task?

      We thank the reviewer for this comment. The purpose of Experiment 3 was to control for visual and motor confounds. In other words, if subjects saw a similar visual layout and were simply instructed to press numbers, would we observe activity in the vmPFC, ventral striatum, and frontoparietal network as we did in the main experiment (Experiment 1)?

      The purpose of Experiment 2 was to establish whether what we found about Pt was unique to change detection. In Experiment 2, subjects estimated the probability that the current regime is the blue regime (just as they did in Experiment 1) except that there were no regime shifts involved. That is, it is possible that the regions we identified were generally associated with probability estimation and not specifically with change detection. We used Experiment 2 to examine whether this was the case.

      (4) The Discussion is very long, and whilst a lot of related literature is cited, I found it hard to pin down within the discussion, what the key contributions of this study are. In my opinion it would be better to have a short but incisive discussion highlighting the advances in understanding that arise from the current study, rather than reviewing the field so broadly.

      Thank you. We received differing feedback from previous reviews on what to include in the Discussion. To address the reviewer’s concern, we will revise the Discussion to better highlight the key contributions of the current study at its beginning.

      Recommendations for the authors:

      Reviewer #3 (Recommendations for the authors):

      Many of the figures are too tiny - the writing is very small, as are the pictures of brains. I'd suggest adjusting these so they will be readable without enlarging.

      Thank you. We will enlarge the figures to make them more readable.


      The following is the authors’ response to the original reviews.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The study examines human biases in a regime-change task, in which participants have to report the probability of a regime change in the face of noisy data. The behavioral results indicate that humans display systematic biases, in particular, overreaction in stable but noisy environments and underreaction in volatile settings with more certain signals. fMRI results suggest that a frontoparietal brain network is selectively involved in representing subjective sensitivity to noise, while the vmPFC selectively represents sensitivity to the rate of change.

      Strengths:

      (1) The study relies on a task that measures regime-change detection primarily based on descriptive information about the noisiness and rate of change. This distinguishes the study from prior work using reversal-learning or change-point tasks in which participants are required to learn these parameters from experiences. The authors discuss these differences comprehensively.

      Thank you for recognizing our contribution to the regime-change detection literature and our effort in discussing our findings in relation to the experience-based paradigms.

      (2) The study uses a simple Bayes-optimal model combined with model fitting, which seems to describe the data well.

      Thank you for recognizing the contribution of our Bayesian framework and system-neglect model.

      (3) The authors apply model-based fMRI analyses that provide a close link to behavioral results, offering an elegant way to examine individual biases.

      Thank you for recognizing our execution of model-based fMRI analyses and effort in using those analyses to link with behavioral biases.

      Weaknesses:

      My major concern is about the correlational analysis in the section "Under- and overreactions are associated with selectivity and sensitivity of neural responses to system parameters", shown in Figures 5c and d (and similarly in Figure 6). The authors argue that a frontoparietal network selectively represents sensitivity to signal diagnosticity, while the vmPFC selectively represents transition probabilities. This claim is based on separate correlational analyses for red and blue across different brain areas. The authors interpret the finding of a significant correlation in one case (blue) and an insignificant correlation (red) as evidence of a difference in correlations (between blue and red) but don't test this directly. This has been referred to as the "interaction fallacy" (Nieuwenhuis et al., 2011; Makin & Orban de Xivry 2019). Not directly testing the difference in correlations (but only the differences to zero for each case) can lead to wrong conclusions. For example, in Figure 5c, the correlation for red is r = 0.32 (not significantly different from zero) and the correlation for blue is r = 0.48 (different from zero). However, the difference between the two is only about 0.16, and it is likely that this difference itself is not significant. From a statistical perspective, this corresponds to an interaction effect that has to be tested directly. It is my understanding that analyses in Figure 6 follow the same approach.

      Relevant literature on this point is:

      Nieuwenhuis, S, Forstmann, B & Wagenmakers, EJ (2011). Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci 14, 1105-1107. https://doi.org/10.1038/nn.2886

      Makin TR, Orban de Xivry, JJ (2019). Science Forum: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript. eLife 8:e48175. https://doi.org/10.7554/eLife.48175

      There is also a blog post on simulation-based comparisons, which the authors could check out: https://garstats.wordpress.com/2017/03/01/comp2dcorr/

      I recommend that the authors carefully consider what approach works best for their purposes. It is sometimes recommended to directly compare correlations based on Monte-Carlo simulations (cf Makin & Orban). It might also be appropriate to run a regression with the dependent variable brain activity (Y) and predictors brain area (X) and the model-based term of interest (Z). In this case, they could include an interaction term in the model:

      Y = \beta_0 + \beta_1 \cdot X + \beta_2 \cdot Z + \beta_3 \cdot X \cdot Z

      The interaction term reflects if the relationship between the model term Z and brain activity Y is conditional on the brain area of interest X.
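As an illustration of the regression the reviewer describes, here is a minimal Python sketch that fits Y = β0 + β1·X + β2·Z + β3·X·Z by ordinary least squares. The function name `fit_interaction` and the coding of X as a 0/1 indicator are assumptions made for illustration. The sketch returns point estimates only and does not compute the standard errors needed to test β3.

```python
import numpy as np

def fit_interaction(y, x, z):
    """Least-squares fit of y = b0 + b1*x + b2*z + b3*x*z.

    y : dependent variable (e.g., brain activity)
    x : predictor coded as 0/1 (e.g., brain area or signal type)
    z : model-based term of interest
    Returns the coefficient vector [b0, b1, b2, b3].
    """
    y, x, z = (np.asarray(a, dtype=float) for a in (y, x, z))
    design = np.column_stack([np.ones_like(x), x, z, x * z])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta
```

A nonzero b3 indicates that the slope of Y on Z differs between the two levels of X, which is the directly tested interaction that separate per-condition correlations do not provide.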

      Thank you for the suggestion. In response, we tested for the difference in correlation both parametrically and nonparametrically. The results were identical. In the parametric test, we used the Fisher z transformation to transform the difference in correlation coefficients to the z statistic. That is, for two correlation coefficients, 𝑟<sub>1</sub> (with sample size 𝑛<sub>1</sub>) and 𝑟<sub>2</sub> (with sample size 𝑛<sub>2</sub>), the z statistic of the difference in correlation is given by

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. For the Fisher z transformation, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. We found that, among the five ROIs in the frontoparietal network, the difference in correlation was significant in two, namely the left IFG and left IPS (one-tailed z test; left IFG: 𝑧 = 1.8355, 𝑝 = 0.0332; left IPS: 𝑧 = 2.3782, 𝑝 = 0.0087). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.7594, 𝑝 = 0.2238; right IFG: 𝑧 = 0.9068, 𝑝 = 0.1822; right IPS: 𝑧 = 1.3764, 𝑝 = 0.0843). We chose a one-tailed test because we already knew that the correlation under the blue signals was significantly greater than 0.

      In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation (Efron & Tibshirani, 1994). We resampled the dataset with replacement (subject-wise) and used the resampled dataset to compute the difference in correlation. We repeated this 100,000 times to estimate the distribution of the difference in correlation coefficients, then tested for significance and estimated the p-value based on this distribution. Consistent with our parametric tests, here we also found that the difference in correlation was significant in left IFG and left IPS (left IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.46, 𝑝 = 0.0496; left IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.5306, 𝑝 = 0.0041), but was not significant in dmPFC, right IFG, and right IPS (dmPFC: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.1634, 𝑝 = 0.1919; right IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.2123, 𝑝 = 0.1681; right IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.3434, 𝑝 = 0.0631).

      In summary, we found that neural sensitivity to signal diagnosticity in the frontoparietal network measured at change-consistent signals significantly correlated with individual subjects’ behavioral sensitivity to signal diagnosticity (𝑟<sub>𝑏𝑙𝑢𝑒</sub>). By contrast, neural sensitivity to signal diagnosticity measured at change-inconsistent signals did not significantly correlate with behavioral sensitivity (𝑟<sub>𝑟𝑒𝑑</sub>). The difference in correlation, 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub>, however, was statistically significant in some (left IPS and left IFG) but not all brain regions within the frontoparietal network.

      To incorporate these updates, we added descriptions of the methods and results in the revised manuscript. In the Results section (p.26-27):

      “We further tested, for each brain region, whether the difference in correlation was significant using both parametric and nonparametric tests (see Parametric and nonparametric tests for difference in correlation coefficients in Methods). The results were identical. In the parametric test, we used the Fisher 𝑧 transformation to transform the difference in correlation coefficients to the 𝑧 statistic. We found that among the five ROIs in the frontoparietal network, two of them, namely the left IFG and left IPS, the difference in correlation was significant (one-tailed z test; left IFG: 𝑧 = 1.8355, 𝑝 = 0.0332; left IPS: 𝑧 = 2.3782, 𝑝 = 0.0087). For the remaining three ROIs, the difference in correlation was not significant (dmPFC: 𝑧 = 0.7594, 𝑝 = 0.2238; right IFG: 𝑧 = 0.9068, 𝑝 = 0.1822; right IPS: 𝑧 = 1.3764, 𝑝 = 0.0843). We chose one-tailed test because we already know the correlation under change-consistent signals was significantly greater than 0. In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation. We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. Consistent with the parametric tests, we also found that the difference in correlation was significant in left IFG and left IPS (left IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.46, 𝑝 = 0.0496; left IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.5306, 𝑝 = 0.0041), but was not significant in dmPFC, right IFG, and right IPS (dmPFC: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.1634, 𝑝 = 0.1919; right IFG: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.2123, 𝑝 = 0.1681; right IPS: 𝑟<sub>𝑏𝑙𝑢𝑒</sub> − 𝑟<sub>𝑟𝑒𝑑</sub> = 0.3434, 𝑝 = 0.0631).
In summary, we found that neural sensitivity to signal diagnosticity measured at change-consistent signals significantly correlated with individual subjects’ behavioral sensitivity to signal diagnosticity. By contrast, neural sensitivity to signal diagnosticity measured at change-inconsistent signals did not significantly correlate with behavioral sensitivity. The difference in correlation, however, was statistically significant in some (left IPS and left IFG) but not all brain regions within the frontoparietal network.”

      In the Methods section, we added on p.53:

      “Parametric and nonparametric tests for difference in correlation coefficients. We implemented both parametric and nonparametric tests to examine whether the difference in Pearson correlation coefficients was significant. In the parametric test, we used the Fisher 𝑧 transformation to transform the difference in correlation coefficients to the 𝑧 statistic. That is, for two correlation coefficients, 𝑟<sub>1</sub> (with sample size 𝑛<sub>1</sub>) and 𝑟<sub>2</sub> (with sample size 𝑛<sub>2</sub>), the 𝑧 statistic of the difference in correlation is given by

      We referred to the correlation between neural and behavioral sensitivity at change-consistent (blue balls) signals as 𝑟<sub>𝑏𝑙𝑢𝑒</sub>, and that at change-inconsistent (red balls) signals as 𝑟<sub>𝑟𝑒𝑑</sub>. For the Fisher 𝑧 transformation, 𝑟<sub>1</sub> = 𝑟<sub>𝑏𝑙𝑢𝑒</sub> and 𝑟<sub>2</sub> = 𝑟<sub>𝑟𝑒𝑑</sub>. In the nonparametric test, we performed nonparametric bootstrapping to test for the difference in correlation (Efron & Tibshirani, 1994). That is, we resampled the dataset with replacement (subject-wise) and used the resampled dataset to compute the difference in correlation. We then repeated this procedure 100,000 times to estimate the distribution of the difference in correlation coefficients, and tested for significance and estimated the p-value based on this distribution.”
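      The two tests described in the quoted Methods passage can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the textbook Fisher 𝑧 test for two correlations, 𝑧 = (atanh 𝑟<sub>1</sub> − atanh 𝑟<sub>2</sub>) / √(1/(𝑛<sub>1</sub> − 3) + 1/(𝑛<sub>2</sub> − 3)), and it assumes that the one-tailed bootstrap p-value is the share of resampled differences at or below zero; the passage does not spell out either detail. All function and variable names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_difference(r1, n1, r2, n2):
    """Textbook Fisher z test for r1 > r2 (assumed form, one-tailed)."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return z, p_one_tailed


def bootstrap_difference(behavior, neural_blue, neural_red, n_boot=100_000, seed=0):
    """Subject-wise bootstrap of r_blue - r_red.

    Each resample draws subjects with replacement and recomputes both
    correlations on the same resampled subjects.
    """
    rng = random.Random(seed)
    n = len(behavior)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        b = [behavior[i] for i in idx]
        r_blue = pearson_r(b, [neural_blue[i] for i in idx])
        r_red = pearson_r(b, [neural_red[i] for i in idx])
        diffs.append(r_blue - r_red)
    # Assumed p-value definition: share of resamples with no positive difference.
    p_value = sum(d <= 0 for d in diffs) / n_boot
    return diffs, p_value
```

      Note that the textbook Fisher test treats the two correlations as coming from independent samples, whereas the bootstrap resamples the same subjects for both and therefore preserves their dependence.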

      Another potential concern is that some important details about the parameter estimation for the system-neglect model are missing. In the respective section in the methods, the authors mention a nonlinear regression using Matlab's "fitnlm" function, but it remains unclear how the model was parameterized exactly. In particular, what are the properties of this nonlinear function, and what are the assumptions about the subject's motor noise? I could imagine that by using the built-in function, the assumption was that residuals are Gaussian and homoscedastic, but it is possible that the assumption of homoscedasticity is violated, and residuals are systematically larger around p=0.5 compared to p=0 and p=1. Relatedly, in the parameter recovery analyses, the authors assume different levels of motor noise. Are these values representative of empirical values?

      We thank the reviewer for this excellent point. The reviewer touched on model parameterization, assumption of noise, and parameter recovery analysis. We answered these questions point-by-point below.

      On how our model was parameterized

      We parameterized the model according to the system-neglect model in Eq. (2) and estimated the alpha parameter separately for each level of transition probability and the beta parameter separately for each level of signal diagnosticity. As a result, we had a total of 6 parameters (3 alpha and 3 beta parameters) in the model. The system-neglect model is then called by fitnlm so that these parameters can be estimated. The term ‘nonlinear’ regression in fitnlm refers to the fact that you can specify any model (in our case the system-neglect model) and estimate its parameters when calling this function. In our use of fitnlm, we assume that the noise is Gaussian and homoscedastic (the default option).

      On the assumptions about subject’s motor noise

      We actually never called the noise ‘motor’ because it can be estimation noise as well. In the context of fitnlm, we assume that the noise is Gaussian and homoscedastic.

      On the possibility that homoscedasticity is violated

      We take the reviewer’s point. In response, we separately estimated the residual standard deviation at different probability intervals ([0.0–0.2), [0.2–0.4), [0.4–0.6), [0.6–0.8), and [0.8–1.0]). The result is shown in the figure below. The black data points are the average residual standard deviation (across subjects) and the error bars are the standard error of the mean. The residual standard deviation is indeed heteroscedastic: it is smallest at a probability of 0.1, increases as probability increases, and reaches an asymptote at 0.5 (Fig. S4).

      To examine how this would affect model fitting (parameter estimation), we performed a parameter recovery analysis based on these empirically estimated, probability-dependent residual standard deviations. That is, we simulated subjects’ probability estimates using the system-neglect model, added heteroscedastic noise according to the empirical values, and then estimated the parameters of the system-neglect model. The recovered parameter estimates did not seem to be affected by the heteroscedasticity of the variance. The parameter recovery results were identical to those obtained when homoscedasticity was assumed. This suggests that although homoscedasticity was violated, the violation did not affect the accuracy of the parameter estimates (Fig. S4).
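      The binning and noise-injection steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: residuals are binned by the model-predicted probability (the response does not say whether predicted or reported probability was used), noisy simulated estimates are clipped to [0, 1], and all names are hypothetical.

```python
import random
import statistics

# Interval edges from the response: [0.0-0.2), [0.2-0.4), [0.4-0.6), [0.6-0.8), [0.8-1.0]
BIN_EDGES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
N_BINS = len(BIN_EDGES) - 1


def bin_index(p):
    """Index of the interval containing p; the last interval includes 1.0."""
    for i in range(1, N_BINS):
        if p < BIN_EDGES[i]:
            return i - 1
    return N_BINS - 1


def binned_residual_sd(predicted, observed):
    """Residual standard deviation per probability interval (NaN if too few points)."""
    residuals = [[] for _ in range(N_BINS)]
    for p, o in zip(predicted, observed):
        residuals[bin_index(p)].append(o - p)
    return [statistics.stdev(r) if len(r) > 1 else float("nan") for r in residuals]


def add_heteroscedastic_noise(predicted, sd_per_bin, rng):
    """Simulate noisy estimates whose noise SD depends on the predicted probability.

    Clipping to [0, 1] is an assumption made here so that simulated
    estimates remain valid probabilities.
    """
    noisy = []
    for p in predicted:
        value = p + rng.gauss(0.0, sd_per_bin[bin_index(p)])
        noisy.append(min(1.0, max(0.0, value)))
    return noisy
```

      In a parameter recovery run, the output of `add_heteroscedastic_noise` would be refit with the model and the recovered parameters compared with the ones used for simulation.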

      We added a section ‘Impact of noise homoscedasticity on parameter estimation’ in Methods section (p.47-48) and a figure in the supplement (Fig. S4) to describe this:

      On whether the noise levels in parameter recovery analysis are representative of empirical values

      To address the reviewer’s question, we conducted a new analysis using maximum likelihood estimation to simultaneously estimate the parameters of the system-neglect model and the noise level of each individual subject. To estimate each subject’s noise level, we incorporated a noise parameter into the system-neglect model. We assumed that probability estimates are noisy and modeled them with a Gaussian distribution where the noise parameter (𝜎<sub>noise</sub>) is the standard deviation. At each period, a probability estimate of regime shift was computed according to the system-neglect model, where Θ is the set of parameters including the parameters of the system-neglect model and the noise parameter. The likelihood function, 𝐿(Θ), is the probability of observing the subject’s actual probability estimate at period 𝑡, 𝑝<sub>𝑡</sub>, given Θ: 𝐿(Θ) = 𝑃(𝑝<sub>𝑡</sub>|Θ). Since we modeled the noisy probability estimates with a Gaussian distribution, we can express 𝐿(Θ) as 𝐿(Θ) ~ 𝑁(𝑝<sub>𝑡</sub>; 𝑝<sub>𝑡</sub><sup>SN</sup>, 𝜎<sub>noise</sub>), where 𝑝<sub>𝑡</sub><sup>SN</sup> is the probability estimate predicted by the system-neglect (SN) model at period 𝑡. As a reminder, we referred to a ‘period’ as the time when a new signal appeared during a trial (for a given transition probability and signal diagnosticity). To find the maximum likelihood estimate, Θ<sub>MLE</sub>, we summed the negative natural logarithm of the likelihood over all periods and used MATLAB’s fmincon function to find Θ<sub>MLE</sub>. Across subjects, we found that the mean noise estimate was 0.1735 and ranged from 0.1118 to 0.2704 (Supplementary Figure S3).
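      The objective described above, the summed negative log-likelihood of the observed probability estimates under a Gaussian centered on the model prediction, can be sketched as follows. This is not the authors' MATLAB code. The predictions are passed in as a plain list so that the sketch does not have to assume the form of the system-neglect model, and all names are hypothetical. For fixed predictions, the noise level that minimizes this objective has a closed form, which is shown for reference; in the full analysis an optimizer would instead search over all of Θ jointly.

```python
import math


def gaussian_nll(observed, predicted, sigma):
    """Summed negative log-likelihood of the observed estimates.

    Each observed estimate is modeled as Gaussian with mean equal to the
    model prediction for that period and standard deviation sigma.
    """
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(sigma) + 0.5 * n * math.log(2.0 * math.pi) + sse / (2.0 * sigma ** 2)


def sigma_mle(observed, predicted):
    """Closed-form maximum likelihood noise SD for fixed model predictions."""
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sse / n)
```

      A joint fit would recompute `predicted` from the model parameters on every optimizer step and pass the resulting value of `gaussian_nll` back as the quantity to minimize.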

      Compared with our original parameter recovery analysis, where the maximum noise level was set at 0.1, our data indicated that some subjects’ noise was larger than this value. Therefore, we expanded our parameter recovery analysis to include noise levels beyond 0.1, up to 0.3. The results are now updated in Supplementary Fig. S3.

      We updated the parameter recovery section (p. 47) in Methods:

      The main study is based on N=30 subjects, as are the two control studies. Since this work is about individual differences (in particular with respect to neural representations of noise and transition probabilities in the frontoparietal network and the vmPFC), I'm wondering how robust the results are. Is it likely that the results would replicate with a larger number of subjects? Can the two control studies be leveraged to address this concern to some extent?

      We can address the issue of robustness by looking at effect sizes. For individual differences in neural sensitivity to transition probability and signal diagnosticity, the significant correlation coefficients between neural and behavioral sensitivity were between 0.4 and 0.58 for signal diagnosticity in the frontoparietal network (Fig. 5C), and between -0.38 and -0.37 for transition probability in vmPFC (Fig. 5D). These correlation coefficients correspond to medium-to-large effect sizes (Cohen, 1992).

      It would be challenging to use the control studies to address the robustness concern. The two control studies did not allow us to examine individual differences, in particular with respect to neural selectivity of noise and transition probability, and we therefore think they are unlikely to help. Having said that, it is possible to look at neural selectivity of noise (signal diagnosticity) in the first control experiment, where subjects estimated the probability of the blue regime in a task where there was no regime change (transition probability was 0). However, the fact that there were no regime shifts changed the nature of the task. Instead of always starting in the red regime, as in the main experiment, in the first control experiment we randomly picked the regime to draw the signals from. This also changed the meaning and the dynamics of the signals (red and blue) that would appear. In the main experiment the blue signal is a signal consistent with change, but in the control experiment this is no longer the case. In the main experiment, the frequency of blue signals is contingent upon both noise and transition probability. In general, blue signals are less frequent than red signals because of small transition probabilities. But in the first control experiment, blue signals may not be less frequent than red signals because the regime was blue in half of the trials. Due to these differences, we do not see how analyzing the control experiments could help in establishing robustness, because we do not have a good prediction as to whether and how the neural selectivity would be impacted by these differences.

      It seems that the authors have not counterbalanced the colors and that subjects always reported the probability of the blue regime. If so, I'm wondering why this was not counterbalanced.

      We are aware of the reviewer’s concern. There were three reasons we did not counterbalance the colors or the regime that subjects reported on (blue or red). First, we did not want to confuse the subjects in an already complicated task. Second, balancing these two variables comes at the cost of sample size. Third, although we could have counterbalanced at the between-subject level to avoid adding to the task complexity, doing so could have introduced another confound, namely individual differences in how people respond to these variables.

      Reviewer #2 (Public review):

      Summary:

      This paper focuses on understanding the behavioral and neural basis of regime shift detection, a common yet hard problem that people encounter in an uncertain world.

      Using a regime-shift task, the authors examined cognitive factors influencing belief updates by manipulating signal diagnosticity and environmental volatility. Behaviorally, they have found that people demonstrate both over and under-reaction to changes given different combinations of task parameters, which can be explained by a unified system-neglect account. Neurally, the authors have found that the vmPFC-striatum network represents current belief as well as belief revision unique to the regime detection task. Meanwhile, the frontoparietal network represents cognitive factors influencing regime detection i.e., the strength of the evidence in support of the regime shift and the intertemporal belief probability. The authors further link behavioral signatures of system neglect with neural signals and have found dissociable patterns, with the frontoparietal network representing sensitivity to signal diagnosticity when the observation is consistent with regime shift and vmPFC representing environmental volatility, respectively. Together, these results shed light on the neural basis of regime shift detection especially the neural correlates of bias in belief update that can be observed behaviorally.

      Strengths:

      (1) The regime-shift detection task offers a solid ground to examine regime-shift detection without the potential confounding impact of learning and reward. Relatedly, the system-neglect modeling framework provides a unified account for both over or under-reacting to environmental changes, allowing researchers to extract a single parameter reflecting people's sensitivity to changes in decision variables and making it desirable for neuroimaging analysis to locate corresponding neural signals.

      Thank you for recognizing our task design and our system-neglect computational framework in understanding change detection.

      (2) The analysis for locating brain regions related to belief revision is solid. Within the current task, the authors look for brain regions whose activation covary with both current belief and belief change. Furthermore, the authors have ruled out the possibility of representing mere current belief or motor signal by comparing the current study results with two other studies. This set of analyses is very convincing.

      Thank you for recognizing our control studies in ruling out potential motor confounds in our neural findings on belief revision.

      (3) The section on using neuroimaging findings (i.e., the frontoparietal network is sensitive to evidence that signals regime shift) to reveal nuances in behavioral data (i.e., belief revision is more sensitive to evidence consistent with change) is very intriguing. I like how the authors structure the flow of the results, offering this as an extra piece of behavioral findings instead of ad-hoc implanting that into the computational modeling.

      Thank you for appreciating how we showed that neural insights can lead to new behavioral findings.

      Weaknesses:

      (1) The authors have presented two sets of neuroimaging results, and it is unclear to me how to reason between these two sets of results, especially for the frontoparietal network. On one hand, the frontoparietal network represents belief revision but not variables influencing belief revision (i.e., signal diagnosticity and environmental volatility). On the other hand, when it comes to understanding individual differences in regime detection, the frontoparietal network is associated with sensitivity to change and consistent evidence strength. I understand that belief revision correlates with sensitivity to signals, but it can probably benefit from formally discussing and connecting these two sets of results in discussion. Relatedly, the whole section on behavioral vs. neural slope results was not sufficiently discussed and connected to the existing literature in the discussion section. For example, the authors could provide more context to reason through the finding that striatum (but not vmPFC) is not sensitive to volatility.

      We thank the reviewer for the valuable suggestions.

      With regard to the first comment, we wish to clarify that we did not find that the frontoparietal network represents belief revision. It was the vmPFC and ventral striatum that we found to represent belief revision (Δ𝑃<sub>𝑡</sub> in Fig. 3). For the frontoparietal network, we identified its involvement in our task through finding that its activity correlated with strength of change evidence (Fig. 4) and individual subjects’ sensitivity to signal diagnosticity (Fig. 5). Conceptually, these two findings reflect how individuals interpret the signals (signals consistent or inconsistent with change) in light of signal diagnosticity. This is because (1) strength of change evidence is defined as signals (+1 for signal consistent with change, and -1 for signal inconsistent with change) multiplied by signal diagnosticity and (2) sensitivity to signal diagnosticity reflects how individuals subjectively evaluate signal diagnosticity. At the theoretical level, these two findings can be interpreted through our computational framework in that both the strength of change evidence and sensitivity to signal diagnosticity contribute to estimating the likelihood of change (Eqs. 1 and 2). We added a paragraph in Discussion to talk about this.

      We added on p. 36:

      “For the frontoparietal network, we identified its involvement in our task through finding that its activity correlated with strength of change evidence (Fig. 4) and individual subjects’ sensitivity to signal diagnosticity (Fig. 5). Conceptually, these two findings reflect how individuals interpret the signals (signals consistent or inconsistent with change) in light of signal diagnosticity. This is because (1) strength of change evidence is defined as signals (+1 for signal consistent with change, and −1 for signal inconsistent with change) multiplied by signal diagnosticity and (2) sensitivity to signal diagnosticity reflects how individuals subjectively evaluate signal diagnosticity. At the theoretical level, these two findings can be interpreted through our computational framework in that both the strength of change evidence and sensitivity to signal diagnosticity contribute to estimating the likelihood of change (Equations 1 and 2 in Methods).”

      With regard to the second comment, we added a discussion on the behavioral and neural slope comparison. We pointed out previous papers conducting similar analyses (Vilares et al., 2012; Ting et al., 2015; Yang & Wu, 2020), their findings and how they relate to our results. Vilares et al. found that sensitivity to prior information (uncertainty in prior distribution) in the orbitofrontal cortex (OFC) and putamen correlated with a behavioral measure of sensitivity to the prior. In the current study, transition probability acts as prior in the system-neglect framework (Eq. 1) and we found that ventromedial prefrontal cortex represents subjects’ sensitivity to transition probability. Together, these results suggest that OFC (with vmPFC being part of OFC, see Wallis, 2011) is involved in the subjective evaluation of prior information in both static (Vilares et al., 2012) and dynamic environments (current study).

      We added on p. 37-38:

      “In the current study, our psychometric-neurometric analysis focused on comparing behavioral sensitivity with neural sensitivity to the system parameters (transition probability and signal diagnosticity). We measured sensitivity by estimating the slope of behavioral data (behavioral slope) and neural data (neural slope) in response to the system parameters. Previous studies had adopted a similar approach (Ting et al., 2015a; Vilares et al., 2012; Yang & Wu, 2020). For example, Vilares et al. (2012) found that sensitivity to prior information (uncertainty in prior distribution) in the orbitofrontal cortex (OFC) and putamen correlated with behavioral measure of sensitivity to the prior.

      In the current study, transition probability acts as prior in the system-neglect framework (Eq. 2 in Methods) and we found that ventromedial prefrontal cortex represents subjects’ sensitivity to transition probability. Together, these results suggest that OFC (with vmPFC being part of OFC, see Wallis, 2011) is involved in the subjective evaluation of prior information in both static (Vilares et al., 2012) and dynamic environments (current study). In addition, distinct from vmPFC in representing sensitivity to transition probability or prior, we found through the behavioral-neural slope comparison that the frontoparietal network represents how sensitive individual decision makers are to the diagnosticity of signals in revealing the true state (regime) of the environment.”

      (2) More details are needed for behavioral modeling under the system-neglect framework, particularly results on model comparison. I understand that this model has been validated in previous publications, but it is unclear to me whether it provides a superior model fit in the current dataset compared to other models (e.g., a model without \alpha or \beta). Relatedly, I wonder whether the final result section can be incorporated into modeling as well - i.e., the authors could test a variant of the model with two \betas depending on whether the observation is consistent with a regime shift and conduct model comparison.

      Thank you for the great suggestion. We rewrote the final Results section to specifically focus on model comparison. Following the reviewer’s suggestion, we estimated the beta parameters separately for change-consistent and change-inconsistent signals and found that these models were indeed better than the original system-neglect model.

      To incorporate these new findings, we rewrote the entire final Results section, “Incorporating signal dependency into system-neglect model led to better models for regime-shift detection” (p. 28-30).

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Use line numbers for the next round of reviews.

      We added line numbers in the revised manuscript.

      (2) Figure 2b: Can the empirical results be reproduced by the system-neglect model? This would complement the analyses presented in Figure S4.

      Yes. We added Figure S6, which is based on system-neglect model fits. For each subject, we first computed period-by-period probability estimates based on the parameter estimates of the system-neglect model. Second, we computed the index of overreaction (IO) for each combination of transition probability and signal diagnosticity. Third, we plotted the IO as we did for the empirical results in Fig. 2b. We found that the empirical results in Fig. 2b are similar to the system-neglect model results shown in Figure S6, indicating that the empirical results can be reproduced by the model.

      (3) Page 14: Instead of referring to the "Methods" in general, you could be more specific about where the relevant information can be found.

      Fixed. We changed “See Methods” to “See System-neglect model in Methods”.

      (4) Page 18: Consider avoiding the term "more significantly". Consider effect sizes if interested in comparing effects to each other.

      Fixed. On page 19, we changed that to

      “In the second analysis, we found that for both vmPFC and ventral striatum, the regression coefficient of 𝑃<sub>𝑡</sub> was significantly different between Experiment 1 and Experiment 2 (Fig. 3C) and between Experiment 1 and Experiment 3 (Fig. 3D; also see Tables S5 and S6 in SI).”

      (5) Page 30: Cite key studies using reversal-learning paradigms. Currently, readers less familiar with the literature might have difficulties with this.

      We now cite key studies using reversal-learning paradigms on p.32:

      “Our work is closely related to the reversal-learning paradigm—the standard paradigm in neuroscience and psychology to study change detection (Fellows & Farah, 2003; Izquierdo et al., 2017; O'Doherty et al., 2001; Schoenbaum et al., 2000; Walton et al., 2010). In a typical reversal-learning task, human or animal subjects choose between two options that differ in the reward magnitude or probability of receiving a reward. Through reward feedback the participants gradually learn the reward contingencies associated with the options and have to update knowledge about reward contingencies when contingencies are switched in order to maximize rewards.”

      Reviewer #2 (Recommendations for the authors):

      (1) Some literature on change detection seems missing. For example, the author should also cite Muller, T. H., Mars, R. B., Behrens, T. E., & O'Reilly, J. X. (2019). Control of entropy in neural models of environmental state. elife, 8, e39404. This paper suggests that medial PFC is correlated with the entropy of the current state, which is closely related to regime change and environmental volatility.

      Thank you for pointing to this paper. We have now added it and other related papers in the Introduction and Discussion.

      In Introduction, we added on p.5-6:

      “Different behavioral paradigms, most notably reversal learning, and computational models were developed to investigate its neurocomputational substrates (Behrens et al., 2007; Izquierdo et al., 2017; Payzan-LeNestour et al., 2011, 2013; Nasser et al., 2010; McGuire et al., 2014; Muller et al., 2019). Key findings on the neural implementations for such learning include identifying brain areas and networks that track volatility in the environment (rate of change) (Behrens et al., 2007), the uncertainty or entropy of the current state of the environment (Muller et al., 2019), participants’ beliefs about change (Payzan-LeNestour et al., 2011; McGuire et al., 2014; Kao et al., 2020), and their uncertainty about whether a change had occurred (McGuire et al., 2014; Kao et al., 2020).”

      In Discussion (p.35), we added a new paragraph:

      “Related to OFC function in decision making and reinforcement learning, Wilson et al. (2014) proposed that OFC is involved in inferring the current state of the environment. For example, medial OFC had been shown to represent probability distribution on possible states of the environment (Chan et al., 2016), the current task state (Schuck et al., 2016) and uncertainty or entropy associated with the state of the environment (Muller et al., 2019). In the context of regime-shift detection, regimes can be regarded as states of the environment and therefore a change in regime indicates a change in the state of the environment. Muller et al. (2019) found that in dynamic environments where changes in the state of the environment happen regularly, medial OFC represented the level of uncertainty in the current state of the environment. Our finding that vmPFC represented individual participants’ probability estimates of regime shifts suggest that vmPFC and/or OFC are involved in inferring the current state of the environment through estimating whether the state has changed. Our finding that vmPFC represented individual participants’ sensitivity to transition probability further suggest that vmPFC and/or OFC contribute to individual participants’ biases in state inference (over- and underreactions to change) in how these brain areas respond to the volatility of the environment.”

      (2) The language used when describing the selective relationship between frontoparietal network activation and change-consistent signal can be clearer. When describing separating those two signals, the authors refer to them as when the 'blue' signal shows up and when the 'red' signal shows up, assuming that the current belief state is blue. This is a little confusing cuz it is hard to keep in mind what is the default color in this example. It would be more intuitive if the author used language such as the 'change consistent' signal.

      Thank you for the suggestion. We have changed the wording according to your suggestion. That is, we say ‘change-consistent (blue) signals’ and ‘change-inconsistent (red) signals’ throughout pages 22-28.

      (3) Figure 4B highlights dmPFC. However, in the associated text, it says p = .10 so it is not significant. To avoid misleading readers, I would recommend pointing this out explicitly beyond saying 'most brain regions in the frontoparietal network also correlated with the intertemporal prior'.

      Thank you for pointing this out. We now say on p.20

      “With independent (leave-one-subject-out, LOSO) ROI analysis, we examined whether brain regions in the frontoparietal network (shown to represent strength of change evidence) correlated with intertemporal prior and found that all brain regions, with the exception of dmPFC, in the frontoparietal network correlated with the intertemporal prior.”

      (4) There is a full paragraph in the discussion talking about the central opercular cortex, but this terminology has not shown up in the main body of the paper. If this is an important brain region to the authors, I would recommend mentioning it more often in the result section.

      Thank you for this suggestion. We have now added central opercular cortex in the Results section (p.18):

      “For 𝑃<sub>𝑡</sub>, we found that the ventromedial prefrontal cortex (vmPFC) and ventral striatum correlated with this behavioral measure of subjects’ belief about change. In addition, many other brain regions, including the motor cortex, central opercular cortex, insula, occipital cortex, and the cerebellum also significantly correlated with 𝑃<sub>𝑡</sub>.”

      (5) The authors have claimed that people make more extreme estimates under high diagnosticity (Supplementary Figure 1). This is an interesting point because it seems to be different from what is shown in the main graph where it seems that people are not extreme enough compared to an ideal Bayesian observer. I understand that these are effects being investigated under different circumstances. It would be helpful if for Supplementary Figure 1 the authors could overlay, or generate a different figure showing what an ideal Bayesian observer would do in this situation.

      We thank the reviewer for pointing this out. We wish to clarify that when we said “more extreme estimates under high diagnosticity” we meant compared with low diagnosticity and not with the ideal Bayesian observer. We clarified this point by rephrasing our sentence on p.11:

      “We also found that subjects tended to give more extreme Pt under high signal diagnosticity than low diagnosticity (Fig. S1 in Supplementary Information, SI).”

      When it comes to comparing subjects’ probability estimates with the normative Bayesian, subjects tended to “underreact” under high diagnosticity. This can be seen in Fig. 4B, which shows a trend of increasing underreaction (or decreasing overreaction) as diagnosticity increased (row-wise comparison for a given transition probability).

      We see the reviewer’s point about overlaying the Bayesian observer on Fig. S1 and updated the figure by adding the normative Bayesian estimates in orange.

    1. Author response:

      The following is the authors’ response to the original reviews

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      Silbaugh, Koster, and Hansel investigated how the cerebellar climbing fiber (CF) signals influence neuronal activity and plasticity in mouse primary somatosensory (S1) cortex. They found that optogenetic activation of CFs in the cerebellum modulates responses of cortical neurons to whisker stimulation in a cell-type-specific manner and suppresses potentiation of layer 2/3 pyramidal neurons induced by repeated whisker stimulation. This suppression of plasticity by CF activation is mediated through modulation of VIP- and SST-positive interneurons. Using transsynaptic tracing and chemogenetic approaches, the authors identified a pathway from the cerebellum through the zona incerta and the thalamic posterior medial (POm) nucleus to the S1 cortex, which underlies this functional modulation.

      Strengths:

      This study employed a combination of modern neuroscientific techniques, including two-photon imaging, opto- and chemo-genetic approaches, and transsynaptic tracing. The experiments were thoroughly conducted, and the results were clearly and systematically described. The interplay between the cerebellum and other brain regions - and its functional implications - is one of the major topics in this field. This study provides solid evidence for an instructive role of the cerebellum in experience-dependent plasticity in the S1 cortex.

      Weaknesses:

      There may be some methodological limitations, and the physiological relevance of the CF-induced plasticity modulation in the S1 cortex remains unclear. In particular, it has not been elucidated how CF activity influences the firing patterns of downstream neurons along the pathway to the S1 cortex during stimulation.

      Our study addresses the important question of whether CF signaling can influence the activity and plasticity of neurons outside the olivocerebellar system, and further identifies the mechanism through which this indeed occurs. We provide a detailed description of the involvement of specific neuron subtypes and how they are modulated by climbing fiber activation to impact S1 plasticity. We also identify at least one critical pathway from the cerebellar output to the S1 circuit. It is indeed correct that we did not investigate how the specific firing patterns of all of these downstream neurons are affected, or the natural behaviors in which this mechanism is involved. Now that it is established that CF signaling can impact activity and plasticity outside the olivocerebellar system -- and even in the primary somatosensory cortex -- these questions will be important to further investigate in future studies.

      (1) Optogenetic stimulation may have activated a large population of CFs synchronously, potentially leading to strong suppression followed by massive activation in numerous cerebellar nuclear (CN) neurons. Given that there is no quantitative estimation of the stimulated area or number of activated CFs, observed effects are difficult to interpret directly. The authors should at least provide the basic stimulation parameters (coordinates of stim location, power density, spot size, estimated number of Purkinje cells included, etc.).

      As discussed in the paper, we indeed expect that synchronous CF activation is needed to allow for an effect on S1 circuits under natural or optogenetic activation conditions. The basic optogenetic stimulation parameters (also stated in the methods) are as follows: 470 nm LED; Ø200 µm core, 0.39 NA rotary joint patch cable; absolute power output of 2.5 mW; spot size at the surface of the cortex 0.6 mm; estimated power density 8 mW/mm². A serious estimate of the number of Purkinje cells that are activated is difficult to provide, in particular as ‘activation’ would refer to climbing fiber inputs, not Purkinje cells directly.
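
      The power density quoted above can be sanity-checked with a short calculation. This is a minimal sketch, assuming that the 0.6 mm spot size is a diameter and that the 2.5 mW output is spread uniformly over a circular spot with no transmission losses; the function name is hypothetical and not part of the methods.

```python
import math


def power_density_mw_per_mm2(power_mw: float, spot_diameter_mm: float) -> float:
    """Mean irradiance over a uniformly illuminated circular spot.

    Assumption: the light is spread evenly over a disc of the given diameter.
    """
    area_mm2 = math.pi * (spot_diameter_mm / 2.0) ** 2
    return power_mw / area_mm2


# Values taken from the stimulation parameters listed above.
density = power_density_mw_per_mm2(power_mw=2.5, spot_diameter_mm=0.6)
print(f"{density:.1f} mW/mm^2")  # prints "8.8 mW/mm^2"
```

      Under these assumptions the result is of the same order as the stated estimate; the exact figure depends on how the spot size is defined and on any losses along the light path.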

      (2) There are CF collaterals directly innervating CN (PMID: 10982464). Therefore, antidromic spikes induced by optogenetic stimulation may directly activate CN neurons. On the other hand, a previous study reported that CN neurons exhibit only weak responses to CF collateral inputs (PMID: 27047344). The authors should discuss these possibilities and the potential influence of CF collaterals on the interpretation of the results.

      A direct activation of CN neurons by antidromic spikes in CF collaterals cannot be ruled out. However, we believe that this effect will not be substantial. The activation of the multi-synaptic pathway that we describe in this study is more likely to require a strong nudge, such as that resulting from synchronized Purkinje cell input and subsequent rebound activation in CN neurons (PMID: 22198670), rather than the small-amplitude input provided by CF collaterals (PMID: 27047344). A requirement for CF/PC synchronization would also set a threshold for activation of this suppressive pathway.

      (3) The rationale behind the plasticity induction protocol for RWS+CF (50 ms light pulses at 1 Hz during 5 min of RWS, with a 45 ms delay relative to the onset of whisker stimulation) is unclear.

      a) The authors state that 1 Hz was chosen to match the spontaneous CF firing rate (line 107); however, they also introduced a delay to mimic the CF response to whisker stimulation (line 108). This is confusing, and requires further clarification, specifically, whether the protocol was designed to reproduce spontaneous or sensory-evoked CF activity.

      This protocol was designed to mimic sensory-evoked CF activity as reported in Bosman et al (J. Physiol. 588, 2010; PMID: 20724365).

      b) Was the timing of delivering light pulses constant or random? Given the stochastic nature of CF firing, randomly timed light pulses with an average rate of 1 Hz would be more physiologically relevant. At the very least, the authors should provide a clear explanation of how the stimulation timing was implemented.

      Light pulses were delivered at a constant 1 Hz. Our goal was to isolate synchrony as the variable distinguishing sensory-evoked from spontaneous CF activity; additionally varying stochasticity, rate, or amplitude would have confounded this. Future studies could explore how these additional parameters shape S1 responses.
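
      The constant-rate protocol discussed in point (3) can be sketched as a list of light-pulse intervals. This is only an illustration: the function name, the integer-millisecond representation, and the assumption that each pulse is timed from a whisker-stimulus onset at the start of each 1 s cycle are illustrative choices, not details taken from the manuscript.

```python
def light_pulse_schedule(duration_s: int = 300,
                         rate_hz: int = 1,
                         delay_ms: int = 45,
                         pulse_ms: int = 50) -> list[tuple[int, int]]:
    """Return (onset_ms, offset_ms) for each light pulse at a constant rate.

    Assumption: every cycle starts with a whisker-stimulus onset, and the
    light pulse begins `delay_ms` after that onset.
    """
    period_ms = 1000 // rate_hz
    n_pulses = duration_s * rate_hz
    schedule = []
    for k in range(n_pulses):
        onset = k * period_ms + delay_ms
        schedule.append((onset, onset + pulse_ms))
    return schedule


# 5 min of stimulation with 50 ms pulses at 1 Hz and a 45 ms delay.
pulses = light_pulse_schedule()
print(len(pulses), pulses[0], pulses[-1])
```

      A randomly timed variant, as raised in point (3b), would replace the fixed period with intervals drawn from a distribution whose mean is 1 s.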

      (4) CF activation modulates inhibitory interneurons in the S1 cortex (Figure 2): responses of interneurons in S1 to whisker stimulation were enhanced upon CF coactivation (Figure 2C), and these neurons were predominantly SST- and PV-positive interneurons (Figure 2H, I). In contrast, VIP-positive neurons were suppressed only in the late time window of 650-850 ms (Figure 2G). If the authors' hypothesis (that the activity of VIP neurons regulates SST- and PV-neuron activity during RWS+CF) is correct, then the activity of SST- and PV-neurons should also be increased during this late time window. The authors should clarify whether such temporal dynamics were observed or could be inferred from their data.

      Yes, we see a significant activity increase in PV neurons in this late time window (see updates to Data S2). Activity was also increased in SST neurons, though this did not reach statistical significance (Data S2). One reason might be that – given the small effect size overall – such an effect would only be seen in paired recordings. Chemogenetic activity modulation in VIP neurons, which provides a cruder test, shows, however, that SST- and PV-positive interneurons are indeed regulated via inhibition from VIP-positive interneurons (Fig. 5).

      (5) Transsynaptic tracing from CN nicely identified zona incerta (ZI) neurons and their axon terminals in both POm and S1 (Figure 6 and Figure S7).

      a) Which part of the CN (medial, interposed, or lateral) is involved in this pathway is unclear.

      We used a dual-injection transsynaptic tracing approach to specifically label the outputs of ZI neurons that receive input from the deep cerebellar nuclei. The anterograde viral vector injected into the CN is unlabeled (no fluorophore) and therefore, it is not possible to reliably assess the extent of viral spread in those experiments as performed. However, we have previously performed similar injections into the deep cerebellar nuclei, and post hoc histology suggests that all three nuclei will have at least some viral expression (Koster and Sherman, 2024). Due to size and injection location, we will mostly have reached the lateral (dentate) nucleus, but cannot exclude partial transsynaptic tracing from the interposed and medial nuclei.

      b) Were the electrophysiological properties of these ZI neurons consistent with those of PV neurons?

      Although most recorded cells demonstrated electrophysiological properties consistent with PV+ interneurons in other brain regions (i.e. fast spiking, narrow spike width, non-adapting; see Tremblay et al., 2016), interneuron subtypes in the ZI have been incompletely characterized, with SST+ cells showing similar features to those typically associated with PV+ cells (if interested, compare Fig. 4 in DOI: 10.1126/sciadv.abf6709 vs. Fig. S10 in https://doi.org/10.1016/j.neuron.2020.04.027). Therefore, we did not attempt to delineate cell identity based on these characteristics.

      c) There appears to be a considerable number of axons of these ZI neurons projecting to the S1 cortex (Figure S7C). Would it be possible to estimate the relative density of axons projecting to the POm versus those projecting to S1? In addition, the authors should discuss the potential functional role of this direct pathway from the ZI to the S1 cortex.

      An absolute quantification is difficult to provide based on the images that we obtained. However, any crude estimate would indicate that the relative density of projections to POm is higher than the density of projections to S1 (this is apparent from the images themselves). While the anatomical and functional connections from POm to S1 have been described in detail (Audette et al., 2018), this is not the case for the direct projections from ZI to S1. A direct ZI to S1 projection would potentially involve a different recruitment of neurons in the S1 circuit. Any discussion on the specific consequences of the activation of this direct pathway would be purely speculative.

      Reviewer #2 (Public review):

      Summary:

      The authors examined long-distance influence of climbing fiber (CF) signaling in the somatosensory cortex by manipulating whiskers through stimulation. Also, they examined CF signaling using two-photon imaging and mapped projections from the cerebellum to the somatosensory cortex using transsynaptic tracing. As a final manipulation, they used chemogenetics to perturb parvalbumin-positive neurons in the zona incerta and recorded from climbing fibers.

      Strengths:

      There are several strengths to this paper. The recordings were carefully performed, and AAVs used were selective and specific for the cell types and pathways being analyzed. In addition, the authors used multiple approaches that support climbing fiber pathways to distal regions of the brain. This work will impact the field and describes nice methods to target difficult-to-reach brain regions, such as the inferior olive.

      Weaknesses:

      There are some details in the methods that could be explained further. The discussion was very short and could connect the findings in a broader way.

      In the revised manuscript, we provide more methodological details, as requested. We kept the explanations in the discussion as simple as possible, so as not to bias further investigations into this novel phenomenon. In particular, we avoid an extended discussion of the gating effect of CF activity on S1 plasticity. While this is the effect on plasticity specifically observed here, we believe that the consequences of CF signaling on S1 activity may entirely depend on the contexts in which CF signals are naturally recruited, the ongoing activity of other brain regions, and behavioral state. Our key finding is that such modulation of neocortical plasticity can occur. How CF signaling controls plasticity of the neocortex in all contexts remains unknown, but needs to be thoughtfully tested in the future.

      Reviewer #3 (Public review):

      Summary:

      The authors developed an interesting novel paradigm to probe the effects of cerebellar climbing fiber activation on short-term adaptation of somatosensory neocortical activity during repetitive whisker stimulation. Normally, RWS potentiated whisker responses in pyramidal cells and weakly suppressed them in interneurons, lasting for at least 1 h. Crus II optogenetic climbing fiber activation during RWS reduced or inverted these adaptive changes. This effect was generally mimicked or blocked with chemogenetic SST or VIP activation/suppression as predicted based on their "sign" in the circuit.

      Strengths:

      The central finding about CF modulation of S1 response adaptation is interesting, important, and convincing, and provides a jumping-off point for the field to start to think carefully about cerebellar modulation of neocortical plasticity.

      Weaknesses:

      The SST and VIP results appeared slightly weaker statistically, but I do not personally think this detracts from the importance of the initial finding (if there are multiple underlying mechanisms, modulating one may reproduce only a fraction of the effect size). I found the suggestion that zona incerta may be responsible for the cerebellar effects on S1 to be a more speculative result (it is not so easy with existing technology to effectively modulate this type of polysynaptic pathway), but this may be an interesting topic for the authors to follow up on in more detail in the future.

      Our interpretation of the anatomical and physiological findings is that a pathway via the ZI is indeed critical for the observed effects. This pathway also represents perhaps the most direct pathway (i.e. the fewest synapses connecting the cerebellar nuclei to S1). However, several other direct and indirect pathways are plausible as well, and we expect distinct activation requirements and consequences for neurons in the S1 circuit. These are indeed interesting topics for future investigation.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      (1) Line 77: "CF transients" is not a standard or widely recognized term. Please use a more precise expression, such as "CF-induced calcium transients."

      We now avoid the use of the term “CF transients” and replaced it with “CF-induced calcium transients.”

      (2) Titer of AAVs injected should be provided.

      AAV titers have been included in an additional data table (Data S9).

      (3) Several citations to the figures are incorrect (for example, "Supplementary Data 2a (Line 398)" does not exist).

      We apologize for the mistakes in this version of the article. Incorrect citations to the figures have been corrected.

      (4) Line 627-628: "The tip of the patch cable was centered over Crus II in all optogenetic stimulation experiments." The stereotaxic coordinate of the tip position should be provided.

      The stereotaxic coordinate of the tip position has been provided in the methods.

      (5) Line 629: "Blue light pulses were delivered with a 470 nm Fiber-Coupled LED (Thorlabs catalog: M470F3)." The size of the light stim and estimated power density (W/mm^2) at the surface of the cortex should be provided.

      The spot size and estimated power density at the surface of the cortex have been provided in the methods.

      (6) Line 702-706: References for DCZ should be cited.

      We now cite Nagai et al, Nat. Neurosci. 23 (2020) as the original reference.

      (7) Two-photon image processing (Line 807-809): The rationale for normalizing ∆F/F traces to a pre-stimulus baseline is unclear because ∆F/F is, by definition, already normalized to baseline fluorescence: (Ft-F0)/F0. The authors should clarify why this additional normalization step was necessary and how it affected the interpretation of the data.

      A single baseline fluorescence value (F₀) was computed for each neuron across the entire recording session, which lasted ~120 minutes. However, some S1 neurons exhibit fluctuations in baseline fluorescence over time, often related to locomotor activity or spontaneous network oscillations, which can obscure stimulus-evoked changes. To isolate fluorescence changes specifically attributable to whisker stimulation, we normalized each ∆F/F trace to the prestimulus baseline for that trial. This additional normalization allowed us to quantify potentiation or depression of sensory responses themselves, independently of spontaneous oscillations or locomotion-related changes in the ongoing neural activity.
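
      The two-step normalization described above can be illustrated with a small example. This is a sketch under stated assumptions, not the analysis code used in the study: the per-trial step is assumed to be a subtraction of the mean pre-stimulus ∆F/F, and the function names and fluorescence values are hypothetical.

```python
from statistics import mean


def session_dff(raw_trace: list[float], f0: float) -> list[float]:
    """Convert raw fluorescence to dF/F using one session-wide baseline F0."""
    return [(f - f0) / f0 for f in raw_trace]


def prestimulus_correct(dff_trial: list[float], n_pre: int) -> list[float]:
    """Subtract the mean of the first n_pre (pre-stimulus) samples.

    Assumption: per-trial normalization is a simple baseline subtraction.
    """
    offset = mean(dff_trial[:n_pre])
    return [value - offset for value in dff_trial]


# Hypothetical trials: identical evoked response, but trial_b sits on a
# slowly elevated baseline (for example during locomotion).
f0 = 100.0
trial_a = session_dff([100.0, 100.0, 100.0, 120.0, 110.0], f0)
trial_b = session_dff([110.0, 110.0, 110.0, 130.0, 120.0], f0)

corrected_a = prestimulus_correct(trial_a, n_pre=3)
corrected_b = prestimulus_correct(trial_b, n_pre=3)

# Before correction the peaks differ; after correction they match.
print(max(trial_a), max(trial_b))
print(max(corrected_a), max(corrected_b))
```

      In this toy case the session-wide ∆F/F peaks differ only because of the baseline offset, and the per-trial correction removes that difference.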

      Reviewer #2 (Recommendations for the authors):

      (1) Did the climbing fiber stimulation for Figure 1 result in any changes to motor activity? Can you make any additional comments on other behaviors that were observed during these manipulations?

      Acute CF stimulation did not cause any changes in locomotor or whisking activity. The CF stimulation also did not influence the overall level of locomotion or whisking during plasticity induction.

      (2) Figure 3B and F- it is very difficult to see the SST+ neurons. Can this be enhanced?

      We linearly adjusted the brightness and contrast for the bottom images in Figure 3B and F to improve visualization of SST+ neurons. Note the expression of both hM3D(Gq) and hM4D(Gi) in SST+ neurons is sparse, which was necessary to avoid off-target effects.

      (3) Can you be more specific about the subregions of cerebellar nuclei and cell types that are targeted in the tracing studies? Discussions of the cerebellar nuclei subregions are missing and would be interesting, as others have shown discrete pathways between cerebellar nuclei subregions and long-distance projections.

      See our response to comment 5a from Reviewer 1 (copied again here): we used a dual-injection transsynaptic tracing approach to specifically label the outputs of ZI neurons that receive input from the deep cerebellar nuclei. The anterograde viral vector injected into the CN is unlabeled (no fluorophore) and therefore, it is not possible to reliably assess the extent of viral spread in those experiments as performed. However, we have previously performed similar injections into the deep cerebellar nuclei, and post hoc histology suggests that all three nuclei will have at least some viral expression (Koster and Sherman, 2024). Due to size and injection location, we will mostly have reached the lateral (dentate) nucleus, but cannot exclude partial transsynaptic tracing from the interposed and medial nuclei.

      It would indeed be interesting to further investigate the effect of CFs residing in different cerebellar lobules, which preferentially target different cerebellar nuclei, on targets of these nuclei.

      (4) Did you see any connection to the ventral tegmental area? Can you comment on whether dopamine pathways are influenced by CF and in your manipulations?

      We did not specifically look at these pathways and thus are not able to comment on this.

      (5) These are intensive surgeries, do you think glia could have influenced any results?

      This was not tested and seems unlikely, but we cannot exclude such a possibility.

      (6) It is unclear in the methods how long animals were recorded for in each experiment. Can you add more detail?

      Additional detail was added to the methods. Recordings for all experimental configurations did not last more than 120 minutes in total. All data were analyzed across identical time windows for each experiment.

      (7) In the methods it was mentioned that recording length can differ between animals. Can this influence the results, and if so, how was that controlled for?

      Recording length varied within experimental groups, but there was no systematic difference between groups.

      (8) I do not see any mention of animal sex throughout this manuscript. If animals were mixed groups, were sex differences considered? Would it be expected that CF activity would be different in male and female mice?

      As mentioned in the Methods (Animals), mice of either sex were used. No sex-dependent differences were observed.

      (9) Transsynaptic tracing results of the zona incerta are very interesting. The zona incerta is highly understudied, but has been linked to feeding, locomotion, arousal, and novelty seeking. Do you think this pathway would explain some of the behavioral results found through other studies of cerebellar lobule perturbations? Some discussion of how this brain region would be important as a cerebellar connection in animal behavior would be interesting.

      Since the multi-synaptic pathway from the cerebellum to S1 involves several brain regions with their own inputs and modulatory influences, it seems plausible to assume that behaviors controlled by these regions or affecting signaling pathways that regulate them would show some level of interaction. Our study does not address these interactions, but this will be an interesting question to be addressed in future work.

      Reviewer #3 (Recommendations for the authors):

      General comments on the data presentation:

      I'm not a huge fan of taking areas under curves ('AUC' throughout the study) when the integral of the quantity has no physical meaning - 'normalizing' the AUC (1I,L etc) is even stranger, because of course if you instead normalize the AUC by the # of data points, you literally just get the mean (which is probably what should be used instead).

      Indeed, AUC is equal to the average response in the time window used, multiplied by the window duration (thus, AUC is directly proportional to the mean). We choose to report AUC, a descriptive statistic, rather than the mean within this window. In 1I and L, we normalize the AUC across animals, essentially removing the variability across animals in the ‘Pre’ condition for visualization. Note that the significance of these comparisons is consistent whether or not we normalize to the ‘Pre’ condition (non-normalized RWS data in I shows a significant increase in PN activity, p = 0.0068, signrank test; non-normalized RWS+CF data in I shows a significant decrease in PN activity, p = 0.0135, paired t-test; non-normalized RWS data in L shows a significant decrease in IN activity, p <0.001, paired t-test; non-normalized RWS+CF data in L shows no significant change in IN activity, p = 0.7789, paired t-test).
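
      The proportionality between AUC and the mean can be checked numerically. This is a minimal sketch, assuming a rectangular (Riemann-sum) integral over evenly spaced samples; the sample values, sampling interval, and function name are hypothetical.

```python
def auc_riemann(samples: list[float], dt: float) -> float:
    """Area under the curve as a rectangular sum over evenly spaced samples."""
    return sum(samples) * dt


# Hypothetical dF/F samples within one analysis window, 0.1 s apart.
samples = [0.0, 0.2, 0.4, 0.2, 0.0]
dt = 0.1

auc = auc_riemann(samples, dt)
mean_response = sum(samples) / len(samples)
window_duration = len(samples) * dt

# AUC equals the mean response multiplied by the window duration.
print(auc, mean_response * window_duration)
```

      With a trapezoidal rule the two end samples are weighted by one half, so the equality becomes approximate, but AUC and the mean still carry the same information for a fixed window.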

      I think unadorned bar charts are generally excluded from most journals now. Consider replacing these with something that shows the raw datapoints if not too many, or the distribution across points.

      We have replaced bar charts with box plots and violin plots. We have avoided plotting individual data points due to the quantity of points.

      In various places, the statistics produce various questionable outcomes that will draw unwanted reader scrutiny. Many of the examples below involve tiny differences in means with overlapping error bars that are "significant" or a few cases of nonoverlapping error bars that are "not significant." I think replacing the bar charts may help to resolve things here if we can see the whole distribution or the raw data points. As importantly, I think a big problem is that the statistical tests all seem to be nonparametric (they are ambiguously described in Table S3 as "Wilcoxon," which should be clarified, since there is an unpaired Wilcoxon test [rank sum] and a paired Wilcoxon test [sign rank]), and thus based on differences in the *median* whereas the bar charts are based on the *mean* (and SEM rather than MAD or IQR or other median-appropriate measure of spread). This should be fixed (either change the test or change the plots), which will hopefully allay many of the items below.

      We thank the reviewer for this important point. As mentioned in the Statistics and quantification section, Wilcoxon signed rank tests were used for non-normal data. We have replaced the bar charts with box plots, which show the IQR and median and indeed allay many of the items below.

      Here are some specific points on the statistics presentation:

      (1) 1G, the test says that following RWS+CF, the decrease in PN response is not significant. In 1I, the same data, but now over time, shows a highly significant decrease. This probably means that either the first test should be reconsidered (was this a paired comparison, which would "build in" the normalization subsequently used automatically?) or the second test should be reconsidered. It's especially strange because the n value in G, if based on cells, would seem to be ~50-times higher than that in I if based on mice.

      In Figure 1G, the analysis tests whether individual pyramidal neurons significantly changed their responses before vs. after RWS+CF stimulation. This is a paired comparison at the single-cell level, and here indicates that the average per-neuron response did not reliably decrease after RWS+CF when comparing each cell’s pre- and post-values directly. In contrast, Figure 1I examines the same dataset analyzed across time bins using a two-way ANOVA, which tests for effects of time, group (RWS vs. RWS+CF), and their interaction. The analysis showed a significant group effect (p < 0.001), indicating that the overall level of activity across all time points differed between RWS and RWS+CF conditions. The difference in significance between these two analyses arises because the first test (Fig. 1G) assesses within-neuron changes (paired), whereas the second test (Fig. 1I) assesses overall population-level differences between groups over time (independent groups). Thus, the tests address related but distinct questions—one about per-cell response changes, the other about how activity differs across experimental conditions.

      (2) 1J RWS+CF then shows a much smaller difference with overlapping error bars than the ns difference with nonoverlapping errors in 1G, but J gets three asterisks (same n-values).

      Bar graphs have been replaced with box plots.

      (3) 1K, it is very unclear what under the asterisk could possibly be significant here, since the black and white dots overlap and trade places multiple times.

      See response to point 1. A significant group effect will exist if the aggregate difference across all time bins exceeds within-group variability. The asterisk therefore reflects a statistically significant main group effect (RWS versus RWS+CF) rather than differences at any single time point. Note, however, the very small effect size here.

      (4) 2B, 2G, 2H, 2I, 3G, 3H, 5C etc, again, significance with overlapping error bars, see suggestions above.

      Bar graphs have been replaced with box plots.

      (5) Time windows: e.g., L149-153 / 2B - this section reads weirdly. I think it would be less offputting to show a time-varying significance, if you want to make this point (there are various approaches to this floating around), or a decay rate, or something else.

      Here, we wanted to understand the overall direction of influence of CFs on VIP activity. We find that CFs exert a suppressive effect on VIP activity, which is statistically significant in this later time window. The specific effect of CF modulation on the activity of S1 neurons across multiple time points will be described in more detail in future investigations.

      (6) 4G, 6I, these asterisks again seem impossible (as currently presented).

      Bar graphs have been replaced with box plots.

      The writing is generally in OK shape, but needs tightening/clarifying:

      (1) L45 "mechanistic capacity" not clear.

      We have simplified this term to “capacity.” We use the term here to express that the central question we pose is whether CF signals are able to impact S1 circuits. We demonstrate CF signals indeed influence S1 circuits and further describe the mechanism through which this occurs, but we do not yet know all of the natural conditions in which this may occur. We feel that “capacity” describes the question we pose -- and our findings -- very well.

      (2) L48-58 there's a lot of material here, not clear how much is essential to the present study.

      We would like to give an overview of the literature on instructive CF signaling within the cerebellum. Here, we feel it is important to describe how CFs supervise learning in the cerebellum via coincident activation of parallel fiber inputs and CF inputs. Our results demonstrate CFs have the capacity to supervise learning in the neocortex in a similar manner, as coincident CF activation with sensory input modulates plasticity of S1 neurons.

      (3) L59 "has the capacity to" maybe just "can".

      This has been adopted. We agree that “can” is a more straightforward way of saying “has the capacity to” here. In this sentence, “can” and “has the capacity to” both mean a general ability to do something, without explicit knowledge about the conditions of use.

      (4) L61-62 some of this is circular "observation that CF regulates plasticity in S1..has consequences for plasticity in S1".

      We now changed this to read “…consequences for input processing in S1.”

      (5) L91 "already existing whisker input" although I get it, strictly speaking, not clear what this means.

      This sentence has been reworded for clarity.

      (6) L94 "this form of plasticity" what form?

      Edited to read “sensory-evoked plasticity.”

      (7) L119 should say "to test the".

      This has been corrected.

      (8) L120 should say "well-suited to measure receptive fields".

      We agree; this wording has been adopted.

      (9) L130 should say "optical imaging demonstrated that receptive field".

      This has been adopted.

      (10) L138, the disclaimer is helpful, but wouldn't it be less confusing to just pick a different set of terms? Response potentiation etc.

      Perhaps, but we want to stress that components of LTP and LTD (traditionally tested using electrophysiological methods to specifically measure synaptic gain changes) can be optically measured as long as it is specified what is recorded.

      (11) L140, this whole section is not very clear. What was the experiment? What was done and how?

      The text in this section has been updated.

      (12) L154, 156, 158, 160, 960, what is a "basic response"? Is this supposed to contrast with RWS? If so, I would just say "we measured the response to whisker stimulation without first performing RWS, and compared this to the whisker stimulation with simultaneous CF activation."

      What we meant by “basic response” was the acute response of S1 neurons to a single 100 ms air puff. Here, we indeed measured the acute responses of S1 neurons to whisker stimulation (100 ms air puff) and compared them to whisker stimulation with simultaneous CF activation (100 ms air puff with a 50 ms light pulse; the light pulse was delayed 45 ms with respect to the air puff). This paragraph has been reworded for clarity.

      (13) L156 "comprised of a majority" unclear. You mean most of the nonspecific IN group is either PV or SST?

      Yes, that was meant here. This paragraph has been reworded for clarity.

      (14) L165 tense. "are activated" "we tested" prob should be "were activated."

      This sentence was reworded.

      (15) L173 Not requesting additional experiments, but demonstrating that the effect is mimicked by directly activating SST or suppressing VIP questions the specificity of CF activation per se, versus presumably many other pathways upstream of the same mechanisms, which might be worth acknowledging in the text.

      We indeed observe that directly activating SST or suppressing VIP neurons in S1 is sufficient to mediate the effect of CF activation on S1 pyramidal neurons, implicating SST and VIP neurons as the local effectors of CF signaling. In the text, we wrote “...the notion of sufficiency does not exclude potential effects of plasticity processes elsewhere that might well modulate effector activation in this context and others not yet tested.” Here, we mean that CFs are certainly not the only modulators of the inhibitory network in S1. One example we highlight in the discussion is that projections from M1 are known to modulate this disinhibitory VIP-to-SST-to-PN microcircuit in S1. We conclude from our chemogenetic manipulation experiments that CFs ultimately have the capacity to modulate S1 interneurons, which must occur indirectly (either through the thalamus or “upstream” regions as this reviewer points out). The fact that many other brain regions may also modulate the interneuron network in S1 -- or be modulated by CF activity themselves -- only expands the capacity of CFs to exert a variety of effects on S1 neurons in different contexts.

      (16) L247 "induced ChR2" awkward.

      We changed this to read “we expressed ChR2.”

      (17) 6C, what are the three colors supposed to represent?

      We apologize for the missing labels in this version of the manuscript. Figure 6C and the figure legend have been updated.

  4. social-media-ethics-automation.github.io social-media-ethics-automation.github.io
    1. 21.6. Bibliography# [u1] Plato. Phaedrus: Translated by Benjamin Jowett. January 2013. Page Version ID: 1189255462. [u2] Luddite. December 2023. Page Version ID: 1189255462. URL: https://en.wikipedia.org/w/index.php?title=Luddite&oldid=1189255462 (visited on 2023-12-10). [u3] Ted Chiang. Will A.I. Become the New McKinsey? The New Yorker, May 2023. URL: https://www.newyorker.com/science/annals-of-artificial-intelligence/will-ai-become-the-new-mckinsey (visited on 2023-12-10). [u4] xkcd comics. The Pace of Modern Life. June 2013. URL: https://xkcd.com/1227/ (visited on 2023-12-10). [u5] xkcd comics. 1227: The Pace of Modern Life - explain xkcd. June 2013. URL: https://www.explainxkcd.com/wiki/index.php/1227:_The_Pace_of_Modern_Life (visited on 2023-12-10). [u6] Steven Spielberg. Jurassic Park. June 1993. URL: https://www.imdb.com/title/tt0107290/. [u7] Alex Blechman [@AlexBlechman]. Sci-Fi Author: In my book I invented the Torment Nexus as a cautionary tale Tech Company: At long last, we have created the Torment Nexus from classic sci-fi novel Don't Create The Torment Nexus. November 2021. URL: https://twitter.com/AlexBlechman/status/1457842724128833538 (visited on 2023-12-10). [u8] Silicon Valley. April 2014. URL: https://www.imdb.com/title/tt2575988/. [u9] Eli Whitney. December 2023. Page Version ID: 1189351897. URL: https://en.wikipedia.org/w/index.php?title=Eli_Whitney&oldid=1189351897 (visited on 2023-12-10). [u10] Alfred Nobel. December 2023. Page Version ID: 1189282550. URL: https://en.wikipedia.org/w/index.php?title=Alfred_Nobel&oldid=1189282550 (visited on 2023-12-10). [u11] Einstein and the Manhattan Project. URL: https://www.amnh.org/exhibitions/einstein/peace-and-war/the-manhattan-project (visited on 2023-12-10). [u12] Steve Krenzel [@stevekrenzel]. With Twitter's change in ownership last week, I'm probably in the clear to talk about the most unethical thing I was asked to build while working at Twitter. 🧵. November 2022. 
URL: https://twitter.com/stevekrenzel/status/1589700721121058817 (visited on 2023-12-10). [u13] Britney Nguyen. Ex-Twitter engineer says he quit years ago after refusing to help sell identifiable user data, worries Elon Musk will 'do far worse things with data'. November 2022. URL: https://www.businessinsider.com/former-twitter-engineer-worried-how-elon-musk-treat-user-data-2022-11 (visited on 2023-12-10). [u14] Alphabet Workers Union-Communications Workers of America Local 9009. Our People: Workers are coming together to build power across Alphabet. URL: https://www.alphabetworkersunion.org/our-people (visited on 2023-12-10). [u15] Jason Parham. A People’s History of Black Twitter, Part I. Wired, July 2021. URL: https://www.wired.com/story/black-twitter-oral-history-part-i-coming-together/ (visited on 2023-12-10). [u16] Jason Parham. There Is No Replacement for Black Twitter. Wired, November 2022. URL: https://www.wired.com/story/black-twitter-elon-musk/ (visited on 2023-12-10). [u17] Catherine Buni. Media, company, behemoth: What, exactly, is Facebook? November 2016. URL: https://www.theverge.com/2016/11/16/13655102/facebook-journalism-ethics-media-company-algorithm-tax (visited on 2023-12-10). [u18] Rafi Letzter. A teenager on TikTok disrupted thousands of scientific studies with a single video. September 2021. URL: https://www.theverge.com/2021/9/24/22688278/tiktok-science-study-survey-prolific (visited on 2023-12-10). [u19] Catherine D'Ignazio and Lauren F. Klein. Data Feminism. Strong Ideas. MIT Libraries Experimental Collections Fund, Cambridge, 1 edition, 2020. ISBN 978-0-262-04400-4. URL: https://direct.mit.edu/books/oa-monograph/4660/Data-Feminism, doi:10.7551/mitpress/11805.001.0001. [u20] Janet Abbate. Recoding Gender: Women's Changing Participation in Computing. MIT Press, Cambridge, UNITED STATES, 2012. ISBN 978-0-262-30546-4. URL: http://ebookcentral.proquest.com/lib/washington/detail.action?docID=3339524 (visited on 2023-12-10). [u21] Mar Hicks. 
Programmed Inequality: How Britain Discarded Women Technologists and Lost Its Edge in Computing. MIT Press, Cambridge, UNITED STATES, 2017. ISBN 978-0-262-34294-0. URL: http://ebookcentral.proquest.com/lib/washington/detail.action?docID=6246618 (visited on 2023-12-10). [u22] Charlton D. McIlwain. Black software: the internet and racial justice, from the AfroNet to Black Lives Matter. 2020. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162262159401452. [u23] Simone Browne. Dark Matters: On the Surveillance of Blackness. Duke University Press, September 2015. ISBN 978-0-8223-7530-2. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99161921055701452 (visited on 2023-12-10), doi:10.1215/9780822375302. [u24] Safiya Umoja Noble. Algorithms of Oppression: How Search Engines Reinforce Racism. New York University Press, New York, UNITED STATES, 2018. ISBN 978-1-4798-3364-1. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162068349301452 (visited on 2023-12-10). [u25] Shalini Kantayya. Coded Bias. November 2020. URL: https://www.netflix.com/title/81328723 (visited on 2023-12-10). [u26] Tarleton Gillespie. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, New Haven, UNITED STATES, 2018. ISBN 978-0-300-23502-9. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162362661601452 (visited on 2023-12-10). [u27] Sarah T. Roberts. Behind the screen: content moderation in the shadows of social media. 2019. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162217744201452. [u28] Jean Burgess, Alice Marwick, and Thomas Poell. The SAGE Handbook of Social Media. SAGE Publications, 55 City Road, London, 2018. 
URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162105658401452 (visited on 2023-12-10), doi:10.4135/9781473984066. [u29] Yuri Takhteyev. Coding Places: Software Practice in a South American City. The MIT Press, September 2012. ISBN 978-0-262-30559-4. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99161981926801452 (visited on 2023-12-10), doi:10.7551/mitpress/9109.001.0001. [u30] Virginia Eubanks. Automating inequality: how high-tech tools profile, police, and punish the poor. 2018. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162064355601452. [u31] Mary L. Gray and Siddharth Suri. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Houghton Mifflin Harcourt Publishing Company, Boston, United States, 2019. ISBN 978-1-328-56628-7. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162207131801452 (visited on 2023-12-10). [u32] Shoshana Zuboff. The age of surveillance capitalism: the fight for a human future at the new frontier of power. 2019. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162177355601452. [u33] Cathy O'Neil. Weapons of math destruction: how big data increases inequality and threatens democracy. 2016. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99161951137601452. [u34] Sasha Costanza-Chock. Design justice: community-led practices to build the worlds we need. Information policy series. The MIT Press, Cambridge, Massachusetts, 2020. ISBN 978-0-262-35686-2. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162363060401452. [u35] Thomas S. Mullaney, Benjamin Peters, Mar Hicks, and Kavita Philip. Your computer is on fire. The MIT Press, Cambridge, Massachusetts, 2021. 
ISBN 978-0-262-36077-7. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99162423945901452, doi:10.7551/mitpress/10993.001.0001. [u36] Sara Wachter-Boettcher. Technically wrong: sexist apps, biased algorithms, and other threats of toxic tech. October 2018. URL: https://orbiscascade-washington.primo.exlibrisgroup.com/permalink/01ALLIANCE_UW/8iqusu/alma99329653362401451. [u37] Saunders, Joe and Carl Fox, editors. Media Ethics, Free Speech, and the Requirements of Democracy. Routledge, New York, December 2018. ISBN 978-0-203-70244-4. URL: https://www.taylorfrancis.com/books/edit/10.4324/9780203702444/media-ethics-free-speech-requirements-democracy-carl-fox-joe-saunders, doi:10.4324/9780203702444. [u38] Ruha Benjamin. Viral Justice: How We Grow the World We Want. Princeton University Press, October 2022. ISBN 978-0-691-22288-2. URL: https://press.princeton.edu/books/hardcover/9780691222882/viral-justice (visited on 2023-12-10). [u39] Meta for Developers. 2023. URL: https://developers.facebook.com/ (visited on 2023-12-10). [u40] API Reference — Facebook SDK for Python 4.0.0-pre documentation. 2015. URL: https://facebook-sdk.readthedocs.io/en/latest/api.html (visited on 2023-12-10). [u41] TikTok for Developers. 2023. URL: https://developers.tiktok.com/ (visited on 2023-12-10). [u42] Getting started with Official Account Developer Mode. January 2013. URL: https://developers.weixin.qq.com/doc/offiaccount/en/Getting_Started/Getting_Started_Guide.html (visited on 2023-12-10).

      After checking out Coded Bias, I was honestly surprised how much everyday technology relies on algorithms that were never tested on diverse groups of people. The documentary shows how facial recognition failed on darker-skinned women, which made me think about how “neutral” tech isn’t neutral at all. What really got me is how the developers didn’t seem to think about these consequences until people called them out. It connects perfectly to the chapter’s theme that innovation often ignores ethics until harm already happens. It also made me wonder how many other systems we use every day have hidden biases we just haven’t noticed yet.

    2. Ted Chiang. Will A.I. Become the New McKinsey? The New Yorker, May 2023. URL: https://www.newyorker.com/science/annals-of-artificial-intelligence/will-ai-become-the-new-mckinsey (visited on 2023-12-10).

      I appreciate how Chiang reframes the fear of AI “taking over” by comparing it to management-consulting logic rather than superintelligence. His argument that powerful institutions often use technology as a justification for harmful decisions — rather than technology making those decisions itself — really stuck with me. It made me think about how often companies claim, “The algorithm says we have to do this,” the same way executives once said, “McKinsey says we have to cut costs.”

    1. “I believe that this is an important test of the separation of church and state as we may see in our lifetime—as important a test—and it is critically important that we get it right” (Bloomberg ). His argument that the government should not prohibit people from worshiping as they wish could have been made without these exigent circumstances, but their inclusion changes the tone from one of a defensive posture to a more vigorous one.

      I think that the separation of church and state is an important standard that our government should follow. I also think that the way the writer uses the text to show this really helps to prove that point.

    1. Author response:

      The following is the authors’ response to the original reviews

      We would like to thank all reviewers for their constructive and in-depth reviews. Thanks to your feedback, we realized that the main objective of the paper was not presented clearly enough, and that our use of the same “modality-agnostic” terminology for both decoders and representations caused confusion. We addressed these two major points as outlined in the following. 

      In the revised manuscript, we highlight that the main contribution of this paper is to introduce modality-agnostic decoders. Apart from introducing this new decoder type, we put forward their advantages in comparison to modality-specific decoders in terms of decoding performance and analyze the modality-invariant representations (cf. updated terminology in the following paragraph) that these decoders rely on. The dataset that these analyses are based on is released as part of this paper, in the spirit of open science (but this dataset is only a secondary contribution for our paper). 

      Regarding the terminology, we clearly define modality-agnostic decoders as decoders that are trained on brain imaging data from subjects exposed to stimuli in multiple modalities. The decoder is not given any information on which modality a stimulus was presented in, and is therefore trained to operate in a modality-agnostic way. In contrast, modality-specific decoders are trained only on data from a single stimulus modality. These terms are explained in Figure 2. While these terms describe different ways of how decoders can be trained, there are also different ways to evaluate them afterwards (see also Figure 3); but obviously, this test-time evaluation does not change the nature of the decoder, i.e., there is no contradiction in applying a modality-specific decoder to brain data from a different modality.
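
      As a rough illustration of the training distinction described above, the sketch below fits a linear decoder either on pooled trials from both modalities (modality-agnostic) or on trials from one modality only (modality-specific). The closed-form ridge regression, the array shapes, the synthetic data, and the names `fit_ridge`, `X_img`, and `X_cap` are assumptions made for illustration; they are not taken from the manuscript.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

rng = np.random.default_rng(0)
n_trials, n_voxels, n_dims = 200, 50, 8  # hypothetical sizes

# Synthetic stand-ins: brain responses (X) and stimulus features (Y) per modality.
X_img = rng.normal(size=(n_trials, n_voxels))
Y_img = rng.normal(size=(n_trials, n_dims))
X_cap = rng.normal(size=(n_trials, n_voxels))
Y_cap = rng.normal(size=(n_trials, n_dims))

# Modality-specific decoders: each one sees trials from a single modality.
W_img = fit_ridge(X_img, Y_img)
W_cap = fit_ridge(X_cap, Y_cap)

# Modality-agnostic decoder: trials are pooled and no modality label is given.
W_agnostic = fit_ridge(np.vstack([X_img, X_cap]), np.vstack([Y_img, Y_cap]))

# At test time, any decoder can be applied to data from either modality,
# e.g. the image-trained decoder applied to caption trials (cross-decoding).
pred_cross = X_cap @ W_img
```

      The point of the sketch is that the training set alone determines the decoder type; applying `W_img` to caption data changes the evaluation, not the decoder.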

      Further, we identify representations that are relevant for modality-agnostic decoders using the searchlight analysis. We realized that our choice of using the same “modality-agnostic” term to describe these brain representations created unnecessary debate and confusion. In order to not conflate the terminology, in the updated manuscript we call these representations modality-invariant (and the opposite modality-dependent). Our methodology does not allow us to distinguish whether certain representations merely share representational structure to a certain degree, or are truly representations that abstract away from any modality-dependent information. However, in order to be useful for modality-agnostic decoding, a significant degree of shared representational structure is sufficient, and it is this property of brain representations that we now define as “modality-invariant”. 

      We updated the manuscript in line with this new terminology and focus: in particular, the first Related Work section on Modality-invariant brain representations, as well as the Introduction and Discussion.

      Public Reviews:

      Reviewer #1 (Public review):

      Summary:

      The authors introduce a densely-sampled dataset where 6 participants viewed images and sentence descriptions derived from the MS Coco database over the course of 10 scanning sessions. The authors further showcase how image and sentence decoders can be used to predict which images or descriptions were seen, using pairwise decoding across a set of 120 test images. The authors find decodable information widely distributed across the brain, with a left-lateralized focus. The results further showed that modality-agnostic models generally outperformed modality-specific models, and that data based on captions was not explained better by caption-based models but by modality-agnostic models. Finally, the authors decoded imagined scenes.

      Strengths:

      (1) The dataset presents a potentially very valuable resource for investigating visual and semantic representations and their interplay.

      (2) The introduction and discussion are very well written in the context of trying to understand the nature of multimodal representations and present a comprehensive and very useful review of the current literature on the topic.

      Weaknesses:

      (1) The paper is framed as presenting a dataset, yet most of it revolves around the presentation of findings in relation to what the authors call modality-agnostic representations, and in part around mental imagery. This makes it very difficult to assess the manuscript, whether the authors have achieved their aims, and whether the results support the conclusions.

      Thanks for this insightful remark. The dataset release is only a secondary contribution of our study; this was not clear enough in the previous version. We updated the manuscript to make the main objective of the paper more clear, as outlined in our general response to the reviews (see above).

      (2) While the authors have presented a potential use case for such a dataset, there is currently far too little detail regarding data quality metrics expected from the introduction of similar datasets, including the absence of head-motion estimates, quality of intersession alignment, or noise ceilings of all individuals.

      As already mentioned in the general response, the main focus of the paper is to introduce modality-agnostic decoders. The dataset is released in addition, which is why we did not focus on reporting extensive quality metrics in the original manuscript. To respond to your request, we updated the appendix of the manuscript to include a range of data quality metrics.

      The updated appendix includes head motion estimates in the form of realignment parameters and framewise displacement, as well as a metric to assess the quality of intersession alignment. More detailed descriptions can be found in Appendix 1 of the updated manuscript.
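
      For readers unfamiliar with framewise displacement, the sketch below computes one common definition: the sum of absolute frame-to-frame differences in the six rigid-body realignment parameters, with rotations converted to millimetres on a sphere of 50 mm radius. Whether Appendix 1 uses exactly this definition is not stated here, so the formula, the parameter order (three translations in mm, then three rotations in radians), and the function name `framewise_displacement` should be read as assumptions.

```python
def framewise_displacement(params, radius_mm=50.0):
    """params: one [tx, ty, tz, rx, ry, rz] row per volume (mm, radians)."""
    fd = [0.0]  # the first volume has no predecessor
    for prev, curr in zip(params, params[1:]):
        # Absolute change in the three translations (already in mm).
        trans = sum(abs(c - p) for c, p in zip(curr[:3], prev[:3]))
        # Absolute change in the three rotations, converted to arc length in mm.
        rot = sum(abs(c - p) * radius_mm for c, p in zip(curr[3:], prev[3:]))
        fd.append(trans + rot)
    return fd

# Hypothetical realignment parameters for three volumes.
motion = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.0, 0.002, 0.0, 0.0],
]
fd = framewise_displacement(motion)
```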

      Estimating noise ceilings based on repeated presentations of stimuli (as for example done in Allen et al. (2022)) requires multiple betas for each stimulus. All training stimuli were only presented once, so this could only be done for the test stimuli which were presented repeatedly. However, during our preprocessing procedure we directly calculated stimulus-specific betas based on data from all sessions using one single GLM, which means that we did not obtain separate betas for repeated presentations of the same stimulus. We will however share the raw data publicly, so that such noise ceilings can be calculated using an adapted preprocessing procedure if required.

      Allen, E. J., St-Yves, G., Wu, Y., Breedlove, J. L., Prince, J. S., Dowdle, L. T., Nau, M., Caron, B., Pestilli, F., Charest, I., Hutchinson, J. B., Naselaris, T., & Kay, K. (2022). A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1), 116–126. https://doi.org/10.1038/s41593-021-00962-x

      (3) The exact methods and statistical analyses used are still opaque, making it hard for a reader to understand how the authors achieved their results. More detail in the manuscript would be helpful, specifically regarding the exact statistical procedures, what tests were performed across, or how data were pooled across participants.

      In the updated manuscript, we improved the level of detail for the descriptions of statistical analyses wherever possible (see also our response to your “Recommendations for the authors”, Point 6).

      Regarding data pooling across participants: 

      Figure 8 shows averaged results across all subjects (as indicated in the caption).

      Regarding data pooling for the estimation of the significance threshold of the searchlight analysis for modality-invariant regions: We updated the manuscript to clarify that we performed a permutation test, combined with a bootstrapping procedure to estimate a group-level null distribution: “For each subject, we evaluated the decoders 100 times with shuffled labels to create per-subject chance-level results. Then, we randomly selected one of the 100 chance-level results for each of the 6 subjects and calculated group-level statistics (TFCE values) the exact same way as described in the preceding paragraph. We repeated this procedure 10,000 times resulting in 10,000 permuted group-level results.”
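
      The resampling scheme quoted above can be sketched as follows. This is a simplified illustration under stated assumptions: each chance-level result is reduced to a single scalar, the TFCE-based group statistic is replaced by a plain mean across subjects, and the one-sided p-value with a +1 correction is a conventional choice that the quoted text does not specify. The names `group_null_distribution` and `p_value` are hypothetical.

```python
import random
from statistics import mean

def group_null_distribution(chance_results, n_draws=10_000, seed=0):
    """chance_results[s][k]: chance-level score of subject s under shuffle k."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_draws):
        # Pick one shuffled-label result per subject ...
        sample = [rng.choice(subject_runs) for subject_runs in chance_results]
        # ... and compute the group statistic (a mean stands in for TFCE here).
        null.append(mean(sample))
    return null

def p_value(observed, null):
    """One-sided p-value with the usual +1 correction."""
    return (1 + sum(v >= observed for v in null)) / (1 + len(null))

# Synthetic example: 6 subjects x 100 shuffles of chance-level pairwise accuracy.
data_rng = random.Random(1)
chance = [[data_rng.gauss(0.5, 0.02) for _ in range(100)] for _ in range(6)]
null = group_null_distribution(chance)
p = p_value(0.60, null)
```

      Drawing one of the 100 shuffles per subject yields far more distinct group-level combinations than the 100 per-subject permutations alone, which is what makes 10,000 group-level draws feasible.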

      Additionally, we indicated that the same permutation testing methods were applied to assess the significance threshold for the imagery decoding searchlight maps (Figure 10). 

      (4) Many findings (e.g., Figure 6) are still qualitative but could be supported by quantitative measures.

      Figures 6 and 7 are intentionally qualitative and support the quantitative decoding results presented in Figures 4 and 5 (see also Reviewer 2 Comment 2).

      Figures 4 and 5 show pairwise decoding accuracy as a quantitative measure for evaluation of the decoders. This metric is the main metric we used to compare different decoder types and features. Based on the finding that modality-agnostic decoders using ImageBind features achieve the best score on this metric, we performed the additional qualitative analysis presented in Figures 6 and 7. (Note that we expanded the candidate set for the qualitative analysis in order to have a larger and more diverse set of images.)
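
      To make the metric concrete, the sketch below implements one plausible form of pairwise accuracy: for every ordered pair of test stimuli, a prediction counts as correct when it is more similar (by cosine similarity) to its own target than to the other stimulus' target. The exact definition used in the manuscript may differ; the similarity measure, the pair enumeration, and the function names are assumptions. Under this definition, chance level is 0.5.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_accuracy(preds, targets):
    """Fraction of ordered pairs (i, j), i != j, in which prediction i is more
    similar to its own target than to the target of stimulus j."""
    n = len(preds)
    correct = total = 0
    for i in range(n):
        own = cosine(preds[i], targets[i])
        for j in range(n):
            if i != j:
                correct += own > cosine(preds[i], targets[j])
                total += 1
    return correct / total

# Toy example with three stimuli in a 2-D feature space.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
good_preds = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
acc = pairwise_accuracy(good_preds, targets)
```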

      (5) Results are significant in regions that typically lack responses to visual stimuli, indicating potential bias in the classifier. This is relevant for the interpretation of the findings. A classification approach less sensitive to outliers (e.g., 70-way classification) could avoid this issue. Given the extreme collinearity of the experimental design, regressors in close temporal proximity will be highly similar, which could lead to leakage effects.

      It is true that our searchlight analysis revealed significant activity in regions outside of the visual cortex. However, it is assumed that the processing of visual information does not stop at the border of the visual cortex. The integration of information such as the semantics of the image is progressively processed in other higher-level regions of the brain. Recent studies have shown that activity in large areas of the cortex (including many outside of the visual cortex) can be related to visual stimulation (Solomon et al. 2024; Raugel et al. 2025). Our work confirms this finding and we therefore do not see reason to believe that this is due to a bias in our decoders.

      Further, you suggest that we could replace our regression approach with a 70-way classification. However, this is difficult with our fMRI data, as we do not see a straightforward way to assign class labels to the training and testing stimuli (the two datasets consist of non-overlapping sets of naturalistic images).

      To address your concerns regarding the collinearity of the experimental design and possible leakage effects, we trained and evaluated a decoder for one subject after running a “null-hypothesis” adapted preprocessing. More specifically, for all sessions, we shifted the functional data of all runs by one run (moving the data of the last run to the very front) while leaving the design matrices in place. Thereby, we destroyed the relationship between stimuli and brain activity but kept the original data and design with its collinearity (and possible biases). We preprocessed this adapted data for subject 1, ran a whole-brain decoding using ImageBind features, and verified that the decoding performance was at chance level: Pairwise accuracy (captions): 0.43 | Pairwise accuracy (images): 0.47 | Pairwise accuracy (imagery): 0.50. This result provides evidence against the notion that potential collinearity or biases in our experimental design or evaluation procedure could have led to inflated results.
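
      The run-shifting control described above amounts to a circular shift of the functional data relative to the design matrices. A minimal sketch, in which strings stand in for per-run functional images and design matrices and the helper name `shift_runs` is hypothetical:

```python
def shift_runs(functional_runs):
    """Move the last run to the front; every other run moves back by one."""
    return functional_runs[-1:] + functional_runs[:-1]

# Hypothetical session with four runs.
functional = ["run1_bold", "run2_bold", "run3_bold", "run4_bold"]
design = ["run1_design", "run2_design", "run3_design", "run4_design"]

shifted = shift_runs(functional)

# The design matrices stay in place, so each one is now paired with data
# recorded during a different run.
pairs = list(zip(shifted, design))
```

      Because only the pairing changes, the temporal structure of both the data and the design is preserved, which is what lets the control isolate leakage from genuine stimulus-related signal.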

      Raugel, J., Szafraniec, M., Vo, H.V., Couprie, C., Labatut, P., Bojanowski, P., Wyart, V. and King, J.R. (2025). Disentangling the Factors of Convergence between Brains and Computer Vision Models. arXiv preprint arXiv:2508.18226.

      Solomon, S. H., Kay, K., & Schapiro, A. C. (2024). Semantic plasticity across timescales in the human brain. bioRxiv, 2024-02.

      (6) The manuscript currently lacks a limitations section, specifically regarding the design of the experiment. This involves the use of the overly homogenous dataset Coco, which invites overfitting, the mixing of sentence descriptions and visual images, which invites imagery of previously seen content, and the use of a 1-back task, which can lead to carry-over effects to the subsequent trial.

      Regarding the COCO dataset: We agree that COCO is somewhat homogenous; it is, however, much more diverse and naturalistic than the smaller datasets used in previous fMRI experiments with multimodal stimuli. Additionally, COCO has been widely adopted as a benchmark dataset in the machine learning community and features rich annotations for each image (e.g. object labels, segmentations, additional captions, people’s keypoints), facilitating many more future analyses based on our data.

      Regarding the mixing of sentence descriptions and images: Subjects were not asked to visualize sentences, and they might have used different techniques for the one-back task. Generally, we do not see it as problematic if subjects perform visual imagery to some degree while reading sentences, and this might even be the case during normal reading as well. A more targeted experiment comparing reading with and without interleaved visual stimulation in the form of images and a one-back task would be required to assess this, but this was not the focus of our study. For now, it is true that we cannot be sure that our results generalize to cases in which subjects are just reading and are less incentivized to perform mental imagery.

      Regarding the use of a 1-back task: It was necessary to make some design choices in order to realize this large-scale data collection with approximately 10 hours of recording per subject. Specifically, the 1-back task was included in the experimental setup in order to ensure continuous engagement of the participant during the rather long sessions of 1 hour. The subjects did indeed need to remember the previous stimulus to succeed at the 1-back task, which means that some brain activity during the presentation of a stimulus is likely to be related to the previous stimulus. We aimed to account for this confound during the preprocessing stage when fitting the GLM, which was fit to capture only the response to the presented image/caption, not the preceding one. Still, it might have picked up on some of the activity from preceding stimuli, causing some decrease in the final decoding performance.

      We added a limitations section to the updated manuscript to discuss these important issues.

      (7) I would urge the authors to clarify whether the primary aim is the introduction of a dataset and showing the use of it, or whether it is the set of results presented. This includes the title of this manuscript. While the decoding approach is very interesting and potentially very valuable, I believe that the results in the current form are rather descriptive, and I'm wondering what specifically they add beyond what is known from other related work. This includes imagery-related results. This is completely fine! It just highlights that a stronger framing as a dataset is probably advantageous for improving the significance of this work.

      Thanks a lot for pointing this out. Based on this comment and feedback from the other reviewers, we restructured the abstract, introduction and discussion section of the paper to better reflect the primary aim (cf. general response above).

      You further mention that it is not clear what our results add beyond what is known from related work. We list the main contributions here:

      A single modality-agnostic decoder can decode the semantics of visual and linguistic stimuli irrespective of the presentation modality with a performance that is not lagging behind modality-specific decoders.

      Modality-agnostic decoders outperform modality-specific decoders for decoding captions and mental imagery.

      Modality-invariant representations are widespread across the cortex (a range of previous work has suggested they were much more localized: Bright et al. 2004; Jung et al. 2018; Man et al. 2012; Simanova et al. 2014).

      Regions that are useful for imagery largely overlap with modality-invariant regions.

      Bright, P., Moss, H., & Tyler, L. K. (2004). Unitary vs multiple semantics: PET studies of word and picture processing. Brain and language, 89(3), 417-432.

      Jung, Y., Larsen, B., & Walther, D. B. (2018). Modality-Independent Coding of Scene Categories in Prefrontal Cortex. Journal of Neuroscience, 38(26), 5969–5981.

      Liuzzi, A. G., Bruffaerts, R., Peeters, R., Adamczuk, K., Keuleers, E., De Deyne, S., Storms, G., Dupont, P., & Vandenberghe, R. (2017). Cross-modal representation of spoken and written word meaning in left pars triangularis. NeuroImage, 150, 292–307. https://doi.org/10.1016/j.neuroimage.2017.02.032

      Man, K., Kaplan, J. T., Damasio, A., & Meyer, K. (2012). Sight and Sound Converge to Form Modality-Invariant Representations in Temporoparietal Cortex. Journal of Neuroscience, 32(47), 16629–16636.

      Simanova, I., Hagoort, P., Oostenveld, R., & van Gerven, M. A. J. (2014). Modality-Independent Decoding of Semantic Information from the Human Brain. Cerebral Cortex, 24(2), 426–434.

      Reviewer #2 (Public review):

      Summary:

      This study introduces SemReps-8K, a large multimodal fMRI dataset collected while subjects viewed natural images and matched captions, and performed mental imagery based on textual cues. The authors aim to train modality-agnostic decoders (models that can predict neural representations independently of the input modality) and use these models to identify brain regions containing modality-agnostic information. They find that such decoders perform comparably or better than modality-specific decoders and generalize to imagery trials.

      Strengths:

      (1) The dataset is a substantial and well-controlled contribution, with >8,000 image-caption trials per subject and careful matching of stimuli across modalities - an essential resource for testing theories of abstract and amodal representation.

      (2) The authors systematically compare unimodal, multimodal, and cross-modal decoders using a wide range of deep learning models, demonstrating thoughtful experimental design and thorough benchmarking.

      (3) Their decoding pipeline is rigorous, with informative performance metrics and whole-brain searchlight analyses, offering valuable insights into the cortical distribution of shared representations.

      (4) Extension to mental imagery decoding is a strong addition, aligning with theoretical predictions about the overlap between perception and imagery.

      Weaknesses:

      While the decoding results are robust, several critical limitations prevent the current findings from conclusively demonstrating truly modality-agnostic representations:

      (1) Shared decoding ≠ abstraction: Successful decoding across modalities does not necessarily imply abstraction or modality-agnostic coding. Participants may engage in modality-specific processes (e.g., visual imagery when reading, inner speech when viewing images) that produce overlapping neural patterns. The analyses do not clearly disambiguate shared representational structure from genuinely modality-independent representations. Furthermore, in Figure 5, the modality-agnostic encoder did not perform better than the modality-specific decoder trained on images (in decoding images), but outperformed the modality-specific decoder trained on captions (in decoding captions). This asymmetry contradicts the premise of a truly "modality-agnostic" encoder. Additionally, given the similar performance between modality-agnostic decoders based on multimodal versus unimodal features, it remains unclear why neural representations did not preferentially align with multimodal features if they were truly modality-independent.

      We agree that successful modality-agnostic and cross-modal decoding does not necessarily imply that abstract patterns were decoded. In the updated manuscript, we therefore refer to these representations as modality-invariant (see also the updated terminology explained in the general response above).

      If participants are performing mental imagery when reading, and this is allowing us to perform cross-decoding, then this means that modality-invariant representations are formed during this mental imagery process, i.e. that the representations formed during this form of mental imagery are compatible with representations during visual perception (or, in your words, produce overlapping neural patterns). While we cannot know to what extent people were performing mental imagery while reading (or having inner speech while viewing images), our results demonstrate that their brain activity allows for decoding across modalities, which implies that modality-invariant representations are present.

      It is true that our current analyses cannot disambiguate modality-invariant representations (or, in your words, shared representational structure) from abstract representations (in your words, genuinely modality-independent representations). As the main goal of the paper was to build modality-agnostic decoders, and these only require what we call “modality-invariant” representations (see our updated terminology in the general reviewer response above), we leave this question open for future work. We do however discuss this important limitation in the Discussion section of the updated manuscript.

      Regarding the asymmetry of decoding results when comparing modality-agnostic decoders with the two respective modality-specific decoders for captions and images: We do not believe that this asymmetry contradicts the premise of a modality-agnostic decoder. Multiple explanations for this result are possible: (1) The modality-specific decoder for images might benefit from the more readily decodable lower-level modality-dependent neural activity patterns in response to images, which are less useful for the modality-agnostic decoder because they are not useful for decoding caption trials. The modality-specific decoders for captions might not be able to pick up on low-level modality-dependent neural activity patterns as these might be less easily decodable. 

      (2) The signal-to-noise ratio for caption trials might be lower than for image trials (cf. generally lower caption decoding performance), therefore the addition of training data (even if it is from another modality) improves the decoding performance for captions, but not for images (which might be at ceiling already).

      Regarding the similar performance between modality-agnostic decoders based on multimodal versus unimodal features: Unimodal features are based on rather high-level features of the respective modality (e.g. last-layer features of a model trained for semantic image classification), which can be already modality-invariant to some degree. Additionally, as already mentioned before, in the updated manuscript we only require representations to be modality-invariant and not necessarily abstract.

      (2) The current analysis cannot definitively conclude that the decoder itself is modality-agnostic, making "Qualitative Decoding Results" difficult to interpret in this context. This section currently provides illustrative examples, but lacks systematic quantitative analyses.

      The qualitative decoding results in Figures 6 and 7 present exemplary qualitative results for the quantitative results presented in Figures 4 and 5 (see also Reviewer 1 Comment 4).

      Figures 4 and 5 show pairwise decoding accuracy as a quantitative measure for evaluation of the decoders. This metric is the main metric we used to compare different decoder types and features. Based on the finding that modality-agnostic decoders using ImageBind features achieve the best score on this metric, we performed the additional qualitative analysis presented in Figures 6 and 7. (Note that we expanded the candidate set for the qualitative analysis in order to have a larger and more diverse set of images.)
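
      As a rough illustration of the pairwise accuracy metric mentioned above, the sketch below implements one common definition: a prediction counts as correct for a pair (i, j) if it is closer to the features of its own stimulus i than to those of another stimulus j. The function names, the use of cosine distance, and the exact pairing scheme are assumptions made for illustration; the response does not specify the authors' implementation.

```python
import math
from itertools import permutations


def cosine_distance(u, v):
    """Cosine distance between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_accuracy(predictions, targets):
    """Fraction of ordered pairs (i, j), i != j, for which prediction i
    is closer to its own target i than to the distractor target j.

    Assumed definition for illustration; chance level is 0.5.
    """
    correct = 0
    total = 0
    for i, j in permutations(range(len(predictions)), 2):
        d_match = cosine_distance(predictions[i], targets[i])
        d_mismatch = cosine_distance(predictions[i], targets[j])
        correct += d_match < d_mismatch
        total += 1
    return correct / total
```

      Under this definition a decoder that outputs random feature vectors scores around 0.5, which is why accuracies are compared against that chance level.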

      (3) The use of mental imagery as evidence for modality-agnostic decoding is problematic.

      Imagery involves subjective, variable experiences and likely draws on semantic and perceptual networks in flexible ways. Strong decoding in imagery trials could reflect semantic overlap or task strategies rather than evidence of abstraction.

      It is true that mental imagery does not necessarily rely on modality-agnostic representations. In the updated manuscript we revised our terminology and refer to the analyzed representations as modality-invariant, which we define as “representations that significantly overlap between modalities”. 

      The manuscript presents a methodologically sophisticated and timely investigation into shared neural representations across modalities. However, the current evidence does not clearly distinguish between shared semantics, overlapping unimodal processes, and true modality-independent representations. A more cautious interpretation is warranted.

      Nonetheless, the dataset and methodological framework represent a valuable resource for the field.

      We fully agree with these observations, and updated our terminology as outlined in the general response.

      Reviewer #3 (Public review):

      Summary:

      The authors recorded brain responses while participants viewed images and captions. The images and captions were taken from the COCO dataset, so each image has a corresponding caption, and each caption has a corresponding image. This enabled the authors to extract features from either the presented stimulus or the corresponding stimulus in the other modality.

      The authors trained linear decoders to take brain responses and predict stimulus features.

      "Modality-specific" decoders were trained on brain responses to either images or captions, while "modality-agnostic" decoders were trained on brain responses to both stimulus modalities. The decoders were evaluated on brain responses while the participants viewed and imagined new stimuli, and prediction performance was quantified using pairwise accuracy. The authors reported the following results:

      (1) Decoders trained on brain responses to both images and captions can predict new brain responses to either modality.

      (2) Decoders trained on brain responses to both images and captions outperform decoders trained on brain responses to a single modality.

      (3) Many cortical regions represent the same concepts in vision and language.

      (4) Decoders trained on brain responses to both images and captions can decode brain responses to imagined scenes.

      Strengths:

      This is an interesting study that addresses important questions about modality-agnostic representations. Previous work has shown that decoders trained on brain responses to one modality can be used to decode brain responses to another modality. The authors build on these findings by collecting a new multimodal dataset and training decoders on brain responses to both modalities.

      To my knowledge, SemReps-8K is the first dataset of brain responses to vision and language where each stimulus item has a corresponding stimulus item in the other modality. This means that brain responses to a stimulus item can be modeled using visual features of the image, linguistic features of the caption, or multimodal features derived from both the image and the caption. The authors also employed a multimodal one-back matching task, which forces the participants to activate modality-agnostic representations. Overall, SemReps-8K is a valuable resource that will help researchers answer more questions about modality-agnostic representations.

      The analyses are also very comprehensive. The authors trained decoders on brain responses to images, captions, and both modalities, and they tested the decoders on brain responses to images, captions, and imagined scenes. They extracted stimulus features using a range of visual, linguistic, and multimodal models. The modeling framework appears rigorous, and the results offer new insights into the relationship between vision, language, and imagery. In particular, the authors found that decoders trained on brain responses to both images and captions were more effective at decoding brain responses to imagined scenes than decoders trained on brain responses to either modality in isolation. The authors also found that imagined scenes can be decoded from a broad network of cortical regions.

      Weaknesses:

      The characterization of "modality-agnostic" and "modality-specific" decoders seems a bit contradictory. There are three major choices when fitting a decoder: the modality of the training stimuli, the modality of the testing stimuli, and the model used to extract stimulus features. However, the authors characterize their decoders based on only the first choice: "modality-specific" decoders were trained on brain responses to either images or captions, while "modality-agnostic" decoders were trained on brain responses to both stimulus modalities. I think that this leads to some instances where the conclusions are inconsistent with the methods and results.

      In our analysis setup, a decoder is entirely determined by two factors: (1) the modality of the stimuli that the subject was exposed to, and (2) the machine learning model used to extract stimulus features.

      The modality of the testing stimuli defines whether we are evaluating the decoder in a within-modality or cross-modality setting, but is not an inherent characteristic of a trained decoder.
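
      The distinction drawn here can be made concrete with a small sketch: a decoder is described by its training modalities and its feature model, while the evaluation setting is derived only when a test modality is supplied. The class name, field names, and setting labels below are hypothetical and chosen for illustration; they are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderSpec:
    """A trained decoder, described by the two factors named above."""
    train_modalities: frozenset  # e.g. {"image"}, {"caption"}, or both
    feature_model: str           # model used to extract stimulus features

    @property
    def decoder_type(self):
        # Trained on more than one stimulus modality -> modality-agnostic.
        if len(self.train_modalities) > 1:
            return "modality-agnostic"
        return "modality-specific"


def evaluation_setting(spec, test_modality):
    """The test modality is a property of the evaluation, not of the decoder."""
    if spec.decoder_type == "modality-agnostic":
        return "modality-agnostic"
    if test_modality in spec.train_modalities:
        return "within-modality"
    return "cross-modality"
```

      The same `DecoderSpec` can therefore be evaluated in several settings without changing its type.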

      First, the authors suggest that "modality-specific decoders are not explicitly encouraged to pick up on modality-agnostic features during training" (line 137) while "modality-agnostic decoders may be more likely to leverage representations that are modality-agnostic" (line 140). However, whether a decoder is required to learn modality-agnostic representations depends on both the training responses and the stimulus features. Consider the case where the stimuli are represented using linguistic features of the captions. When you train a "modality-specific" decoder on image responses, the decoder is forced to rely on modality-agnostic information that is shared between the image responses and the caption features. On the other hand, when you train a "modality-agnostic" decoder on both image responses and caption responses, the decoder has access to the modality-specific information that is shared by the caption responses and the caption features, so it is not explicitly required to learn modality-agnostic features. As a result, while the authors show that "modality-agnostic" decoders outperform "modality-specific" decoders in most conditions, I am not convinced that this is because they are forced to learn more modality-agnostic features.

      It is true that, for example, a modality-specific decoder trained on fMRI data from images with stimulus features extracted from captions might also rely on modality-invariant features. We still call this decoder modality-specific, as it has been trained to decode brain activity recorded from a specific stimulus modality. In the updated manuscript we corrected the statement that “modality-specific decoders are not explicitly encouraged to pick up on modality-invariant features during training” to include the case of decoders trained on features from the other modality which might also rely on modality-invariant features.

      It is true that a modality-agnostic decoder can also have access to modality-dependent information for captions and images. However, as it is trained jointly with both modalities and the modality-dependent features are not compatible, it is encouraged to rely on modality-invariant features. The result that modality-agnostic decoders outperform modality-specific decoders trained on captions for decoding captions confirms this, because if the decoder relied only on modality-dependent features, the addition of training data from another stimulus modality could not increase the performance. (Also, the lack of a performance drop compared to modality-specific decoders trained on images is only possible thanks to the reliance on modality-invariant features. If the decoder relied only on modality-dependent features, the addition of data from another modality would amount to adding noise to the training data, which must result in a performance drop at test time.) We cannot exclude the possibility that modality-agnostic decoders also rely on modality-dependent features, but our results suggest that they rely at least to some degree on modality-invariant features.

      Second, the authors claim that "modality-specific decoders can be applied only in the modality that they were trained on", while "modality-agnostic decoders can be applied to decode stimuli from multiple modalities, even without knowing a priori the modality the stimulus was presented in" (line 47). While "modality-agnostic" decoders do outperform "modality-specific" decoders in the cross-modality conditions, it is important to note that "modality-specific" decoders still perform better than expected by chance (figure 5). It is also important to note that knowing about the input modality still improves decoding performance even for "modality-agnostic" decoders, since it determines the optimal feature space: it is better to decode brain responses to images using decoders trained on image features, and it is better to decode brain responses to captions using decoders trained on caption features.

      Thanks for this important remark. We corrected this statement and now refer to “modality-specific decoders that are trained to be applied only in the modality that they were trained on”, highlighting that their training process optimizes them for decoding in a specific modality. They can indeed be applied to the other modality at test time; however, this results in a substantial performance drop.

      It is true that knowing the input modality can improve performance even for modality-agnostic decoders. This can most likely be explained by the fact that in that case the decoder can leverage both modality-invariant and modality-dependent features. However, we will not focus further on this result, as the main motivation for building modality-agnostic decoders is to be able to decode stimuli without knowing the stimulus modality a priori.

      Recommendations for the authors:

      Reviewer #1 (Recommendations for the authors):

      I will list additional recommendations below in no specific order:

      (1) I find the term "modality agnostic" quite unusual, and I believe I haven't seen it used outside of the ML community. I would urge the authors to change the terminology to be more common, or at least very early explain why the term is much better suited than the range of existing terms. A modality agnostic representation implies that it is not committed to a specific modality, but it seems that a representation cannot be committed to something.

      In the updated manuscript we now refer to the identified brain patterns as modality-invariant, a term that has previously been used in the literature (Man et al. 2012; Devereux et al. 2013; Patterson et al. 2016; Deniz et al. 2019; Nakai et al. 2021) (see also the general response at the top and the Introduction and Related Work sections of the updated manuscript).

      We continue to refer to the decoders as modality-agnostic, as this is a new type of decoder, and describes the fact that they are trained in a way that abstracts away from the modality of the stimuli. We chose this term as we are not aware of any work in which brain decoders were trained jointly on multiple stimulus modalities and in order not to risk contradictions/confusions with other definitions.

      Deniz, F., Nunez-Elizalde, A. O., Huth, A. G., & Gallant, J. L. (2019). The Representation of Semantic Information Across Human Cerebral Cortex During Listening Versus Reading Is Invariant to Stimulus Modality. Journal of Neuroscience, 39(39), 7722–7736. https://doi.org/10.1523/JNEUROSCI.0675-19.2019

      Devereux, B. J., Clarke, A., Marouchos, A., & Tyler, L. K. (2013). Representational Similarity Analysis Reveals Commonalities and Differences in the Semantic Processing of Words and Objects. The Journal of Neuroscience, 33(48).

      Nakai, T., Yamaguchi, H. Q., & Nishimoto, S. (2021). Convergence of Modality Invariance and Attention Selectivity in the Cortical Semantic Circuit. Cerebral Cortex, 31(10), 4825–4839. https://doi.org/10.1093/cercor/bhab125

      Man, K., Kaplan, J. T., Damasio, A., & Meyer, K. (2012). Sight and Sound Converge to Form Modality-Invariant Representations in Temporoparietal Cortex. Journal of Neuroscience, 32(47), 16629–16636.

      Patterson, K., & Lambon Ralph, M. A. (2016). The Hub-and-Spoke Hypothesis of Semantic Memory. In Neurobiology of Language (pp. 765–775). Elsevier. https://doi.org/10.1016/B978-0-12-407794-2.00061-4

      (2) The table in Figure 1B would benefit from also highlighting the number of stimuli that have overlapping captions and images.

      The number of overlapping stimuli is rather small (153-211 stimuli depending on the subject). We added this information to Table 1B. 

      (3) The authors wrote that training stimuli were presented only once, yet they used a one-back task. Did the authors also exclude the first presentation of these stimuli?

      Thanks for pointing this out. It is indeed true that some training stimuli were presented more than once, but only for the case of one-back target trials. In these cases the second presentation of the stimulus was excluded, but not the first. As the subject cannot be aware of the fact that the upcoming presentation is going to be a one-back target, the first presentation cannot be affected by the presence of the subsequent repeated presentation. We updated the manuscript to clarify this issue.

      (4) COCO has roughly 80-90 categories, so many image captions will be extremely similar (e.g., "a giraffe walking", "a surfer on a wave", etc.). How can people keep these apart?

      It is true that some captions and images are highly similar even though they are not matching in the dataset. This might have resulted in several false button presses because the subjects identified an image-caption pair as matching when in fact it was not a designated match. However, as there was no feedback given on the task performance, this issue should not have had a major influence on the brain activity of the participants.

      (5) Footnotes for statistics are quite unusual - could the authors integrate statistics into the text?

      Thanks for this remark. In the updated manuscript all statistics are part of the main text.

      (6) It may be difficult to achieve the assumptions of a permutation test - exchangeability, which may bias statistical results. It is not uncommon for densely sampled datasets to use bootstrap sampling on the predictions of the test data to identify if a given percentile of that distribution crosses 0. The lowest p-value is given by the number of bootstrap samples (e.g., if all 10,000 bootstrap samples are above chance, then p < 0.0001). This may turn out to be more effective.

      Thanks for this comment. Our statistical procedure did in fact involve a bootstrapping procedure to generate a null distribution at the group level. We updated the manuscript to describe this method in more detail. Here is the updated paragraph: “To estimate the statistical significance of the resulting clusters we performed a permutation test, combined with a bootstrapping procedure to estimate a group-level null distribution (see also Stelzer et al., 2013). For each subject, we evaluated the decoders 100 times with shuffled labels to create per-subject chance-level results. Then, we randomly selected one of the 100 chance-level results for each of the 6 subjects and calculated group-level statistics (TFCE values) the exact same way as described in the preceding paragraph. We repeated this procedure 10,000 times resulting in 10,000 permuted group-level results. We ensured that every permutation was unique, i.e. no two permutations were based on the same combination of selected chance-level results. Based on this null distribution, we calculated p-values for each vertex by calculating the proportion of sampled permutations where the TFCE value was greater than the observed TFCE value. To control for multiple comparisons across space, we always considered the maximum TFCE score across vertices for each group-level permutation (Smith and Nichols, 2009).”
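
      The procedure quoted above can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the group-level statistic is a plain mean across subjects standing in for the TFCE computation, and the function names and the nested-list data layout are hypothetical.

```python
import random


def group_null_distribution(chance_results, n_group_perms, rng):
    """Build a max-statistic null distribution at the group level.

    chance_results[s][k][v] is the chance-level score of subject s,
    label shuffle k, vertex v. For each group-level permutation, one
    shuffle is drawn per subject; every combination is used at most once.
    """
    n_shuffles = len(chance_results[0])
    n_subjects = len(chance_results)
    if n_group_perms > n_shuffles ** n_subjects:
        raise ValueError("not enough unique combinations available")
    seen = set()
    null_max = []
    while len(null_max) < n_group_perms:
        combo = tuple(rng.randrange(n_shuffles) for _ in range(n_subjects))
        if combo in seen:
            continue  # enforce uniqueness of permutations
        seen.add(combo)
        picked = [subj[k] for subj, k in zip(chance_results, combo)]
        n_vertices = len(picked[0])
        # Stand-in group statistic (assumption): mean across subjects.
        # The quoted paragraph uses TFCE values here instead.
        group_stat = [sum(p[v] for p in picked) / n_subjects
                      for v in range(n_vertices)]
        # Keep only the maximum across vertices to control for
        # multiple comparisons across space.
        null_max.append(max(group_stat))
    return null_max


def corrected_p_values(observed, null_max):
    """Per-vertex p-value: share of null maxima greater than the observed value."""
    return [sum(m > obs for m in null_max) / len(null_max) for obs in observed]
```

      With 100 shuffles for each of 6 subjects there are 100^6 possible combinations, so drawing 10,000 unique ones by rejection, as in this sketch, rarely needs a redraw.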

      (7) The authors present no statistical evidence for some of their claims (e.g., lines 335-337). It would be good if they could complement this in their description. Further, the visualization in Figure 4 is rather opaque. It would help if the authors could add a separate bar for the average modality-specific and modality-agnostic decoders or present results in a scatter plot, showing modality-specific on the x-axis and modality-agnostic on the y-axis and color-code the modality (i.e., making it two scatter colors, one for images, one for captions). All points will end up above the diagonal.

      We updated the manuscript and added statistical evidence for the claims made:

      We now report results for the claim that when considering the average decoding performance for images and captions, modality-agnostic decoders perform better than modality-specific decoders, irrespective of the features that the decoders were trained on.

      Additionally, we report the average modality-agnostic and modality-specific decoding accuracies corresponding to Figure 4. For modality-agnostic decoders the average value is 81.86%, for modality-specific decoders trained on images 78.15%, and for modality-specific decoders trained on captions 72.52%. We did not add a separate bar to Figure 4 as this would add additional information to a figure which is already very dense in its information content (cf. Reviewer 2’s recommendations for the authors). We therefore believe it is more useful to report the average values in the text and provide results for a statistical test comparing the decoder types. A scatter plot would make it difficult to include detailed information on the features, which we believe is crucial.

      We further provide statistical evidence for the observation regarding the directionality of cross-modal decoding.

      Reviewer #2 (Recommendations for the authors):

      For achieving more evidence to support modality-agnostic representations in the brain, I suggest more thorough analyses, for example:

      (1) Traditional searchlight RSA using different deep learning models. Through this approach, it might identify different brain areas that are sensitive to different formats of information (visual, text, multimodal); subsequently, compare the decoding performance using these ROIs.

      (2) Build more dissociable decoders for information of different modality formats, if possible. While I do not have a concrete proposal, more targeted decoder designs might better dissociate representational formats (i.e., unimodal vs. modality-agnostic).

      (3) A more detailed exploration of the "qualitative decoding results"--for example, quantitatively examining error types produced by modality-agnostic versus modality-specific decoders--would be informative for clarifying what specific content the decoder captures, potentially providing stronger evidence for modality-agnostic representations.

      Thanks for these suggestions. As the main goal of the paper is to introduce modality-agnostic decoders (which should be clearer from the updated manuscript, see also the general response to reviews), we did not include alternative methods for identifying modality-invariant regions. Nonetheless, we agree that in order to obtain more in-depth insight into the nature of the representations that were recorded, additional analyses will be indispensable, such as RSA, comparisons with more targeted decoder designs in terms of their target features, and more in-depth error type analyses. We leave these analyses as promising directions for future work.

      The writing could be further improved in the introduction and, accordingly, the discussion. The authors listed a series of theories about conceptual representations; however, they did not systematically explain the relationships and controversies between them, and it seems that they did not aim to address the issues raised by these theories anyway. Thus, the extraction of core ideas is suggested. The difference between "modality-agnostic" and terms like "modality-independent," "modality-invariant," "abstract," "amodal," or "supramodal," and the necessity for a novel term should be articulated.

      The updated manuscript includes an improved introduction and discussion section that highlight the main focus and contributions of the study.

      We believe that a systematic comparison of theories on conceptual representations involving their relationships and controversies would require a dedicated review paper. Here, we focused on the aspects that are relevant for the study at hand (modality-invariant representations), for which we find that none of the considered theories can be rejected based on our results.

      Regarding the terminology (modality-agnostic vs. modality-invariant, etc.), please refer to the general response.

      The figures also have room to improve. For example, Figures 4 and 5 present dense bar plots comparing multiple decoding settings (e.g., modality-specific vs. modality-agnostic decoders, feature space, within-modal vs. cross-modal, etc.); while comprehensive, they would benefit from clearer labels or separated subplots to aid interpretation. All figures are recommended to be optimized for greater clarity and directness in future revisions.

      Thanks for this remark. We agree that the figures are quite dense in information. However, splitting them up into subplots (e.g. separate subplots for different decoder types) would make it much less straightforward to compare the accuracy scores between conditions. As the main goal of these figures is to compare features and decoder types, we believe that it is useful to keep all information in the same plot. 

      You also suggest improving the clarity of the labels. It is true that the top left legend of Figures 4 and 5 was mixing information about decoder type and broad classes of features (vision/language/multimodal). To improve clarity, we updated the figures and clearly separated information on decoder type (the hue of different bars) and features (x-axis labels). The broad classes of features (vision/language/multimodal) are distinguished by alternating light gray background colors and additional labels at the very bottom of the plots.

      The new plots allow for easy performance comparison of the different decoder types and additionally provide information on confidence intervals for the performance of modality-specific decoders, which was not available in the previous figures.

      Reviewer #3 (Recommendations for the authors):

      (1) As discussed in the Public Review, I think the paper would greatly benefit from clearer terminology. Instead of describing the decoders as "modality-agnostic" and "modality-specific", perhaps the authors could describe the decoding conditions based on the train and test modalities (e.g., "image-to-image", "caption-to-image", "multimodal-to-image") or using the terminology from Figure 3 (e.g., "within-modality", "cross-modality", "modality-agnostic").

      We updated our terminology to be clearer and more accurate, as outlined in the general response. The terms modality-agnostic and modality-specific refer to the training conditions, and the test conditions are described in Figure 3 and are used throughout the paper.

      (2) Line 244: I think the multimodal one-back task is an important aspect of the dataset that is worth highlighting. It seems to be a relatively novel paradigm, and it might help ensure that the participants are activating modality-agnostic representations.

      It is true that the multimodal one-back task could play an important role for the activation of modality-invariant representations. Future work could investigate to what degree the presence of widespread modality-invariant representations is dependent on such a paradigm.

      (3) Line 253: Could the authors elaborate on why they chose a random set of training stimuli for each participant? Is it to make the searchlight analyses more robust?

      A random set of training stimuli was chosen in order to maximize the diversity of the training sets, i.e. to avoid bias based on a specific subsample of the COCO dataset. Between-subject comparisons can still be made based on the test set which was shared for all subjects, with the limitation that performance differences due to individual differences or to the different training sets cannot be disentangled. However, the main goal of the data collection was not to make between-subject comparisons based on common training sets, but rather to make group-level analyses based on a large and maximally diverse dataset.

      (4) Figure 4: Could the authors comment more on the patterns of decoding performance in Figure 5? For instance, it is interesting that ResNet is a better target than ViT, and BERT-base is a better target than BERT-large.

      A multitude of factors influence the decoding performance, such as feature dimensionality, model architecture, training data, and training objective(s) (Conwell et al. 2023; Raugel et al. 2025). BERT-base might be better than BERT-large because the extracted features are of lower dimension. ResNet might be better than ViT because of its architecture (CNN vs. Transformer). To dive deeper into these differences, further controlled analyses would be necessary, but this is not the focus of this paper. The main objective of the feature comparison was to provide a broad overview of visual/linguistic/multimodal feature spaces and to identify the most suitable features for modality-agnostic decoding.

      Conwell, C., Prince, J. S., Kay, K. N., Alvarez, G. A., & Konkle, T. (2023). What can 1.8 billion regressions tell us about the pressures shaping high-level visual representation in brains and machines? (p. 2022.03.28.485868). bioRxiv. https://doi.org/10.1101/2022.03.28.485868

      Raugel, J., Szafraniec, M., Vo, H.V., Couprie, C., Labatut, P., Bojanowski, P., Wyart, V. and King, J.R. (2025). Disentangling the Factors of Convergence between Brains and Computer Vision Models. arXiv preprint arXiv:2508.18226.

      (5) Figure 7: It is interesting that the modality-agnostic decoder predictions mostly appear traffic-related. Is there a possibility that the model always produces traffic-related predictions, making it trivially correct for the presented stimuli that are actually traffic-related? It could be helpful to include some examples where the decoder produces other types of predictions to dispel this concern.

      The presented qualitative examples were randomly selected. To make sure that the decoder is not always predicting traffic-related content, we included 5 additional randomly selected examples in Figures 6 and 7 of the updated manuscript. In only one of the 5 new examples the decoder was predicting traffic-related content, and in this case the stimulus had actually been traffic-related (a bus).

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

      Learn more at Review Commons


      Reply to the reviewers

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      This study explores chromatin organization around trans-splicing acceptor sites (TASs) in the trypanosomatid parasites Trypanosoma cruzi, T. brucei and Leishmania major. By systematically re-analyzing MNase-seq and MNase-ChIP-seq datasets, the authors conclude that TASs are protected by an MNase-sensitive complex that is, at least in part, histone-based, and that single-copy and multi-copy genes display differential chromatin accessibility. Altogether, the data suggest a common chromatin landscape at TASs and imply that chromatin may modulate transcript maturation, adding a new regulatory layer to an unusual gene-expression system.

      I value integrative studies of this kind and appreciate the careful, consistent data analysis the authors implemented to extract novel insights. That said, several aspects require clarification or revision before the conclusions can be robustly supported. My main concerns are listed below, organized by topic/result section.

      TAS prediction

      * Why were TAS predictions derived only from insect-stage RNA-seq data? Restricting TAS calls to one life stage risks biasing predictions toward transcripts that are highly expressed in that stage and may reduce annotation accuracy for lowly expressed or stage-specific genes. Please justify this choice and, if possible, evaluate TAS robustness using additional transcriptomes or explicitly state the limitation.

      TAS predictions were derived only from insect-stage RNA-seq data because a previous study showed that there are no significant differences in 5'UTR processing between T. cruzi life stages (https://doi.org/10.3389/fgene.2020.00166). We are not testing an additional transcriptome here, because the robustness of the software was already tested in the original article where UTRme was described (Radio S, 2018 doi:10.3389/fgene.2018.00671).

      Results - "There is a distinctive average nucleosome arrangement at the TASs in TriTryps":

      * You state that "In the case of L. major the samples are less digested." However, Supplementary Fig. S1 suggests that replicate 1 of L. major is less digested than the T. brucei samples, while replicate 2 of L. major looks similarly digested. Please clarify which replicates you reference and correct the statement if needed.

      The reviewer has a good point. We based our statement on the position of the maximum peak in the length distribution of the sequenced DNA molecules, which in general is a good indicator of the extent of digestion achieved in a sample (Cole H, NAR, 2011).

      As the reviewer correctly points out, we should also have considered the length of the DNA molecules in each percentile. However, in this case both the T. brucei and L. major samples were gel purified before sequencing, and it is hard to know exactly which fragments were left behind in each case. Therefore, it is better not to overinterpret in that regard.

      We have now commented on this in the main manuscript, and we have clarified in the figure legends and in Table S1 which data set we used in each case.

      * It appears you plot one replicate in Fig. 1b and the other in Suppl. Fig. S2. Please indicate explicitly which replicate is in each plot. For T. brucei, the NDR upstream of the TAS is clearer in Suppl. Fig. S2 while the TAS protection is less prominent; based on your digestion argument, this should correspond to the more-digested replicate. Please confirm.

      The replicates used for the construction of each figure are explicitly indicated in Table S1. Although the table details the original publication, the project and the accession number for each data set, the reviewer is correct that it was still not completely clear which length distribution heatmap each sample was associated with. To avoid this confusion, we have now added the accession number for each data set to the figure legends and also clarified this in Table S1. Regarding the reviewer's comment on the correspondence between the observed TAS protection and the extent of sample digestion, he/she is correct that for a more digested sample we would expect a clearer NDR. In this case, the difference in the extent of digestion between these two samples is minor: the position of the main peak in the length distribution histogram of the sequenced DNA molecules is the same. These two samples, GSM5363006 (represented in Fig. 1b) and GSM5363007 (represented in Fig. S2), belong to the same original paper (Maree et al. 2017), and both were gel purified before sequencing. Therefore, any difference between them could result not only from a minor difference in the digestion level achieved in each experiment but also from a bias in the fragments included or excluded during gel purification. I would therefore not draw strong conclusions about TAS protection from this comparison. We have now included a brief comment on this in the discussion of the figure.

      * The protected region around the TAS appears centered on the TAS in T. brucei but upstream in L. major. This is an interesting difference. If it is technical (different digestion or TAS prediction offset), explain why; if likely biological, discuss possible mechanisms and implications.

      We appreciate the reviewer's suggestion. We cannot determine whether this is due to technical or biological reasons, but there is evidence that the L. major genome has a different dinucleotide content, which might have an impact on nucleosome assembly. We have now added a comment about this observation in the final discussion of the manuscript.

      Additionally, we analyzed recently published DRIP-seq data for L. major (doi: 10.1038/s41467-025-56785-y) and observed that the R-loop footprint co-localizes with the MNase-protected region upstream of the TAS (new S5 Fig), suggesting that the shift is not related to the MNase-seq technique.

      Results - "An MNase sensitive complex occupies the TASs in T. brucei": * The definition of "MNase activity" and the ordering of samples into Low/Intermediate/High digestion are unclear. Did you infer digestion levels from fragment distributions rather than from controlled experimental timepoints? In Suppl. Fig. S3a it is not obvious how "Low digestion" was defined; that sample's fragment distribution appears intermediate. Please provide objective metrics (e.g., median fragment length, fraction 120-180 bp) used to classify digestion levels.

      As the reviewer suggests, the ideal experiment would be to perform a time course of MNase reaction with all the samples in parallel, or to work with a fixed time point adding increasing amounts of MNase. However, even when making controlled experimental timepoints, you need to check the length distribution histogram of sequenced DNA molecules to be sure which level of digestion you have achieved.

      In this particular case, we used publicly available data sets for this analysis. We defined low, intermediate and high levels of digestion arbitrarily, not as absolute levels of digestion but as a comparative measure among the tested samples. We based our definition on the comparison of the main peak in the length distribution heatmaps, because this parameter is the best metric to estimate the level of digestion of a given sample. It represents the percentage of the total DNA sequenced that contains the predominant length in the sample tested. Hence, we considered:

      low digestion: when the main peak is longer than the expected protection for a nucleosome (longer than 150 bp). We expect this sample to contain additional longer bands that correspond to less digested material.

      intermediate digestion: when the main peak matches the expected nucleosome core protection (~146-150 bp).

      high digestion: when the main peak is shorter than that (shorter than 146 bp). This case is normally accompanied by a greater dispersion in fragment sizes.
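
As an illustration only, and not the pipeline used in the manuscript, the comparative rule above could be written as a small function that takes a list of sequenced fragment lengths and labels the sample by the position of its main peak. The function names, the tie-breaking rule and the extra `fraction_in_range` metric (one of the objective measures the reviewer suggested) are assumptions introduced for this sketch; only the 146-150 bp window comes from the definitions above.

```python
from collections import Counter


def main_peak_length(fragment_lengths):
    """Return the most frequent fragment length in bp.

    Ties are resolved toward the shortest length (an assumption of this sketch).
    """
    counts = Counter(fragment_lengths)
    best = max(counts.values())
    return min(length for length, n in counts.items() if n == best)


def classify_digestion(fragment_lengths, core_min=146, core_max=150):
    """Label a sample as 'low', 'intermediate' or 'high' digestion.

    The label depends only on where the main peak falls relative to the
    expected nucleosome core protection (core_min to core_max bp).
    """
    peak = main_peak_length(fragment_lengths)
    if peak > core_max:
        return "low"           # main peak longer than a nucleosome core
    if peak < core_min:
        return "high"          # main peak shorter than a nucleosome core
    return "intermediate"      # main peak at the core-protection size


def fraction_in_range(fragment_lengths, low=120, high=180):
    """Fraction of fragments whose length lies in [low, high] bp."""
    in_range = sum(1 for length in fragment_lengths if low <= length <= high)
    return in_range / len(fragment_lengths)
```

Because the label depends only on the mode of the distribution, two samples with the same main peak but different tails receive the same label, which is why a complementary metric such as the fraction of nucleosome-size fragments may be worth reporting alongside it.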

      For this analysis, we chose samples that show different degrees of MNase protection of the TAS when plotting all the sequenced DNA molecules relative to this point, and we used this protection as a predictor of the extent of sample digestion (Figure 2). To corroborate our hypothesis that the degree of TAS protection was indeed related to the extent of MNase digestion of a given sample, we looked at the length distribution histogram of the sequenced DNA molecules in each case. It is the best measurement of the extent of digestion achieved, especially when sequencing the whole sample without any gel purification and representing all the reads in the analysis, as we did. The only caveat concerns the sample called "intermediate digestion 1", which belongs to the original work of Maree 2017, since only this data set was gel purified. To avoid this problem, we decided to remove these data from Figures 2 and S3. In summary, the 3 remaining samples come from the same lab and belong to the same publication (Maree 2022). These samples are the inputs of native MNase-ChIP-seq experiments, were obtained in the same way, and are fully comparable with each other.

      * Several fragment distributions show a sharp cutoff at ~100-125 bp. Was this due to gel purification or bioinformatic filtering? State this clearly in Methods. If gel purification occurred, that can explain why some datasets preserve the MNase-sensitive region.

      The sharp cutoff is due neither to gel purification nor to bioinformatic filtering; it simply reflects the length of the paired-end reads used in each case. In earlier works it was most common to sequence only 50 bp; as technologies improved, read length went up to 75, 100 or 125 bp. We have now clarified in Table S1 the length of the paired-end reads used in each case when possible.

      * Please reconcile cases where samples labeled as more-digested contain a larger proportion of >200 bp fragments than supposedly less-digested samples; this ordering affects the inference that digestion level determines the loss/preservation of TAS protection. Based on the distributions I see, "Intermediate digestion 1" appears most consistent with an expected MNase curve - please confirm and correct the manuscript accordingly.

      As explained above, it is a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme, which preferentially cuts AT-rich sequences.

      The rationale is as follows: when you digest chromatin with MNase with the objective of mapping nucleosomes genome-wide, the ideal situation would be to obtain all the material in the mononucleosome band. MNase is less efficient at digesting protected DNA but, if the reaction proceeds further, it always ends up destroying part of it, so the result is always far from perfect. The best situation we can achieve is to obtain samples where ~80% of the material is contained in the mononucleosome band. And here comes the main point: even in the best scenario, you always get some additional longer bands, such as those for di- or tri-nucleosomes. If you keep digesting, you will get less than 80% in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start getting digested as well, producing a greater dispersion in fragment sizes. How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are the ones that were originally most highly protected by proteins or higher-order structure, or that have a low AT content, making their linker DNA extremely resistant to initial cleavage. Once the majority of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion. Hence, you end up observing a greater dispersion in length distributions in the final material. The bottom line is that it is not good practice to work with under- or over-digested samples. Our main point is to emphasize that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.

      Results - "The MNase sensitive complexes protecting the TASs in T. brucei and T. cruzi are at least partly composed of histones": * The evidence that histones are part of the MNase-sensitive complex relies on H3 MNase-ChIP signal in subnucleosomal fragment bins. This seems to conflict with the observation (Fig. 1) that fragments protecting TASs are often nucleosome-sized. Please reconcile these points: are H3 signals confined to subnucleosomal fragments flanking the TAS while the TAS itself is depleted of H3? Provide plots that compare MNase-seq and H3 ChIP signals stratified by consistent fragment-size bins to clarify this.

      What we have learned from other, deeply studied eukaryotic organisms such as yeast is that NDRs are normally generated at regulatory points in the genome. In this sense, yeast tRNA genes have a complex, formed by TFIIIC-TFIIIB, with a bootprint smaller than a nucleosome (Nagarajavel, doi: 10.1093/nar/gkt611). On the other hand, many promoter regions have an MNase-sensitive complex with a nucleosome-size footprint that does not contain histones (Chereji et al. 2017, doi:10.1016/j.molcel.2016.12.009). The reviewer is right that from Figures 1 and S2 we can observe that the footprint of whatever occupies the TAS region, especially in T. brucei, is nucleosome-size. However, this only shows the size; it does not prove the nature of its components. Those are only MNase-seq data sets and, since they do not include a precipitation with specific antibodies, we cannot confirm that the protecting complex is made up of histones. In parallel, a complementary study by Wedel 2017, from Siegel's lab, shows that when using a properly digested sample and further immunoprecipitating with α-H3 antibody, the TAS is not protected by nucleosomes, at least not when analyzing nucleosome-size DNA molecules. Besides, Briggs et al. 2018 (doi: 10.1093/nar/gky928) showed that, at least at intergenic regions, H3 occupancy goes down while R-loop accumulation increases. We have now added a new Figure 4 replotting R-loops and MNase-ChIP-seq for H3 relative to our predicted TAS, showing this anti-correlation and how it partly correlates with MNase protection as well. As a control, we show that the Rpb9 trend resembles that of H3, as Siegel's lab has shown in Wedel 2018. Moreover, we analyzed data from a recently published paper (doi: 10.1038/s41467-025-56785-y) and added a new supplemental Figure 5 showing that a similar correlation between MNase protection and R-loop footprint occurs in L. major (S5 Fig).

      * Please indicate which datasets are used for each panel in Suppl. Fig. S4 (e.g., Wedel et al., Maree et al.), and avoid calling data from different labs "replicates" unless they are true replicates.

      In most of our analyses we used true replicate experiments. Such is the case for the MNase-seq data used in Figure 1, with the corresponding replicate experiments used in Figure S2, and for the T. cruzi MNase-ChIP-seq data used in Figures 3b and 4a, with the respective replicates used in Figures S4 and S5 (now S6 in the revised manuscript). The only case in which we used experiments from two different laboratories is MNase-ChIP-seq for H3 from T. brucei. Unfortunately, there are only two public data sets, each from a different laboratory. The samples used in Fig. 3 come from Siegel's lab, whereas the H3 IP represented in S4 and S5 (S6 in the updated version) comes from another lab (Patterton's). To be more rigorous, we now call them data 1 and 2 when comparing these particular cases.

      The reviewer is right that in this particular case one is native chromatin (Patterton's) while the other is crosslinked (Siegel's). We have now clarified in the main text that unfortunately we do not have a replicate, but under both conditions the result remains the same. This is compatible with my own experience, where crosslinking does not affect global nucleosome patterns (compare nucleosome organization from crosslinked chromatin MNase-seq inputs in Chereji, Mol Cell, 2017 doi: 10.1016/j.molcel.2016.12.009 and native MNase-seq in Ocampo, NAR, 2016 doi: 10.1093/nar/gkw068).

      * Several datasets show a sharp lower bound on fragment size in the subnucleosomal range (e.g., ~80-100 bp). Is this a filtering artifact or a gel-size selection? Clarify in Methods and, if this is an artifact, consider replotting after removing the cutoff.

      We have only filtered adapter dimers or overrepresented sequences when needed. In Figures 2 and S3 we represented all the sequenced reads. In other figures, when we sort fragment sizes in silico (such as nucleosome, dinucleosome or subnucleosome size ranges), we note this in the figure legends. What the reviewer points out is related to the length of the sequenced DNA fragments in each experiment. As we explained above, the older data sets were generated with 50 bp paired-end reads, while the newer ones use 75, 100 or 125 bp. This information is now clarified in Table S1.

      Results - "The TASs of single and multi-copy genes are differentially protected by nucleosomes":

      * Please include T. brucei RNA-seq data in Suppl. Fig. S5b as you did for T. cruzi.

      We showed chromatin organization for T. brucei in the previous S5b to illustrate that there is a similar trend. Unfortunately, we did not obtain a robust list of multi-copy genes for T. brucei as we did for T. cruzi; therefore, we do not want to overinterpret by showing the RNA-seq for these subsets of genes. The limitation is related to the fact that UTRme restricts the search and is extremely strict when calling sites in repetitive regions. Additionally, following the request of one reviewer, we have now changed the UTR predictions for T. brucei using a different RNA-seq data set from Lister 427 (detailed in the Methods section). Given that with the new predictions it was even harder to obtain the list of multi-copy genes for T. brucei, we decided to remove that figure in the updated version of the manuscript.

      * Discuss how low or absent expression of multigene families affects TAS annotation (which relies on RNA-seq) and whether annotation inaccuracies could bias the observed chromatin differences.

      The mapping of occurrence and annotations that belong to repetitive regions has great complexity. UTRme is specially designed to avoid overcalling those sites. In other words, there is a chance that we could be underestimating the number of predicted TASs at multi-copy genes. Regarding the impact on chromatin analysis, we cannot rule out that it might have an impact, but the observation favors our conclusion, since even when some TASs at multi-copy genes can remain elusive, we observe more nucleosome density at those places.

      * The statement that multi-copy genes show an "oscillation" between AT and GC dinucleotides is not clearly supported: the multi-copy average appears noisier and is based on fewer loci. Please tone down this claim or provide statistical support that the pattern is periodic rather than noisy.

      We have fixed this now in the preliminary revised version

      * How were multi-copy genes defined in T. brucei? Include the classification method in Methods.

      This classification was done the same way as explained for T. cruzi. However, we decided to remove the supplemental figure that included this sorting.

      Genomes and annotations: * If transcriptomic data for the Y strain was used for T. cruzi, please explain why a Y strain genome was not used (e.g., Wang et al. 2021 GCA_015033655.1), or justify the choice. For T. brucei, consider the more recent Lister 427 assembly (Tb427_2018) from TriTrypDB. Use strain-matched genomes and transcriptomes when possible, or discuss limitations.

      The most appropriate way to analyze high-throughput data is to align it to the genome of the strain in which the experiments were conducted. This was clearly illustrated in a previous publication from our group, where we explained how data from the hybrid CL Brener strain should be analyzed. A common practice in the past was to use only the Esmeraldo-like genome for simplicity, but this resulted in output artifacts. Therefore, we aligned the data to the CL Brener genome and then focused the main analysis on the Esmeraldo haplotype (Beati, PLoS ONE, 2023). Ideally, we would have had transcriptomic data for the same strain (CL Brener or Esmeraldo). Since this was not the case at that moment, we used data from the Y strain, which belongs to the same DTU as Esmeraldo.

      In the case of T. brucei, when we started our analysis and the software code for UTRme was written, only the previous version of the genome was available. When the 2018 version came out, we checked chromatin parameters and observed that it did not change the main observations. Therefore, we continued working with our previous setup.

      Reproducibility and broader integration: * Please share the full analysis pipeline (ideally on GitHub/Zenodo) so the results are reproducible from raw reads to plots.

      We are preparing a full pipeline on GitHub. We will make it available before the full revision of the manuscript.

      * As an optional but helpful expansion, consider including additional datasets (other life stages, BSF MNase-seq, ATAC-seq, DRIP-seq) where available to strengthen comparative claims.

      We are now including a new Figure 4 and a supplemental Figure 5 with DRIP-seq and Rpb9 ChIP-seq for T. brucei (revised Fig 4) and DRIP-seq for L. major (S5 Fig). Additionally, we added FAIRE-seq data to the previous Fig 4, now Fig 5 (revised Fig 5C).

      We are analyzing ATAC-seq data for T. brucei.

      Regarding BSF MNase-seq, the original article by Maree 2017 claims that there is no significant difference in average chromatin organization between the two life forms; therefore, it is not worth including that analysis.

      Optional analyses that would strengthen the study: * Stratify single-copy genes by expression (high / medium / low) and examine average nucleosome occupancy at TASs for each group; a correlation between expression and NDR depth would strengthen the functional link to maturation.

      We have now included a panel in supplemental Figure 5 (now revised S6), showing the concordance of chromatin organization relative to the TAS for genes stratified by RNA-seq levels.
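
For readers who want to reproduce this kind of stratification, a minimal sketch is given below. It splits genes into three equally sized expression groups (terciles); the tercile cut-offs, the function name and the gene identifiers are assumptions made for illustration, since the response does not state how the high / medium / low groups were defined.

```python
def stratify_by_expression(expression):
    """Assign each gene to 'low', 'medium' or 'high' by expression tercile.

    expression: dict mapping gene ID -> expression value (e.g. TPM).
    Genes are ranked by value (ties broken by gene ID) and the ranked list
    is cut into three groups of as equal size as possible.
    """
    ranked = sorted(expression, key=lambda gene: (expression[gene], gene))
    n = len(ranked)
    labels = {}
    for i, gene in enumerate(ranked):
        if 3 * i < n:
            labels[gene] = "low"
        elif 3 * i < 2 * n:
            labels[gene] = "medium"
        else:
            labels[gene] = "high"
    return labels


# Hypothetical expression values for six single-copy genes.
example = {"g1": 1.0, "g2": 5.0, "g3": 10.0, "g4": 50.0, "g5": 100.0, "g6": 500.0}
groups = stratify_by_expression(example)
```

Average nucleosome occupancy around the TAS can then be computed separately for the genes in each group and the three profiles overlaid.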

      Minor / editorial comments: * In the Introduction, the sentence "transcription is initiated from dispersed promoters and in general they coincide with divergent strand switch regions" should be qualified: such initiation sites also include single transcription start regions.

      We have clarified this in the preliminary revised version

      * Define the dotted line in length distribution plots (if it is not the median, please clarify) and consider placing it at 147 bp across plots to ease comparison.

      The dotted line is just to indicate where the maximum peak is located. It is now clarified in figure legends.

      * In Suppl. Fig. 4b "Replicate2" the x-axis ticks are misaligned with labels - please fix.

      We have now fixed the figure. Thanks for noticing this mistake.

      * Typo in the Introduction: "remodellingremodeling" → "remodeling"

      Thanks for noticing this mistake, it is fixed in the current version of the manuscript

      **Referee cross-commenting** Comment 1: I think Reviewer #2 and Reviewer #3 missed that the authors of this manuscript do cite and consider the results from Wedel et al. 2017. They even re-analysed their data (e.g. Figure 3a). I second Reviewer #2's comment that the inclusion of a schematic figure to help readers visualize and better understand the findings would be an important addition.

      Comment 2: I agree with Reviewer #3 that the use of different MNase digestion procedures in the different datasets has to be considered. On the other hand, I don't think there is a problem with Figure 1 showing an MNase-protected TAS for T. brucei, as it is based on MNase-seq data and reproduces the reported results (Maree et al. 2017). What the Siegel lab did in Wedel et al. 2017 was MNase-ChIP-seq of H3 showing nucleosome depletion at the TAS, but the two results are not necessarily contradictory: there could still be something else (which does not contain H3) sitting on the TAS and protecting it from MNase digestion.

      Reviewer #1 (Significance (Required)):

      This study provides a systematic comparative analysis of chromatin landscapes at trans-splicing acceptor sites (TASs) in trypanosomatids, an area that has been relatively underexplored. By re-analyzing and harmonizing existing MNase-seq and MNase-ChIP-seq datasets, the authors highlight conserved and divergent features of nucleosome occupancy around TASs and propose that chromatin contributes to the fidelity of transcript maturation. The significance lies in three aspects: 1. Conceptual advance: It broadens our understanding of gene regulation in organisms where transcription initiation is unusual and largely constitutive, suggesting that chromatin can still modulate post-transcriptional processes such as trans-splicing. 2. Integrative perspective: Bringing together data from T. cruzi, T. brucei and L. major provides a comparative framework that may inspire further mechanistic studies across kinetoplastids. 3. Hypothesis generation: The findings open testable avenues about the role of chromatin in coordinating transcript maturation, the contribution of DNA sequence composition, and potential interactions with R-loops or RNA-binding proteins. Researchers in parasitology, chromatin biology, and RNA processing will find it a useful resource and a stimulus for targeted experimental follow-up.

      My expertise is in gene regulation in eukaryotic parasites, with a focus on bioinformatic analysis of high-throughput sequencing data

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Siri et al. perform a comparative analysis using publicly available MNase-seq data from three trypanosomatids (T. brucei, T. cruzi, and Leishmania), showing that a similar chromatin profile is observed at TAS (trans-splicing acceptor site) regions. The original studies had already demonstrated that the nucleosome profile at TAS differs from the rest of the genome; however, this work fills an important gap in the literature by providing the most reliable cross-species comparison of nucleosome profiles among the tritryps. To achieve this, the authors applied the same computational analysis pipeline and carefully evaluated MNase digestion levels, which are known to influence nucleosome profiling outcomes.

      In my view, the main conclusion is that the profiles are indeed similar, even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. The manuscript could be improved with some clarifications and adjustments:

      1. The authors state from the beginning that available MNase data indicate altered nucleosome occupancy around the TAS. However, they could also emphasize that the conclusions across the different trypanosomatids are inconsistent and even contradictory: NDR in T. cruzi versus protection (in different locations) in T. brucei and Leishmania.

      We start our manuscript by referring to the first publicly available MNase-seq data sets for each TriTryp, and we point out that one of the main observations in each of them is a change in nucleosome density or occupancy at intergenic regions. In T. cruzi, in a previous publication from our group, we established that this intergenic drop in nucleosome density occurs near the trans-splicing acceptor site. In this work, we extend our study to the other members of the TriTryps: T. brucei and L. major.

      In T. brucei, the papers from Patterton's lab and Siegel's lab came out almost simultaneously in 2017. Hence, they do not comment on each other's work. The first one claims the presence of a well-positioned nucleosome at the TAS using MNase-seq, while the second one shows an NDR at the TAS using MNase-ChIP-seq. However, we do not think they are contradictory or inconsistent. We brought them together throughout the manuscript because we think these works provide complementary information.

      On one hand, we infer that the data from Patterton's lab are slightly less digested than the sample from Siegel's lab. Therefore, we argue that this moderate digestion must be the reason why they managed to detect a complex protecting the TAS from MNase (Figure 1). On the other hand, Siegel's lab includes an additional step by performing MNase-ChIP-seq, showing that when analyzing nucleosome-size fragments, histones are not detected at the TAS. Here, we go further in this analysis in Figure 3, showing that only when looking at subnucleosome-size fragments can we detect histone H3. This is also true for T. cruzi.

      By integrating every analysis in this work and the previous ones, we propose that TASs are protected by an MNase-sensitive complex (shown in Figure 2). This complex is most likely only partly formed by histones, since we can detect histone H3 only when analyzing subnucleosome-size DNA molecules (Figure 3). To be sure that the complex is not entirely made up of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs, 2018 doi: 10.1093/nar/gky928) and that R-loops have plenty of interacting proteins (Girasol, 2023 doi: 10.1093/nar/gkad836). Therefore, this MNase-sensitive complex most likely has a hybrid nature, made up of H3 and some other regulatory molecules, possibly involved in trans-splicing. We have now added a new Figure 4 showing R-loop co-localization with the NDR.

      Regarding the comparison between different organisms, after explaining the MNase sensitivity of the TAS-protecting complex, we discuss that when comparing equally digested samples, T. cruzi and T. brucei display a similar chromatin landscape with a mild NDR at the TAS (see T. cruzi in Figure 1 compared to T. brucei in "Intermediate digestion 2" in Figure 2, labeled "intermediate digestion" in the revised manuscript). Unfortunately, we cannot make a good comparison with L. major, since we do not have a sample with a similar level of digestion. However, by analyzing a recently published DRIP-seq data set for L. major, we show that the R-loop signal co-localizes with MNase protection in a similar way (new S5 Fig).

      Another point that requires clarification concerns what the authors mean in the introduction and discussion when they write that trypanosomes have "...poorly organized chromatin with nucleosomes that are not strikingly positioned or phased." On the other hand, they also cite evidence of organization: "...well-positioned nucleosome at the spliced-out region.. in Leishmania (ref 34)"; "...a well-positioned nucleosome at the TASs for internal genes (ref37)"; "...a nucleosome depletion was observed upstream of every gene (ref 35)." Aren't these examples of organized chromatin with at least a few phased nucleosomes? In addition, in ref 37, figure 4 shows at least two (possibly three to four) nucleosomes that appear phased. In my opinion, the authors should first define more precisely what they mean by "poorly organized chromatin" and clarify that this interpretation does not contradict the findings highlighted in the cited literature.

      For a better understanding of nucleosome positioning and phasing, I recommend the review Clark 2010 (doi:10.1080/073911010010524945), Figure 4. Briefly, in a cell population there are different alternative positions that a given nucleosome can adopt. However, some are more favorable. When talking about favorable positions, we refer to the coordinates in the genome that are most likely covered by a nucleosome and are predominant in the cell population. Additionally, nucleosomes can be phased or not. This refers not only to the position in the genome but also to the distance relative to a given point. In yeast, or in highly transcribed genes of more complex eukaryotes, nucleosomes are regularly spaced and phased relative to the transcription start site (TSS) or to the +1 nucleosome (Ocampo, NAR, 2016, doi:10.1093/nar/gkw068). In trypanosomes, nucleosomes show some regular distribution on browser inspection but, given that they are not properly phased with respect to any point, it is almost impossible to estimate spacing from paired-end data. This is also consistent with chromatin that is transcribed in an almost constitutive manner.

      As the reviewer mentions, we do cite evidence of organization. We think the original observations are correct, but we do not fully agree with some of the original statements. In this manuscript our aim is to take the best we learned from those original works and to make a constructive contribution to the original discussions. In this regard, in trypanosomes there are some conserved patterns in the chromatin landscape, but their nucleosomes are far from being well-positioned or phased. For a better understanding, compare the variations observed on the y axis when representing average nucleosome occupancy in yeast with those observed in trypanosomes: the troughs and peaks are much more prominent in yeast than those observed in any TriTryp member.

      Following the reviewer's suggestion we have now clarified this in the main text.

      The paper would also benefit from the inclusion of a schematic figure to help readers visualize and better understand the findings. What is the biological impact of having nucleosomes, di-nucleosomes, or sub-nucleosomes at TAS? This is not obvious to readers outside the chromatin field. For example, the following statement is not intuitive: "We observed that, when analyzing nucleosome-size (120-180 bp) DNA molecules or longer fragments (180-300 bp), the TASs of either T. cruzi or T. brucei are mostly nucleosome-depleted. However, when representing fragments smaller than a nucleosome-size (50-120 bp) some histone protection is unmasked (Fig. 3 and Fig. S4). This observation suggests that the MNase sensitive complex sitting at the TASs is at least partly composed of histones." Please clarify.

      We appreciate the reviewer's suggestion to make a schematic figure. We have now added a new Figure 6.

      Regarding the biological impact of having mono-, di- or sub-nucleosome fragments, it is important to determine the fragment size of the protected DNA in order to infer the nature of the protecting complex. In the case of tRNA genes in yeast, footprints smaller than a nucleosome were found at pol III promoters and turned out to be TFIIIB-TFIIIC (Nagarajavel, doi: 10.1093/nar/gkt611). Therefore, detecting something smaller than a nucleosome might suggest the binding of trans-acting factors other than histones, or of a mixed complex that includes histones. Such mixed complexes have also been observed, as in the case of the centromeric nucleosome, which has a very peculiar composition (Ocampo and Clark, Cell Reports, 2015). On the other hand, if we instead detect bigger fragments, this could indicate the presence of bigger protecting molecules, or that those regions are part of a higher-order chromatin organization whose linkers are still inaccessible to MNase digestion.

      Here we show on 2D plots that the complex or components protecting the TAS are nucleosome-sized, but we cannot be sure they are made up entirely of histones, since we are able to detect histone H3 only when looking at sub-nucleosome-size fragments. We have now added part of this explanation to the discussion.

      By integrating every analysis in this work and the previous ones, we propose that the TAS is protected by an MNase-sensitive complex (Figure 2). This complex is most likely only partly formed by histones, since we can detect histone H3 only when analyzing sub-nucleosome-size DNA molecules (Figure 3). As explained above, to be sure that the complex is not made up entirely of histones, future studies should perform MNase-ChIP-seq with less digested samples. However, it was previously shown that R-loops are enriched at those intergenic NDRs (Briggs 2018) and that R-loops have plenty of interacting proteins (Girasol, 2023). Therefore, this MNase-sensitive complex most likely has a hybrid nature, made up of H3 and some other regulatory molecules. We have now added a new Figure 4 showing partial co-localization of R-loops with MNase protection.

      Some references are missing or incorrect:

      We will make a thorough revision.

      "In trypanosomes, there are no canonical promoter regions." - please check Cordon-Obras et al. (Navarro's group).

      Thank you for the appropriate suggestion. We have now added this reference.

      Please, cite the study by Wedel et al. (Siegel's group), which also performed MNase-seq analysis in T. brucei.

      We understand that Reviewer #2 missed that we cited this reference and that we did use the raw data from Wedel et al. 2017, from Siegel's group. We used the MNase-ChIP-seq data set of histone H3 in our analysis for Figures 3, S4 and S6 (in the revised version), as also detailed in Table S1. To be even more explicit, we have now included the accession number of each data set in the figure legends.

      Figure-specific comments: Fig. S3: Why does the number of larger fragments increase with greater MNase digestion? Shouldn't the opposite be expected?

      This is a good observation. As we also explained to reviewer #1:

      It's a common observation in MNase digestion of chromatin that more extensive digestion can still result in a broad range of fragment sizes, including some longer fragments. This seemingly counter-intuitive result is primarily due to the non-uniform accessibility of chromatin and the sequence preference of the MNase enzyme.

      The rationale is as follows: when you digest chromatin with MNase in order to map nucleosomes genome-wide, the ideal outcome would be to recover all the material in the mononucleosome band. MNase digests protected DNA less efficiently but, if the reaction proceeds further, it always ends up destroying part of it, so the result is always far from perfect. The best situation we can achieve is to obtain samples in which ~80% of the material is contained in the mononucleosome band. And here comes the main point: even in the best scenario, you always have some additional longer bands, such as those for di- or tri-nucleosomes. If you keep digesting, you will get less than 80% in the nucleosome band, and the remaining DNA fragments that used to contain di- and tri-nucleosomes start getting digested as well, producing a bigger dispersion in fragment sizes.

      How do we explain the persistence of long fragments? The longest fragments (di-, tri-nucleosomes) that persist in a highly digested sample are the ones that were originally most highly protected by proteins or higher-order structure, making their linker DNA extremely resistant to initial cleavage. Once most of the genome is fragmented, these few resistant longer fragments become a more visible component of the remaining population, contributing to a broader size dispersion. Hence, you end up with a bigger dispersion in the length distribution of the final material. Bottom line, it is not good practice to work with under- or over-digested samples. Our main point is to emphasize that, especially when comparing samples, it is important to compare those with comparable levels of digestion. Otherwise, a different sampling of the genome will be represented in the remaining sequenced DNA.

      Minor points:

      There are several typos throughout the manuscript.

      Thanks for the observation. We will check carefully.

      Methods: "Dinucelotide frecuency calculation."

      We will add the code to GitHub.

      Reviewer #2 (Significance (Required)):

      In my view, the main conclusion is that the profiles are indeed similar, even when comparing T. brucei and T. cruzi. This was not clear in previous studies (and even appeared contradictory, reporting nucleosome depletion versus enrichment) largely due to differences in chromatin digestion across these organisms. Audience: basic science and specialized readers.

      Expertise: epigenetics and gene expression in trypanosomatids.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The authors analysed publicly accessible MNase-seq data in TriTryps parasites, focusing on the chromatin structure around trans-splicing acceptor sites (TASs), which are vital for processing gene transcripts. They describe a mild nucleosome depletion at the TAS of T. cruzi and L. major, whereas a histone-containing complex protects the TASs of T. brucei. In the subsequent analysis of T. brucei, they suggest that an MNase-sensitive complex is localised at the TASs. For single-copy versus multi-copy genes, the authors show different di-nucleotide patterns and chromatin structures. Accordingly, they propose this difference could be a novel mechanism to ensure the accuracy of trans-splicing in these parasites.

      Before providing an in-depth review of the manuscript, I note that some missing information would have helped in assessing the study more thoroughly; however, in the light of the available information, I provide the following comments for consideration.

      The numbering of the figures, including the figure legends, is missing in the PDF file. This is essential for assessing the provided information.

      We apologize for not including the figure numbers in the main text, although the figures are located in the right place where they are called in the text. The omission was made unintentionally when the figure legends were moved to the bottom of the main text. This is now fixed in the updated version of the manuscript.

      The publicly available MNase-seq data are manyfold, with multiple datasets available for T. cruzi, for example. It is unclear from the manuscript which dataset was used for which figure. This must be clarified.

      This was detailed in Table S1. We have now replaced the table with an improved version, and we have also included the accession number of each data set used in the figure legends.

      Why do the authors start in figure 1 with the description of an MNase-protected TAS for T. brucei, given that it has been clearly shown by the Siegel lab that there is a nucleosome depletion similar to other parasites?

      We did not want to ignore the paper from Patterton's lab because it was the first to map nucleosomes genome-wide in T. brucei, and its main finding was the existence of a well-positioned nucleosome at intergenic regions, which we thought was a point worth discussing. Patterton's work uses MNase-seq from gel-purified samples and provides replicated experiments sequenced at very good depth, whereas Siegel's lab uses MNase-ChIP-seq of histone H3 but performs only one experiment, and its input was not sequenced. So, each work has its own caveats and provides different information, and together they contribute to a more comprehensive study. We think that bringing both data sets into the discussion, as we have done in Figures 1 and 3, enriches the discussion for us and for the community working in the field.

      If the authors re-analyse the data, they should compare their pipeline to those used in the other studies, highlighting differences and potential improvements.

      We are working on this point. We will provide a more detailed description in the final revision.

      Since many figures resemble those in already published studies, there seems little reason to repeat and compare without a detailed comparison of the pipelines and their differences.

      Following the reviewer's advice, we are now working on highlighting the main differences that justify analyzing the data the way we did; these will be added to the final revised Methods section.

      At first glance, some of the figures in the original manuscripts might look similar to ours. However, a careful and detailed reading of our manuscript shows that we have added several analyses that unveil information not disclosed before.

      First, we perform a systematic comparison, analyzing every data set the same way from beginning to end; the main difference from previous studies is the thorough and precise prediction of TASs for the three organisms. Second, we represent the average chromatin organization relative to those predicted TASs for TriTryps and discuss their global patterns. Third, by representing the average chromatin organization as heatmaps, we show for the very first time that those average nucleosome landscapes are not just an average: a similar organization is kept across most of the genome. This was not done in any of the previous manuscripts except for our own (Beati, PLOS One 2023). Additionally, we introduce the discussion of how the extent of the MNase reaction can affect the output of these experiments, and we show 2D plots and length distribution heatmaps to discuss this point (a point completely ignored in all the chromatin literature for trypanosomes). Furthermore, we made a far-reaching analysis by considering the contributions of each published work, even when addressed by different techniques. Finally, we discuss our findings in the context of a topic of current interest in the field, TriTryps genome compartmentalization.

      Several previous MNase-seq analysis studies addressing chromatin accessibility emphasized the importance of using varying degrees of chromatin digestion, from low to high digestion (30496478, 38959309, 27151365).

      The reviewer is correct, and this point is exactly what we intended to illustrate in Figure 2. We appreciate the reviewer suggesting these references, which we are now citing in the final discussion. Just to clarify, using varying degrees of chromatin digestion is useful for drawing conclusions about a given organism, but when comparing samples, strains, histone marks, etc., it is extremely important to select samples with similar levels of digestion.

      No information on the extent of DNA hydrolysis is provided in the original MNase-seq studies. This key information cannot be inferred from the length distribution of the sequenced reads.

      The reviewer is correct that "No information on the extent of DNA hydrolysis is provided in the original Mnase-seq studies", and this is another reason why it is important that our analysis be published and discussed by the scientific community working on trypanosomes. We disagree with the reviewer's second statement, since the level of digestion of a sequenced sample is in fact tested by representing the length distribution of the total DNA sequenced. It is true that before sequencing you can, and should, check the level of digestion of the purified samples in an agarose gel and/or on a Bioanalyzer. It can also be tested after library preparation, but before sequencing, where the fragment sizes are expected to be increased by the addition of the library adapters. But the final test of success when working with MNase-digested samples is to analyze the length of the DNA molecules by plotting histograms of the length distribution of the sequenced DNA molecules. Remarkably, on occasion different samples might look very similar when run in a gel but render different length distribution histograms. This is because the nucleosome core could be intact while the linker DNA associated with it has undergone differential trimming, or the core may even have been chewed inside (see Cole Hope 2011, section 5.2, doi: 10.1016/B978-0-12-391938-0.00006-9, for a detailed explanation).
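As a rough illustration of the length-distribution check described above, the sketch below builds a normalized fragment-length histogram from a list of fragment lengths (for paired-end data, the distance between the two mate ends) and summarizes it by its modal length and the share of fragments in a nucleosome-size window. The function names, the 300 bp cap, and the toy input are assumptions made for this example; it is not the authors' pipeline.

```python
from collections import Counter


def length_histogram(fragment_lengths, max_len=300):
    """Return {length: fraction of fragments} for lengths in 1..max_len.

    Fragments longer than max_len are ignored (max_len is an assumed cap).
    """
    counts = Counter(n for n in fragment_lengths if 0 < n <= max_len)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: counts[length] / total for length in sorted(counts)}


def modal_length(hist):
    """Most frequent fragment length in the histogram."""
    return max(hist, key=hist.get)


def fraction_in_range(hist, low, high):
    """Fraction of fragments with low <= length < high."""
    return sum(frac for length, frac in hist.items() if low <= length < high)


if __name__ == "__main__":
    # Toy lengths standing in for one sequenced sample (hypothetical data).
    sample = [147, 147, 150, 160, 95, 210, 147, 130]
    hist = length_histogram(sample)
    print("modal length:", modal_length(hist))
    print("fraction in 120-180 bp:", fraction_in_range(hist, 120, 180))
```

Computing the same two summaries for each sample gives a simple way to judge whether their digestion levels are comparable before their nucleosome maps are compared.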

      The input material consists of selected, in part gel-purified, mononucleosomal DNA bands. Furthermore, the datasets are not directly comparable, as some use native MNase, while others employ MNase after crosslinking; some involve short digestion times at 37 °C, while others involve longer digestion at lower temperatures. Combining these datasets to support the idea of an MNase-sensitive complex at the TAS of T. brucei therefore may not be appropriate, and additional experiments using consistent methodologies would strengthen the study's conclusions.

      In my opinion, describing an MNase-sensitive complex based solely on these data is not feasible. It requires specifically designed experiments using a consistent method and well-defined MNase digestion kinetics.

      As the reviewer suggests, the ideal experiment would be to perform a time course of the MNase reaction with all the samples in parallel, or to work with a fixed time point while adding increasing amounts of MNase. However, the information obtained from a detailed analysis of the length distribution histogram of the sequenced DNA molecules is the best test of the real outcome. In fact, the samples with different digestion levels were probably not generated on purpose.

      The only data sets that were gel purified are those from Mareé 2017 (Patterton's lab), used in Figures 1, S1 and S2, and those from L. major shown in Fig. 1. Gel purification was common practice during those years; we have since learned that it is not necessary, since we can sort fragment sizes later in silico when needed.

      As we explained to reviewer #1, to avoid this conflict we decided to remove these data from Figures 2 and S3. In summary, the 3 remaining samples come from the same lab and belong to the same publication (Mareé 2022). These samples are the inputs of native MNase-ChIP-seq, obtained the same way, and are fully comparable with each other.
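The in silico size sorting mentioned above can be sketched as follows, using the fragment windows quoted earlier from the manuscript (50-120 bp, 120-180 bp, 180-300 bp). Treating each window as half-open, so that a boundary length falls into exactly one class, is an assumption of this example, as are the function names and the toy lengths.

```python
# Size windows quoted from the manuscript text; half-open [low, high) is assumed.
SIZE_CLASSES = [
    ("sub-nucleosome", 50, 120),
    ("nucleosome", 120, 180),
    ("longer", 180, 300),
]


def classify_fragment(length):
    """Return the size class of a fragment, or None if outside all windows."""
    for name, low, high in SIZE_CLASSES:
        if low <= length < high:
            return name
    return None


def sort_by_size(fragment_lengths):
    """Group fragment lengths by size class, dropping unclassified fragments."""
    bins = {name: [] for name, _, _ in SIZE_CLASSES}
    for length in fragment_lengths:
        name = classify_fragment(length)
        if name is not None:
            bins[name].append(length)
    return bins


if __name__ == "__main__":
    # Hypothetical fragment lengths from one paired-end library.
    lengths = [90, 147, 150, 200, 400, 119, 180]
    for name, members in sort_by_size(lengths).items():
        print(name, len(members))
```

In a real analysis the same classification would be applied to aligned fragments rather than bare lengths, so that each size class can be mapped separately around the TAS.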

      Reviewer #3 (Significance (Required)):

      Due to the lack of controlled MNase digestion, use of heterogeneous datasets, and absence of benchmarking against previous studies, the conclusions regarding MNase-sensitive complexes and their functional significance remain speculative. With standardized MNase digestion and clearly annotated datasets, this study could provide a valuable contribution to understanding chromatin regulation in TriTryps parasites.

      As we explained in the previous point, our conclusions are valid since we do not compare, in any figure, samples coming from different treatments. The only exception could be Figure 3, regarding MNase-ChIP-seq. We have now added a clear and explicit comment in that section and in the discussion stating that, despite subtle differences in experimental procedures, we arrive at the same results. This is the case for the T. cruzi IP, run from crosslinked chromatin, compared to the T. brucei IP, run from native chromatin.

      Over the years, the chromatin field has observed that nucleosomes are so tightly bound to DNA that crosslinking is not necessary. However, it is still common practice, especially when performing IPs. In our own hands, we did not observe any difference at the global level either in T. cruzi (unpublished) or in my previous work with yeast (comparing nucleosome organization from crosslinked-chromatin MNase-seq inputs, Chereji, Mol Cell, 2017 doi:10.1016/j.molcel.2016.12.009, with native MNase-seq from Ocampo, NAR, 2016 doi: 10.1093/nar/gkw068).

    1. Author response:

      Reviewer #1:

      Comment 1: The authors use a confusing timeline for their behavioral experiments, i.e., day 1 is the first day of training in the MWM, and day 6 is the probe trial, but in reality, day 6 is the first day after the last training day. So this is really day 1 post-training, and day 20 is 14 days post-training.

      We thank this reviewer for pointing out the issue of the behavioral timeline. We will revise the behavioral timeline as suggested by this reviewer. Days 1–5 will be labeled as “Training phase day 1–5”. Day 6 will be labeled as the “Day 1 post-training” and Day 20 will be labeled as the “Day 14 post-training”.

      Comment 2: The authors inaccurately use memory as a term. During the training period in the MWM, the animals are learning, while memory is only probed on day 6 (after learning). Thus, day 6 reflects memory consolidation processes after learning has taken place.

      We will revise the manuscript to distinguish between "learning" and "memory." We will refer to the performance during the 5-day training period as "spatial learning" and restrict the term "memory" to the probe tests on Day 6, which reflect memory processes after learning has taken place.

      Comment 3: The NAT10 cKO mice are useful... but all the experiments used AAV-CRE injections in the dorsal hippocampus that showed somewhat modest decreases... For these experiments, it would be better to cross the NAT10 floxed animals to CRE lines where a better knockdown of NAT10 can be achieved, with less variability.

      We want to clarify the reason for using AAV-Cre injection rather than Cre lines. Indeed, we attempted to generate Nat10 conditional knockouts by crossing Nat10<sup>flox/flox</sup> mice with several CNS-specific Cre lines. Crossing with Nestin-Cre and Emx1-Cre resulted in embryonic and premature lethality, respectively, consistent with the essential housekeeping function of NAT10 during neurodevelopment. We are currently using the Camk2α-Cre line, which starts to express Cre about 3 weeks after birth, specifically in hippocampal pyramidal neurons (Tsien et al., 1996).

      Comment 4: Because knockdown is only modest (~50%), it is not clear if the remaining ac4c on mRNAs is due to remaining NAT10 protein or due to an alternative writer (as the authors pose).

      Our results suggest the existence of alternative writers. As shown in Figure 6D, we identified a population of "NAT10-independent" MISA mRNAs (present in MISA but not downregulated in NASA). Remarkably, these mRNAs possess a consensus motif (RGGGCACTAACY) that is fundamentally different from the canonical NAT10 motif (AGCAGCTG). This distinct motif usage suggests that the residual ac4C signals are not merely due to incomplete knockdown of NAT10, but reflect the activity of other, as-yet-unidentified ac4C writers. Nonetheless, we think that generating a Nat10 knockout line with complete loss of NAT10 protein would be useful to address this reviewer's concern.

      Reviewer #2:

      Comment 1: It is known that synaptosomes are contaminated with glial tissue... So the candidate mRNAs identified by acRIP-seq might also be mixed with glial mRNAs. Are the GO BP terms shown in Figure 3A specifically chosen, or unbiasedly listed for all top ones?

      It is true that some ac4C-mRNAs identified by acRIP-seq from the synaptosomes are highly expressed in astrocytes, such as Aldh1l1, ApoE, Sox9 and Aqp4 (Table S3, Fig. S6H). In agreement, we found that NAT10 was also expressed in astrocytes in addition to neurons. We will show a representative image of the expression of NAT10-Cre in astrocytes in the revised MS. The BP items shown in Fig. 3A were chosen from the top 30 and are highly related to synaptic plasticity and memory. We will show the full list of significant BP items for MISA in the revised MS.

      Comment 2: Where does NAT10-mediated mRNA acetylation take place within cells generally? Is there evidence that NAT10 can catalyze mRNA acetylation in the cytoplasm?

      Previous studies in non-neuronal cells showed that NAT10 can catalyze mRNA acetylation in the cytoplasm and enhance translational efficiency (Arango et al., 2018; Arango et al., 2022). In this study, we showed that mRNA acetylation occurred both in the homogenates and in the synapses (see ac4C-mRNA lists in Tables S2 and S3). However, spatial memory upregulated mRNA acetylation mainly in the synapses rather than in the homogenates (Fig. 2 and Fig. S2).

      Comment 3: "The NAT10 proteins were significantly reduced in the cytoplasm (S2 fraction) but increased in the PSD fraction..." The small increase in synaptic NAT10 might not be enough to cause a decrease in soma NAT10 protein level.

      We showed that NAT10 protein levels were increased by one-fold in the PSD fraction, but were reduced by about 50% in the cytoplasm after memory formation (Fig. 5J and K). The protein levels of NAT10 in the homogenates and nucleus were not altered after memory formation (Fig. 5F and I). Based on these results, we hypothesized that NAT10 proteins may relocate from the cytoplasm to synapses after memory formation, which was also supported by the immunofluorescence results from cultured neurons (Fig. S4). However, we agree with this reviewer that drawing such a conclusion may require time-lapse imaging of NAT10 protein trafficking in living animals, which is technically challenging at this moment.

      Comment 4: It is difficult to separate the effect on mRNA acetylation and protein mRNA acetylation when doing the loss of function of NAT10.

      This is a good point. We agree with this reviewer that NAT10 may acetylate both mRNA and proteins. We examined the acetylation levels of α-tubulin and histone H3, two substrate proteins of NAT10, in the hippocampus of Nat10 cKO mice. As shown in Fig. S5C, E, and F, the acetylation levels of α-tubulin and histone H3 remained unchanged in the Nat10 cKO mice, likely due to compensation by other protein acetyltransferases. In contrast, mRNA ac4C levels were significantly decreased in the Nat10 cKO mice (Figure S5G–H). These results suggest that the memory deficits seen in Nat10 cKO mice may be largely due to impaired mRNA acetylation. Nonetheless, we believe that developing a new technology that enables selective erasure of mRNA acetylation would help to address the function of mRNA acetylation. We discussed these points in the MS (lines 585-592).

      References

      Arango, D., Sturgill, D., Alhusaini, N., Dillman, A. A., Sweet, T. J., Hanson, G., Hosogane, M., Sinclair, W. R., Nanan, K. K., & Mandler, M. D. (2018). Acetylation of cytidine in mRNA promotes translation efficiency. Cell, 175(7), 1872-1886.e1824.

      Arango, D., Sturgill, D., Yang, R., Kanai, T., Bauer, P., Roy, J., Wang, Z., Hosogane, M., Schiffers, S., & Oberdoerffer, S. (2022). Direct epitranscriptomic regulation of mammalian translation initiation through N4-acetylcytidine. Molecular Cell, 82(15), 2797-2814.e2711.

      Tsien, J. Z., Chen, D. F., Gerber, D., Tom, C., Mercer, E. H., Anderson, D. J., Mayford, M., Kandel, E. R., & Tonegawa, S. (1996). Subregion- and cell type–restricted gene knockout in mouse brain. Cell, 87(7), 1317-1326.

    1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


      Reply to the reviewers

      Manuscript number: RC-2025-03195R

      Point-by-Point Response to Reviewers

      We thank the reviewers for their thoughtful and constructive evaluations, which have helped us substantially improve the clarity, rigor, and balance of our manuscript. We are grateful for their recognition that our integrated ATAC-seq and RNA-seq analyses provide a valuable and technically sound contribution to understanding soxB1-2 function and regenerative neurogenesis in planarians.

      We have carefully addressed the reviewers' major points as follows:

      1. Direct versus indirect regulation by SoxB1-2: In the revision, we explicitly acknowledge the limitations of inferring direct regulation from our current datasets and have revised statements throughout the Results and Discussion to emphasize that our findings are correlative.
      2. Evidence for pioneer activity: Although the pioneer role of SoxB1 transcription factors is well established in other systems, we agree that additional binding or motif data would be required to formally demonstrate SoxB1-2 pioneer function. Accordingly, we performed motif analysis and revised the text throughout to frame SoxB1-2's proposed role as consistent with, rather than demonstrating, transcriptional activator activity.
      3. Motif enrichment and downstream regulatory interactions: In response to Reviewer #1's suggestion, we have included a new motif enrichment analysis in the supplement to contextualize possible co-regulators within the SoxB1-2 network.
      4. Data reproducibility and peak-calling consistency: We have included sample correlations and peak overlaps for ATAC-seq samples in the revision, providing a clearer assessment of reproducibility.
      5. Clarification of co-expression and downstream targets: We included co-expression plots for soxB1-2 with mecom and castor in the supplemental materials. These plots were generated from previously published scRNA-seq data and demonstrate that cells expressing soxB1-2 also express mecom and castor.

      We appreciate the reviewers' recognition that our methods are rigorous and our data accessible. We have incorporated all major revisions suggested and believe they have strengthened the manuscript's precision, interpretations, and conclusions. Below, we respond to each comment in detail.

      Reviewer #1 (Evidence, reproducibility and clarity (Required)):

      Summary

      The authors of this interesting study take the approach of combining RNAi, RNA-seq and ATAC-seq to try to build a regulatory network surrounding the function of a planarian SoxB1 ortholog, broadly required for neural specification during planarian regeneration. They find a number of chromatin regions that are differentially accessible (measured by ATAC-seq) and associate these with potential genes by proximity to the TSS. They then compare this set of genes with those that are differentially regulated (using RNA-seq) after SoxB1 RNAi-mediated knockdown. This allows the authors to focus on potential directly regulated targets of the planarian SoxB1. Two of these downstream targets, the mecom and castor transcription factors, are then studied in greater detail.

      Major Comments

      I have no suggestions for new experiments that fit sensibly with the scope of the current work. There are other analyses that could be appropriate with the ATAC-seq data, but may not make sense in the context of SoxB1 acting as a pioneer factor.

      I would like to see motif enrichment analysis under the set of peaks to see if SoxB1 is opening chromatin for a restricted set of other transcription factors to then bind. Much of this could be taken from Neiro et al, eLife 2022 (which also used ATAC-seq and matched planarian TF families to likely binding motifs). This could add some breadth to the regulatory network. It could be revealing, for example, if downstream TFs also help regulate other targets that SoxB1 makes available; this is a pattern often seen in cell specification (as I am sure the authors are aware). Alternatively, it may reveal other candidate regulators.

      Thank you for this suggestion. We agree with the reviewers that this analysis should be done. We ran the motif enrichment analysis using the same methods as outlined in Neiro et al. eLife, 2022. We have included a new motif enrichment analysis in the supplement to contextualize possible co-regulators within the SoxB1-2 network.

      Overall peak-calling consistency across ATAC-seq samples would be useful to report as well, to give readers an idea of noise in the data. What was the correlation between samples?

      Excellent point. In response to this comment, we ran a Pearson correlation test on replicates within the gfp and soxB1-2 RNAi groups to get an idea of the overall correlation between replicates. Additionally, we calculated the percent overlap of peaks for biological replicates and between treatment groups.
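The two reproducibility measures named above can be illustrated with a minimal sketch: a Pearson correlation between per-region signal values of two replicates, and the percentage of peaks in one set that overlap at least one peak in another. The response does not state which software was used, so the helper names, the half-open interval convention, and the toy values here are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def percent_overlap(peaks_a, peaks_b):
    """Percent of peaks in A that overlap at least one peak in B.

    Peaks are (chrom, start, end) tuples with half-open coordinates (assumed).
    """
    if not peaks_a:
        return 0.0
    hits = 0
    for chrom, start, end in peaks_a:
        if any(c == chrom and s < end and start < e for c, s, e in peaks_b):
            hits += 1
    return 100.0 * hits / len(peaks_a)


if __name__ == "__main__":
    # Hypothetical read counts over the same regions in two replicates.
    rep1 = [10, 52, 33, 8, 71]
    rep2 = [12, 49, 30, 11, 65]
    print("Pearson r:", round(pearson(rep1, rep2), 3))

    # Hypothetical peak calls from two replicates.
    peaks1 = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 10, 90)]
    peaks2 = [("chr1", 150, 260), ("chr2", 300, 400)]
    print("percent of rep1 peaks overlapping rep2:", percent_overlap(peaks1, peaks2))
```

Note that percent overlap is not symmetric: the value depends on which peak set is used as the denominator, so both directions are usually worth reporting.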

      While it is logical to focus on downregulated genes, it would also be interesting to look at upregulated genes in some detail. In simple terms would we expect to see the representation of an alternate set of fate decisions being made by neoblast progeny?

      This is also an important point that we considered but initially did not pursue it due to the lack of tools to test upregulated gene function. However, the reviewer is correct that this is straightforward to perform computationally. Thus, we have performed Gene Ontology analysis on the upregulated genes in all RNA-seq datasets (soxB1-2 RNAi, mecom RNAi, and castor RNAi). Both mecom and castor datasets did not reveal enrichment within the upregulated portion of the dataset. Genes upregulated after soxB1-2 RNAi were enriched for metabolic, xenobiotic detoxification, potassium homeostasis, and endocytic programs. Rather than indicating a shift toward alternative lineages, including non-ectodermal fates, these signatures are consistent with stress-responsive and homeostatic programs activated following loss of soxB1-2. We did not detect enrichment patterns strongly associated with alternative cell fates. We conclude that this analysis does not formally exclude potential shifts in lineage-specific transcriptional programs, but does support our hypothesis that soxB1-2 functions as a transcriptional activator.

      Can the authors be explicit about whether they have evidence for co-expression of SoxB1/castor and SoxB1/mecom? I could not find this clearly, and it would be important to be clear whether this basic piece of evidence is in place or not at this stage.

      We included co-expression plots for soxB1-2 with mecom and castor in the supplemental material. These plots were generated from previously published scRNA-seq data and demonstrate that cells expressing soxB1-2 also express mecom and castor. We have not done experiments showing co-expression via in situ at this time.

      Minor comments

      Formally, loss of castor and mecom expression does not mean these cells are absent; strictly, demonstrating cell absence needs an independent method. It might be useful to clarify this with evidence, or to be clear that the cells are "very probably" not produced.

      We agree that loss of castor and mecom expression does not formally demonstrate the physical absence of these cells, and that independent methods would be required to definitively confirm their loss. In response, we have revised our wording to indicate that castor- and mecom-expressing cells are very likely not being produced, rather than stating that they are absent.

      Reviewer #1 (Significance (Required)):

      Significance

      Strengths and limitations.

      The precise exploitation of the planarian system to identify potential targets, and therefore regulatory mechanisms, mediated by SoxB1 is an interesting contribution to the field. We know almost nothing about the regulatory mechanisms that allow regeneration and how these might have evolved, and this work is a well-executed step in that direction.

      Advance

      The paper makes a clear advance in our understanding of an important process in animals (neural specification) and how this happens in the context of an example of animal regeneration. The methods are state-of-the-art with respect to what is possible in the planarian system.

      Audience

      This will be of wide interest to developmental biologists, particularly those studying regeneration in planarians and other regenerative systems, and those who study comparative neurodevelopment.

      Expertise

      I have expertise in functional genomics in the context of stem cells and regeneration, particularly in the planarian model system.

      Reviewer #2 (Evidence, reproducibility and clarity (Required)):

      Review - Cathell, et al (RC-2025-03195)

      Summary and Significance:

      Understanding regenerative neurogenesis has been difficult due to the limited amount of neurogenesis that occurs after injury in most animal species. Planarians, with their adult neurogenesis and robust post-injury response, allow us to get a glimpse into regenerative neurogenesis. The Zayas laboratory previously revealed a key role for SoxB1-2 in maintenance and regeneration of a broad set of sensory and peripheral neurons in the planarian body. SoxB1-2 also has a role in many epidermal fates. Their previous work left open the tempting possibility that SoxB1-2 acts as a very upstream regulator of epidermal and neuronal fates, potentially acting as a pioneer transcription factor within these lineages. In the manuscript currently under review, Cathell and colleagues use ATAC-Seq and RNA-Seq to investigate chromatin changes after SoxB1-2(RNAi). With the experimental limitations in planarians, this is a strong first step toward testing their hypothesis that SoxB1-2 acts as a pioneer within a set of planarian lineages. Beyond these cell types, this work is also important because planarian cell fates often rely on a suite of transcription factors, but the nature of transcription factor cooperation has been much less well understood. Indeed, the authors do show that loss of SoxB1-2 by RNAi causes changes in a number of accessible regions of the genome; many of these chromatin changes correspond to changes in expression of genes near these peaks. The authors also examine in more detail two genes that have genomic and transcriptomic changes after SoxB1-2(RNAi), mecom and castor. The authors completed RNA-Seq on mecom(RNAi) and castor(RNAi) animals, identifying genes downregulated after loss of either factor that are also seen in SoxB1-2(RNAi). The results in this paper are rigorous and very well presented. I will share two major limitations of the study and some suggestions for addressing them, but this work may also be acceptable without those changes at some journals.

      Limitation 1:

      The paper aims to test the hypothesis that SoxB1-2 is a pioneer transcription factor. Observation that SoxB1-2(RNAi) leads to loss of many accessible regions in the chromatin supports the hypothesis. However, an alternate possibility is that SoxB1-2 leads to transcription of another factor that is a pioneer factor or a chromatin remodeling enzyme; in either of these cases, the accessibility peak changes may not be due to SoxB1-2 directly but due to another protein that SoxB1-2 promotes. The authors describe how they can address this limitation in the future; in the meantime, is it known what the likely binding for SoxB1-2 would be (experimentally or based on homology)? If so, could the authors examine the relative abundance of SoxB1-2 binding sites in peaks that change after SoxB1-2(RNAi)? This could be compared to the abundance of the same binding sequence in non-changing peaks. Enrichment of SoxB1-2 binding sites in ATAC peaks that change after its RNAi would support the argument that chromatin changes are directly due to SoxB1-2.

      We appreciate the feedback and agree that distinguishing between direct SoxB1-2 pioneer activity and indirect effects mediated through downstream regulators is an important consideration. While we did not perform a direct abundance analysis of potential chromatin-remodeling cofactors, we conducted a motif enrichment analysis following the approach of Neiro et al. (eLife, 2022), comparing control and soxB1-2(RNAi) peak sets. This analysis revealed that Sox-family motifs, particularly SoxB1-like motifs, were among the most enriched in regions that remain accessible in control animals relative to soxB1-2(RNAi) animals, consistent with a model in which SoxB1-2 directly contributes to establishing or maintaining accessibility at these loci. We have now included this analysis in the supplemental materials to further contextualize potential co-regulators and transcriptional partners within the SoxB1-2 regulatory network. We agree and acknowledge in the report that future studies assessing chromatin remodeling factor expression and abundance will be valuable to definitively separate direct and indirect pioneer activity.
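
      The comparison discussed above (motif occurrence in peaks that change after RNAi versus peaks that do not) can be sketched as a one-sided hypergeometric test. This is a minimal illustration only, not the analysis performed in the manuscript or in Neiro et al.: the regular expression is a placeholder Sox-like core pattern, and the peak sequences and function names are hypothetical.

```python
import re
from math import comb

# Placeholder Sox-like core pattern, for illustration only; it is NOT the
# motif model used in the manuscript.
MOTIF = re.compile(r"[AT][AT]CAA[AT]G")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def has_motif(seq):
    """Return True if the pattern occurs on either strand of seq."""
    seq = seq.upper()
    revcomp = seq.translate(_COMPLEMENT)[::-1]
    return bool(MOTIF.search(seq) or MOTIF.search(revcomp))


def motif_enrichment(changed, unchanged):
    """One-sided hypergeometric test for motif enrichment in changed peaks.

    Returns (motif hits in changed peaks, motif hits overall, p-value), where
    the p-value is P(X >= hits) when len(changed) peaks are drawn at random
    from the combined peak set.
    """
    k = sum(has_motif(s) for s in changed)          # hits among changed peaks
    K = k + sum(has_motif(s) for s in unchanged)    # hits among all peaks
    n = len(changed)                                # number of changed peaks
    N = n + len(unchanged)                          # total number of peaks
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return k, K, tail / comb(N, n)


# Hypothetical toy sequences standing in for peak regions.
changed_peaks = ["GGAACAATGCC", "TTTACAAAGGG", "CCCCCCCC"]
unchanged_peaks = ["GGGGCCCC", "ACGTACGT", "CCGGTTAACC", "GCGCGCGC"]
hits, total_hits, p_value = motif_enrichment(changed_peaks, unchanged_peaks)
print(hits, total_hits, round(p_value, 4))
```

      A real analysis would typically use position weight matrices and a background matched for peak length and GC content; the exact-match scan here only illustrates the logic of the comparison.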

      Limitation 2:

      The characterization of mecom and castor is somewhat preliminary relative to the deep work in the rest of the paper. I think this could be addressed with a few experiments. The authors could validate RNA-seq findings with ISH to show that cells are lost after reduction of either TF (this would support the model figure). The authors could also try to define whether loss of either TF causes behavioral phenotypes that might be similar to SoxB1-2(RNAi); this would be a second line of evidence that the TFs are downstream of key events in the SoxB1-2 pathway.

      Thank you for this suggestion. We agree that additional validation of the mecom and castor RNA-seq results and further phenotypic characterization would strengthen this section. We are currently conducting in situ hybridization experiments to validate transcriptional changes in mecom and castor using the same experimental framework applied to soxB1-2 downstream candidates. We anticipate completing these studies within the next three months and will incorporate the results into future work.

      Regarding behavioral phenotypes, we performed preliminary screening for robust behavioral responses, including mechanosensory responses, but did not observe overt defects. However, the lack of established, standardized behavioral assays in planarians presents a current limitation; such assays need to be developed de novo, and predicting specific behavioral phenotypes in advance remains challenging. We fully agree that functional behavioral assays represent an important next step and are actively exploring strategies to systematically develop and implement them going forward.

      Other questions or comments for the authors:

      Is it known how other Sox factors work as pioneer TFs? Are key binding partners known? I wondered if it would be possible to show that SoxB1-2 is co-expressed with the genes that encode these partners and/or if RNAi of these factors would phenocopy SoxB1-2. This is likely beyond the scope of this paper, but if the authors wanted to further support their argument about SoxB1-2 acting as a pioneer in planarians, this might be an additional way to do it.

      In other systems, Sox pioneer factors often act together with POU family transcription factors (for example, Oct4 and Brn2) and PAX family members such as Pax6. In planarians, a POU homolog (pou-p1) is expressed in neoblasts and may represent an interesting candidate co-factor for future investigation in the context of SoxB1-2 pioneer activity. We have also previously examined the relationship between SoxB1-2 and the POU family transcription factors pou4-1 and pou4-2. Although RNAi of these factors does not fully phenocopy soxB1-2 knockdown, pou4-2(RNAi) results in loss of mechanosensation, suggesting that downstream POU factors may contribute to aspects of neural function regulated by SoxB1-2 (McCubbin et al. eLife 2025). We agree that co-expression and functional interaction studies with these candidates would be highly informative, and we view this as an exciting future direction beyond the scope of the current manuscript.

      This paper is one of few to use ATAC-Seq in planarians. First, I think the authors should make a bigger deal of their generation of a dataset with this tool! Second, it would be great to know whether the ATAC-Seq data (controls and/or RNAi) will be browsable in any planarian databases or in a new website for other scientists. I believe that in addition to the data being used to test hypotheses about planarians, the data could also be a huge hypothesis generating resource in the planarian community, so I would encourage the authors to both self-promote their contribution and make plans to share it as widely and usably as possible.

      Thank you very much for this encouraging feedback. We appreciate the suggestion and have strengthened the text to emphasize the significance of generating this ATAC-seq resource for the planarian field. We agree that these datasets represent a valuable community resource and are committed to making all control and soxB1-2(RNAi) ATAC-seq data publicly accessible.

      Reviewer #2 (Significance (Required)):

      This paper's strengths are that it addresses an important problem in regenerative biology in a rigorous manner. The writing and presentation of the data are excellent. The paper also provides excellent datasets that will be very useful to other researchers in the field. Finally, the work is one of the first, if not the first, to examine how the action of one transcription factor in planarians leads to changes in the cellular and chromatin environment that could then be acted upon by subsequent factors. This is an important contribution to the planarian field, but also one that will be useful for other developmental neuroscientists and regenerative biologists.

      I described a couple of limitations in the review above, but the strengths outweigh the weaknesses.

      Reviewer #3 (Evidence, reproducibility and clarity (Required)):

      The authors investigated the role of soxB1-2 in planarian neural and epidermal lineage specification. Using ATAC-seq and RNA-seq from head fragments after soxB1-2 RNAi, they identified regions of decreased chromatin accessibility and reduced gene expression, demonstrating that soxB1-2 induces neural and sensory programs. Integration of the datasets yielded 31 overlapping candidate targets supported by both ATAC-seq and RNA-seq. Downstream analyses of transcription factors that had either a differentially accessible regulatory region or differential expression (castor and mecom) implicated these transcription factors in mechanosensory and ciliary modules. The authors combined additional techniques, such as in situ hybridization, to support the observations based on the ATAC-seq/RNA-seq data. The manuscript is clearly written, and the data are clearly presented in the main and supplementary figures. The major claim of the manuscript is that SoxB1-2 is likely a pioneer transcription factor that alters the accessibility of the chromatin, which, if true, would be one of the first demonstrations of direct transcriptional regulation in planarians. As described below, I am not certain that this interpretation of the data is more valid than alternative interpretations.
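
      The dataset integration summarized above amounts to intersecting genes linked to differentially accessible peaks with differentially expressed genes. A minimal sketch follows, assuming peaks have already been assigned to genes; the assignment rule, the identifiers, and the function name are hypothetical and are not taken from the manuscript.

```python
def overlap_candidates(peak_to_gene, da_peaks, de_genes):
    """Genes that have an assigned differentially accessible (DA) peak
    and are also differentially expressed (DE)."""
    genes_with_da_peak = {peak_to_gene[p] for p in da_peaks if p in peak_to_gene}
    return sorted(genes_with_da_peak & set(de_genes))


# Hypothetical identifiers for illustration only.
peak_to_gene = {"peak1": "geneA", "peak2": "geneB", "peak3": "geneC"}
da_peaks = ["peak1", "peak3", "peak9"]   # peak9 has no assigned gene
de_genes = ["geneC", "geneD", "geneA"]
print(overlap_candidates(peak_to_gene, da_peaks, de_genes))
```

      An overlap of this kind identifies candidates only; it does not by itself separate direct targets from indirect effects or from changes in cell-type proportions.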

      Major comments

      1. Direct vs. indirect regulation. The current analysis does not distinguish between direct and indirect soxB1-2 targets; therefore, this analysis cannot indicate whether soxB1-2 functions as a pioneer transcription factor. ATAC-seq and RNA-seq, as performed here, do not determine whether reduced accessibility or downregulation of gene expression represents a change within existing cells or a reduction in the proportion of specific cell types in the libraries produced. This limitation should be explicitly recognized where causal statements are made. In fact, several pieces of information strongly suggest that indirect effects are abundant in the data: (1) the observed loss of accessibility and gene expression in late epidermal progenitors likely represents indirect effects, indicating that within the timeframe of the experiment, it is impossible (using these techniques) to distinguish between the scenarios. (2) The finding that castor knockdown reduces soxB1-2 expression likely reflects population loss rather than direct regulation, given overlapping expression domains. This further illustrates the difficulty in inferring directionality from such datasets. In order to provide evidence for a more direct association between soxB1-2 and the differentially accessible chromatin regions, a sequence (e.g., motif) analysis would be required. Other approaches to infer direct regulation would have been useful, but they are not available in planarians to the best of my knowledge.

      We agree that distinguishing between direct SoxB1-2 pioneer activity and indirect chromatin changes mediated by downstream factors is an important consideration. As suggested, examining the enrichment of SoxB1-2 binding motifs in regions that lose accessibility following soxB1-2(RNAi) can provide supporting evidence for direct regulation.

      While we did not conduct a direct abundance analysis of all potential chromatin-remodeling cofactors, we performed a motif enrichment analysis following the methodology of Neiro et al. (eLife, 2022), comparing control-specific and soxB1-2(RNAi)-specific accessible peak sets. Consistent with a direct role for SoxB1-2 in chromatin regulation, Sox-family motifs, particularly SoxB1-like motifs, were among the most significantly enriched in regions that maintain accessibility in control animals relative to soxB1-2(RNAi) animals.

      Evidence for pioneer activity. The authors correctly acknowledge that they do not present direct evidence of soxB1-2 binding or chromatin opening. However, the section title in the Discussion could be interpreted as implying otherwise. The claim of pioneer activity should remain explicitly tentative until supported (at least) by motif or binding data.

      We have performed the suggested motif analysis and changed the language in this section to better fit the data.

      Replication and dataset comparability. Both ATAC-seq and soxB1-2 RNA-seq were performed on head fragments, but the number of replicates differs between assays (ATAC-seq n=2 per group, RNA-seq n=4-6). This is of course acceptable, but when interpreting the results, it should be taken into consideration that statistical power differs between datasets collected with different techniques and different numbers of replicates.

      Thank you for raising this important point regarding replication and comparability across datasets. We agree that the differing number of biological replicates between the ATAC-seq and RNA-seq experiments results in different statistical power across assays. We have now clarified this consideration in the manuscript text.

      Minor comments

      "Thousands of accessible chromatin sites". Please state the number of peaks and the thresholds for calling them. Ensure consistency between text (264 DA peaks) and Figure 1 legend (269 DA peaks).

      We have clarified specific peak numbers and will include the calling parameters in the methods section. Additionally, we will fix the discrepancy in the reported number of differential peaks.

      Specify the y-axis normalization units in all coverage plots.

      We have specified this across plots.

      Clarify replicate numbers consistently in the text and figure legends.

      We have identified and corrected discrepancies between the figure legends and the text, and ensured that replicate numbers are reported consistently across datasets.

      Referees cross commenting

      The reviews are highly consistent. They recognize the value of the work, and raise similar points. The main shared view is that the current data do not distinguish direct from indirect effects, and claims about pioneer activity should be softened, and further analysis of the differentially accessible peaks could strengthen the link between SoxB1-2 and the chromatin changes.

      - I don't think that it is necessary to further characterize mecom or castor experimentally (as suggested), although of course it could have value.

      We thank all three reviewers for their positive assessment of the value of our work aiming to elucidate mechanisms by which SoxB1-2 programs planarian stem cells. In the revision, we have improved the presentation and carefully edited conclusions about the function of SoxB1-2. Performing motif analysis and GO annotation of upregulated genes has strengthened our observation that SoxB1-2 acts as an activator and has revealed putative binding sites.

      The preliminary revision does not yet include further characterization of mecom and castor downstream genes. In response to Reviewer #2, we appreciate that additional validation of the mecom and castor RNA-seq results and further phenotypic characterization would strengthen this section. Although we are currently conducting in situ hybridization experiments to validate transcriptional changes in mecom and castor using the same experimental framework applied to soxB1-2 downstream candidates, we also reconsidered, as we did in our first revision, whether this is necessary or better suited for future investigations.

      In the revision, we noted that our Discussion points were not balanced and that we emphasized the mecom and castor results in a manner that distracted from the major focus of the work, likely contributing to the impression that additional experimental evidence was required. Therefore, we have revised the section accordingly and streamlined the Discussion to avoid repetitive statements and to focus on the insights gained into the mechanism of SoxB1-2 function in planarian neurogenesis. We remain open to including these additional experiments if the reviewers or handling editors consider them essential; however, we agree that their inclusion is not absolutely necessary.

      Reviewer #3 (Significance (Required)):

      General assessment. The study offers valuable observations by combining chromatin and transcriptional analysis of planarian neural differentiation. The integration with in situ validation convincingly demonstrates effects on neural tissues and provides a solid resource for future functional work. However, mechanistic interpretation remains limited, partly because of technical limitations of the system. The data support an important role for soxB1-2 in neural and epidermal lineage regulation, but not direct binding or chromatin-opening activity. The authors have previously published analysis of soxB1-2 in planarians, so the addition of ATAC-seq data contributes to solving another piece of the puzzle.

      Advance.

      This is one of the first studies to couple ATAC-seq and RNA-seq in planarian tissue to dissect regulatory logic during regeneration. It identifies new candidate regulators of sensory and epidermal differentiation and identifies soxB1-2 as a likely upstream factor in ectodermal lineage networks. The work extends previous studies on soxB1-2 activity and neural cell production by integrating chromatin and transcriptional layers. In that respect the results are very solid, although the study remains correlative at the mechanistic level.

      Audience.

      This work will potentially interest researchers studying regeneration and transcriptional networks. The datasets and gene lists will be valuable references for follow-up studies on planarian ectodermal lineages, and will therefore appeal to this community.